Tokenizers

Whitespace

In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the common whitespace symbol U+0020 SPACE (also ASCII 32) represents a blank space punctuation character in text, used as a word divider in Western scripts.
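As a small illustration of the definition above, the following Python sketch lists a few whitespace characters and shows the simplest possible tokenization, splitting on runs of whitespace. The sample characters and the sample sentence are chosen here for illustration only and do not come from Datasaur or wink-tokenizer.

```python
import unicodedata

# A few whitespace characters: space, tab, line feed, no-break space.
samples = ["\u0020", "\t", "\n", "\u00a0"]

for ch in samples:
    # unicodedata.name() has no entry for control characters such as tab,
    # so a fallback label is supplied.
    name = unicodedata.name(ch, "<control>")
    print(f"U+{ord(ch):04X} {name}: isspace={ch.isspace()}")

# str.split() with no arguments splits on runs of whitespace,
# which is the most basic form of word-level tokenization.
words = "Hello   world\tfrom\nDatasaur".split()
print(words)
```

Splitting on whitespace alone leaves punctuation attached to words, which is one reason a pattern-based tokenizer is used instead.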

Wink-tokenizer

By default, Datasaur uses wink-tokenizer to split text into individual tokens, which are then grouped into lines, one per sentence.

"Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS." (source: https://winkjs.org/about.html)

How does it work? wink-tokenizer matches the text against a set of regular expressions (regex). The full list of patterns is documented at https://winkjs.org/wink-tokenizer/Tokenizer.html#defineConfig. In summary, it recognizes the following token types, all enabled by default unless noted otherwise:

| Name | Description |
| --- | --- |
| currency | such as $ or £ symbols (r) |
| email | for example john@acme.com or superman1@gmail.com (e) |
| emoji | any standard Unicode emoji, e.g. ✅ or ❌ (1 UTF-16 unit), 😊 or 😂 or 🎉 (2 UTF-16 units), 🇮🇩 or 🇺🇸 (4 UTF-16 units) (j) |
| emoticon | common emoticons such as :-) or :D (c) |
| hashtag | hashtags such as #happy or #followme (h) |
| number | any integer, decimal number or fraction, such as 19, 2.718 or 1/4, and numerals containing ", - / .", for example 12-12-1924 (n) |
| ordinal | ordinals such as 1st, 2nd, 3rd, 4th, 12th or 91st (o) |
| punctuation | common punctuation such as ? or , (the token itself becomes the fingerprint) |
| quoted_phrase | any "quoted text" in the sentence; disabled by default (q) |
| symbol | for example ~ or + or & or % or / (the token itself becomes the fingerprint) |
| time | common representations of time such as 4pm or 16:00 hours (t) |
| mention | @mention as on GitHub or Twitter (m) |
| url | URL such as https://github.com (u) |
| word | word such as faster or résumé or prévenir (w) |

The letter in parentheses is the code that stands for that token type in wink-tokenizer's token fingerprint.
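The UTF-16 unit counts quoted for emoji can be checked with a short sketch. The counts presumably matter because wink-tokenizer runs on JavaScript, whose strings are indexed in UTF-16 code units, so an emoji can occupy more than one position in a token offset. The helper name `utf16_units` is hypothetical and introduced here only for illustration.

```python
def utf16_units(text: str) -> int:
    """Return the number of UTF-16 code units needed to encode text."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids the 2-byte BOM
    # that the plain "utf-16" codec would prepend.
    return len(text.encode("utf-16-le")) // 2


examples = {
    "check mark": "\u2705",                  # ✅, inside the Basic Multilingual Plane
    "smiling face": "\U0001F60A",            # 😊, needs a surrogate pair
    "flag": "\U0001F1EE\U0001F1E9",          # 🇮🇩, two regional indicator symbols
}

for label, emoji in examples.items():
    # len() counts Unicode code points in Python, not UTF-16 units.
    print(label, emoji, "UTF-16 units:", utf16_units(emoji), "code points:", len(emoji))
```

Because Python counts code points while JavaScript counts UTF-16 units, character offsets computed in one language may need converting before they are compared with offsets from the other.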

To learn more details about the regex patterns used, you can directly view the source code at https://github.com/winkjs/wink-tokenizer/blob/master/src/wink-tokenizer.js
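To give a rough feel for regex-driven tokenization, here is a minimal Python sketch that tries a list of named patterns in priority order and tags each match. It is written in Python purely for illustration: the patterns, the `PATTERNS` list and the `tokenize` function are simplified assumptions and are not the regexes or the API of wink-tokenizer, which is a NodeJS package.

```python
import re

# Illustrative patterns only; NOT the regexes wink-tokenizer actually uses.
# Order matters: at each position the first alternative that matches wins,
# so more specific token types are listed before more general ones.
PATTERNS = [
    ("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("url", r"https?://\S+"),
    ("mention", r"@\w+"),
    ("hashtag", r"#\w+"),
    ("ordinal", r"\d+(?:st|nd|rd|th)\b"),
    ("number", r"\d+(?:[.,/-]\d+)*"),
    ("currency", r"[$£€¥]"),
    ("word", r"[^\W\d_]+"),
    ("punctuation", r"[.,;:!?()\[\]{}-]"),
    ("symbol", r"[~+&%/*=<>^|]"),
]

# Combine everything into one alternation of named groups.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS))


def tokenize(text):
    """Return a list of {'value', 'tag'} dicts; unmatched characters are skipped."""
    return [
        {"value": match.group(), "tag": match.lastgroup}
        for match in MASTER.finditer(text)
    ]


for token in tokenize("Email john@acme.com about the 2nd invoice: $19.99 #urgent"):
    print(token["tag"], token["value"])
```

Whitespace matches none of the patterns, so it acts only as a separator here. A production tokenizer needs far more careful patterns than these, which is why the source code linked above is the authoritative reference.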

Playground

You can play around with the tokenizer here.
