Tokenizers

Whitespace

In computer programming, whitespace is any character or series of characters that represents horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but it typically does occupy an area on a page. For example, the common whitespace character U+0020 SPACE (also ASCII 32) represents a blank space in text and is used as a word divider in Western scripts.
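Because whitespace acts as a word divider, the simplest possible tokenizer just splits text on runs of whitespace. The snippet below is a minimal sketch of that idea in JavaScript; `whitespaceTokenize` is a hypothetical helper name used only for illustration and is not part of Datasaur or wink-tokenizer.

```javascript
// Hypothetical helper: split text into tokens on runs of whitespace.
function whitespaceTokenize(text) {
  // \s matches whitespace characters such as U+0020 SPACE, tabs, and line breaks.
  // filter(Boolean) drops the empty strings produced by leading or trailing whitespace.
  return text.split(/\s+/).filter(Boolean);
}

const tokens = whitespaceTokenize("  Hello,\tworld!\nHow are you? ");
console.log(tokens);
// tokens: ["Hello,", "world!", "How", "are", "you?"]
```

Note that punctuation stays attached to the neighboring word ("Hello," and "you?"), which is one reason a pattern-based tokenizer is often preferred over plain whitespace splitting.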

Wink-tokenizer

By default, Datasaur uses wink-tokenizer to split text into individual tokens, which are then grouped into lines of sentences.

"Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS." (source: https://winkjs.org/about.html)

How does it work? wink-tokenizer uses pattern matching with regular expressions (regex). The full list of regexes it uses is documented at https://winkjs.org/wink-tokenizer/Tokenizer.html#defineConfig. In summary, it recognizes the following token types by default (the letter in parentheses is the code that represents the type in a token fingerprint):

Name: Description
currency: such as $ or £ symbols (r)
email: for example john@acme.com or superman1@gmail.com (e)
emoji: any standard unicode emojis e.g. 😊 or 😂 or 🎉 (j)
emoticon: common emoticons such as :-) or :D (c)
hashtag: hash tags such as #happy or #followme (h)
number: any integer, decimal number, fractions such as 19, 2.718 or 1/4 and numerals containing ", - / .", for example 12-12-1924 (n)
ordinal: ordinals like 1st, 2nd, 3rd, 4th or 12th or 91st (o)
punctuation: common punctuation such as ? or , (token becomes fingerprint)
quoted_phrase: any "quoted text" in the sentence. Note: its default value is false. (q)
symbol: for example ~ or + or & or % or / (token becomes fingerprint)
time: common representation of time such as 4pm or 16:00 hours (t)
mention: @mention as in github or twitter (m)
url:
word: word such as faster or résumé or prévenir (w)

For the exact regex patterns used, see the source code at https://github.com/winkjs/wink-tokenizer/blob/master/src/wink-tokenizer.js
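To make the pattern-matching approach concrete, here is a rough, self-contained sketch of a regex-driven tokenizer in JavaScript. It is not wink-tokenizer's implementation: the `PATTERNS` table, the `tokenize` function, the `unknown` tag, and every regex below are simplified assumptions written for illustration, and they cover only a few of the token types listed above. The idea it shows is that patterns are tried in order at the current position, and the first one that matches decides the token's tag.

```javascript
// Hypothetical, simplified pattern table (NOT wink-tokenizer's actual regexes).
// Order matters: more specific patterns are listed before more general ones.
// The "y" (sticky) flag makes each regex match only at regex.lastIndex.
const PATTERNS = [
  { tag: "email",       regex: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/y },
  { tag: "mention",     regex: /@\w+/y },
  { tag: "hashtag",     regex: /#\w+/y },
  { tag: "ordinal",     regex: /\d+(?:st|nd|rd|th)\b/y },
  { tag: "number",      regex: /\d+(?:[.,\/-]\d+)*/y },
  { tag: "word",        regex: /[\p{L}\p{M}]+/yu },
  { tag: "punctuation", regex: /[.,;:!?'"()]/y },
];

function tokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    // Whitespace separates tokens but is not emitted as a token.
    if (/\s/.test(text[pos])) {
      pos += 1;
      continue;
    }
    let matched = false;
    for (const { tag, regex } of PATTERNS) {
      regex.lastIndex = pos; // anchor the sticky match at the current position
      const m = regex.exec(text);
      if (m) {
        tokens.push({ value: m[0], tag });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // No pattern applies: emit one full code point with a fallback tag.
      const ch = String.fromCodePoint(text.codePointAt(pos));
      tokens.push({ value: ch, tag: "unknown" });
      pos += ch.length;
    }
  }
  return tokens;
}

console.log(tokenize("Ping @winkjs or john@acme.com about the 2nd #nlp draft, 2.5 pages!"));
```

In this sketch the ordinal pattern is placed before the number pattern so that "2nd" is not split into "2" and "nd", and the email pattern is placed before the mention pattern so that "john@acme.com" is kept as one token. A real tokenizer needs many more patterns and edge-case handling than shown here.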

Playground

You can play around with the tokenizer here.
