Tokenizers

Whitespace

In computer programming, whitespace is any character or series of characters that represents horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but it typically does occupy an area on a page. For example, the common whitespace symbol U+0020 SPACE (also ASCII 32) represents a blank space punctuation character, used as a word divider in Western scripts.
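As a rough illustration of how whitespace can act as a token boundary, the sketch below splits text on runs of whitespace. It assumes that a whitespace tokenizer does nothing more than this; the function name `splitOnWhitespace` is hypothetical and is not part of Datasaur or any library.

```javascript
// Minimal sketch: treat any run of whitespace (space, tab, newline, ...)
// as a separator and drop the empty pieces left at the edges.
function splitOnWhitespace(text) {
  return text.split(/\s+/).filter((piece) => piece.length > 0);
}

// U+0020 SPACE is code point 32, matching its ASCII value.
console.log(" ".codePointAt(0)); // 32

console.log(splitOnWhitespace("Hello,\tworld!\n  How are you? "));
// [ 'Hello,', 'world!', 'How', 'are', 'you?' ]
```

Note that punctuation stays attached to the neighboring word ("Hello,"), which is one reason a pattern-based tokenizer may be preferred.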

Wink-tokenizer

By default, Datasaur uses the wink-tokenizer to separate text into individual tokens, which are grouped by lines of sentences.

"Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS." (source: https://winkjs.org/about.html)

How does it work? It matches patterns in the text using regular expressions (regex). The full list of regexes used by wink-tokenizer is documented at https://winkjs.org/wink-tokenizer/Tokenizer.html#defineConfig. In summary, it parses the following by default:

| Name | Description |
| --- | --- |
| currency | such as $ or £ symbols (r) |
| email | |
| emoji | any standard Unicode emojis, for example: ✅ or ❌ -> 1 UTF-16 unit; 😊 or 😂 or 🎉 -> 2 UTF-16 units; 🇮🇩 or 🇺🇸 -> 4 UTF-16 units (j) |
| emoticon | common emoticons such as :-) or :D (c) |
| hashtag | hash tags such as #happy or #followme (h) |
| number | any integer, decimal number, fractions such as 19, 2.718 or 1/4 and numerals containing ", - / .", for example 12-12-1924 (n) |
| ordinal | ordinals like 1st, 2nd, 3rd, 4th or 12th or 91st (o) |
| punctuation | common punctuation such as ? or , (token becomes fingerprint) |
| quoted_phrase | any "quoted text" in the sentence. Note: its default value is false. (q) |
| symbol | for example ~ or + or & or % or / (token becomes fingerprint) |
| time | common representation of time such as 4pm or 16:00 hours (t) |
| mention | @mention as in github or twitter (m) |
| word | word such as faster or résumé or prévenir (w) |
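The UTF-16 unit counts quoted in the emoji entry can be checked directly in JavaScript, where a string's `length` counts UTF-16 code units rather than visible characters. The helper names below are hypothetical, and whether token offsets are measured in UTF-16 units is an assumption here, not something this page states.

```javascript
// String.prototype.length counts UTF-16 code units.
function utf16Units(text) {
  return text.length;
}

// Spreading a string iterates by Unicode code point instead.
function codePoints(text) {
  return [...text].length;
}

// Escapes are used so the example does not depend on file encoding.
const samples = [
  "\u2705",                 // ✅ (inside the Basic Multilingual Plane)
  "\u{1F60A}",              // 😊 (one code point, stored as a surrogate pair)
  "\u{1F1EE}\u{1F1E9}",     // 🇮🇩 (two regional indicators, two surrogate pairs)
];

for (const emoji of samples) {
  console.log(emoji, utf16Units(emoji), codePoints(emoji));
}
// ✅ 1 1
// 😊 2 1
// 🇮🇩 4 2
```

A single visible emoji can therefore span one, two, or four positions when positions are counted in UTF-16 units.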

To learn more about the regex patterns used, you can view the source code directly at https://github.com/winkjs/wink-tokenizer/blob/master/src/wink-tokenizer.js
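To give a feel for regex-driven tokenization in general, here is a small self-contained sketch. The patterns, the `PATTERNS` table, and the `sketchTokenize` function are all illustrative assumptions: they are deliberately simplified, cover only a few of the token types, and are not the regexes or the API that wink-tokenizer actually uses.

```javascript
// Simplified, illustrative patterns (NOT wink-tokenizer's real regexes).
// Order matters: more specific patterns are tried first.
// The sticky flag (y) forces each match to start exactly at lastIndex.
const PATTERNS = [
  { tag: "email", regex: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/y },
  { tag: "mention", regex: /@\w+/y },
  { tag: "hashtag", regex: /#\w+/y },
  { tag: "ordinal", regex: /\d+(?:st|nd|rd|th)\b/y },
  { tag: "number", regex: /\d+(?:[.,\/-]\d+)*/y },
  { tag: "word", regex: /[\p{L}\p{M}]+/uy },
  { tag: "punctuation", regex: /[.,;:!?'"()]/y },
];

function sketchTokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    // Whitespace separates tokens and is not emitted.
    if (/\s/.test(text[pos])) {
      pos += 1;
      continue;
    }
    let matched = false;
    for (const { tag, regex } of PATTERNS) {
      regex.lastIndex = pos;
      const m = regex.exec(text);
      if (m !== null) {
        tokens.push({ value: m[0], tag });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Fallback: emit one whole code point as a symbol so that
      // surrogate pairs are never split in half.
      const ch = String.fromCodePoint(text.codePointAt(pos));
      tokens.push({ value: ch, tag: "symbol" });
      pos += ch.length;
    }
  }
  return tokens;
}

console.log(sketchTokenize("Mail john@acme.com on the 1st, #happy @datasaur!"));
```

For this input the sketch yields the tags word, email, word, word, ordinal, punctuation, hashtag, mention, punctuation, in that order. Trying patterns in a fixed priority order is a design choice of this sketch; it keeps "john@acme.com" from being split into a word, a mention, and punctuation.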

Playground

You can play around with the tokenizer here.
