Tokenizers
Whitespace
In computer programming, whitespace is any character or series of characters that represents horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but it typically occupies an area on the page. For example, the common whitespace character U+0020 SPACE (ASCII 32) represents a blank space in text and is used as a word divider in Western scripts.
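To make the idea concrete, here is a minimal sketch of whitespace-based splitting in JavaScript. It assumes tokens are simply the runs of non-whitespace characters; the function name `splitOnWhitespace` is hypothetical and this is not necessarily how Datasaur implements its whitespace tokenizer.

```javascript
// Split text on runs of whitespace. In JavaScript regexes, \s matches
// spaces, tabs, line breaks, and other Unicode space characters.
function splitOnWhitespace(text) {
  // filter() drops the empty strings produced by leading or trailing whitespace.
  return text.split(/\s+/).filter((piece) => piece.length > 0);
}

const sample = "Hello\tworld\n  from Datasaur ";
console.log(splitOnWhitespace(sample)); // [ 'Hello', 'world', 'from', 'Datasaur' ]

// U+0020 SPACE is code point 32, the same value as ASCII 32.
console.log(" ".codePointAt(0)); // 32
```

Note that punctuation stays attached to the neighboring word with this approach, for example "world," would remain a single token.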
Wink-tokenizer
By default, Datasaur uses wink-tokenizer to split text into individual tokens, which are then grouped into lines of sentences.
"Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS." (source: https://winkjs.org/about.html)
How does it work? wink-tokenizer matches text against a set of regular expressions (regex). The full list of patterns is documented at https://winkjs.org/wink-tokenizer/Tokenizer.html#defineConfig. In summary, it recognizes the token types below. The letter in parentheses after each description is the code that wink-tokenizer uses for that token type in its fingerprint.
currency
such as $ or £ symbols (r)
email
for example john@acme.com or superman1@gmail.com (e)
emoji
any standard Unicode emoji (j), for example:
- ✅ or ❌ -> 1 UTF-16 unit
- 😊 or 😂 or 🎉 -> 2 UTF-16 units
- 🇮🇩 or 🇺🇸 -> 4 UTF-16 units
emoticon
common emoticons such as :-) or :D (c)
hashtag
hash tags such as #happy or #followme (h)
number
any integer, decimal number, fractions such as 19, 2.718 or 1/4 and numerals containing ", - / .", for example 12-12-1924 (n)
ordinal
ordinals like 1st, 2nd, 3rd, 4th or 12th or 91st (o)
punctuation
common punctuation such as ? or , (the token itself becomes the fingerprint)
quoted_phrase
any "quoted text" in the sentence. Note: its default value is false. (q)
symbol
for example ~ or + or & or % or / (the token itself becomes the fingerprint)
time
common representation of time such as 4pm or 16:00 hours (t)
mention
@mention as in github or twitter (m)
url
URLs such as https://github.com (u)
word
word such as faster or résumé or prévenir (w)
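The UTF-16 unit counts listed for emoji above matter whenever token offsets or lengths are computed in JavaScript, because a string's `length` counts UTF-16 code units rather than visible characters. The short sketch below checks those counts; the helper names `utf16Units` and `codePoints` are hypothetical and used only for illustration.

```javascript
// Number of UTF-16 code units, which is what String.prototype.length reports.
function utf16Units(s) {
  return s.length;
}

// Number of Unicode code points; string iteration walks code points, not units.
function codePoints(s) {
  return [...s].length;
}

// Escapes are used so the exact code points are unambiguous.
const samples = [
  ["\u2705", "check mark"],                      // ✅ -> 1 unit
  ["\u274C", "cross mark"],                      // ❌ -> 1 unit
  ["\u{1F60A}", "smiling face"],                 // 😊 -> 2 units
  ["\u{1F389}", "party popper"],                 // 🎉 -> 2 units
  ["\u{1F1EE}\u{1F1E9}", "flag of Indonesia"],   // 🇮🇩 -> 4 units
];

for (const [emoji, name] of samples) {
  console.log(name, emoji, "units:", utf16Units(emoji), "code points:", codePoints(emoji));
}
```

A flag is two regional indicator code points, each stored as a surrogate pair, which is why it takes 4 UTF-16 units even though it renders as a single symbol.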
For more detail on the regex patterns used, see the source code at https://github.com/winkjs/wink-tokenizer/blob/master/src/wink-tokenizer.js
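Without reproducing the library's code, the general idea of regex-driven tokenization can be sketched as follows. This is only an illustration: the patterns, their order, the matching loop, and the names `PATTERNS` and `sketchTokenize` are all assumptions made for this example. They are much simpler than wink-tokenizer's real patterns and are not its actual algorithm or API.

```javascript
// Hypothetical, simplified patterns; NOT the regexes wink-tokenizer uses.
// Order matters: more specific patterns are tried before more general ones.
// The sticky flag (y) makes each regex match only at the current position.
const PATTERNS = [
  { tag: "url", regex: /https?:\/\/[^\s]+/y },
  { tag: "email", regex: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/y },
  { tag: "mention", regex: /@\w+/y },
  { tag: "hashtag", regex: /#\w+/y },
  { tag: "ordinal", regex: /\d+(?:st|nd|rd|th)\b/y },
  { tag: "number", regex: /\d+(?:[.,\/-]\d+)*/y },
  { tag: "currency", regex: /[$£€¥]/y },
  { tag: "word", regex: /[\p{L}\p{M}]+(?:'\p{L}+)?/uy },
  { tag: "punctuation", regex: /[.,;:!?'"()\[\]{}]/y },
];

function sketchTokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    // Skip whitespace between tokens.
    if (/\s/.test(text[pos])) {
      pos += 1;
      continue;
    }
    let matched = false;
    for (const { tag, regex } of PATTERNS) {
      regex.lastIndex = pos; // try this pattern at the current position only
      const m = regex.exec(text);
      if (m) {
        tokens.push({ value: m[0], tag });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Fallback: take one whole code point (so surrogate pairs stay intact)
      // and tag it as a symbol.
      const ch = String.fromCodePoint(text.codePointAt(pos));
      tokens.push({ value: ch, tag: "symbol" });
      pos += ch.length;
    }
  }
  return tokens;
}

console.log(sketchTokenize("Email john@acme.com about the 2nd item, costs $19.50 #deal"));
```

Trying the patterns in a fixed order is a deliberate choice in this sketch: it lets "john@acme.com" be claimed as an email before the more general word and mention patterns can split it apart.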
Playground
You can play around with the tokenizer here.