Tokenizers
Last updated
By default, Datasaur uses wink-tokenizer to separate text into individual tokens, which are grouped into lines of sentences.
"Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS." (source: the Wink project)
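How does it work? wink-tokenizer matches patterns using regular expressions (regex). The full list of regexes it uses can be seen in detail in the wink-tokenizer source code. In summary, it parses the following token types by default; the letter in parentheses after each description is the code that represents that type in a token fingerprint: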
currency: such as $ or £ symbols (r)
email: for example john@acme.com or superman1@gmail.com (e)
emoji: any standard Unicode emoji (j), for example:
  - ✅ or ❌ -> 1 UTF-16 unit
  - 😊 or 😂 or 🎉 -> 2 UTF-16 units
  - 🇮🇩 or 🇺🇸 -> 4 UTF-16 units
emoticon: common emoticons such as :-) or :D (c)
hashtag: hashtags such as #happy or #followme (h)
number: any integer, decimal number or fraction such as 19, 2.718 or 1/4, and numerals containing ", - / .", for example 12-12-1924 (n)
ordinal: ordinals like 1st, 2nd, 3rd, 4th, 12th or 91st (o)
punctuation: common punctuation such as ? or , (token becomes fingerprint)
quoted_phrase: any "quoted text" in the sentence. Note: its default value is false. (q)
symbol: for example ~ or + or & or % or / (token becomes fingerprint)
time: common representations of time such as 4pm or 16:00 hours (t)
mention: @mention as in GitHub or Twitter (m)
url: a URL (u)
word: words such as faster or résumé or prévenir (w)
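The UTF-16 unit counts quoted for emoji can be checked directly in JavaScript, where a string's length is measured in UTF-16 code units. The following is a minimal sketch; the helper names utf16Units and codePoints are hypothetical and are not part of Datasaur or wink-tokenizer.

```javascript
// Hypothetical helpers for illustration only; not part of wink-tokenizer.

// String.prototype.length counts UTF-16 code units.
function utf16Units(text) {
  return text.length;
}

// Iterating a string walks it by Unicode code point, not by code unit.
function codePoints(text) {
  return Array.from(text).length;
}

const samples = [
  "\u2705",              // check mark emoji, inside the Basic Multilingual Plane
  "\u{1F60A}",           // smiling face, stored as a surrogate pair
  "\u{1F1EE}\u{1F1E9}",  // Indonesian flag: two regional indicator symbols
];

for (const sample of samples) {
  console.log(sample, "units:", utf16Units(sample), "code points:", codePoints(sample));
}
```

A flag takes 4 units because it is built from two code points, each of which needs a surrogate pair. This matters whenever character offsets are computed over text that contains emoji.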
To learn more about the regex patterns used, you can view the wink-tokenizer source code directly.
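To give a feel for regex-based tokenization in general, here is a minimal, self-contained sketch. The patterns are deliberately simplified and are not the regexes wink-tokenizer uses; the function name sketchTokenize, the pattern list, and the "unknown" fallback tag are all assumptions made for illustration.

```javascript
// Simplified, illustrative patterns: NOT the regexes shipped with wink-tokenizer.
// Order matters: more specific patterns are tried before more general ones.
// The sticky flag (y) makes each regex match only at the position in lastIndex.
const PATTERNS = [
  { tag: "email", regex: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/y },
  { tag: "mention", regex: /@\w+/y },
  { tag: "hashtag", regex: /#\w+/y },
  { tag: "ordinal", regex: /\d+(?:st|nd|rd|th)\b/y },
  { tag: "number", regex: /\d+(?:[.,\/-]\d+)*/y },
  { tag: "word", regex: /\p{L}+/uy },
  { tag: "punctuation", regex: /[.,;:!?'"()]/y },
];

function sketchTokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    // Whitespace separates tokens but is not emitted as a token.
    if (/\s/.test(text[pos])) {
      pos += 1;
      continue;
    }
    let matched = false;
    for (const { tag, regex } of PATTERNS) {
      regex.lastIndex = pos; // anchor the sticky regex at the current position
      const match = regex.exec(text);
      if (match) {
        tokens.push({ value: match[0], tag });
        pos += match[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Fallback: emit one whole code point so surrogate pairs are never split.
      const ch = String.fromCodePoint(text.codePointAt(pos));
      tokens.push({ value: ch, tag: "unknown" });
      pos += ch.length;
    }
  }
  return tokens;
}

console.log(sketchTokenize("Email john@acme.com the 2nd time, #happy!"));
```

Trying the patterns in a fixed order is one simple way to resolve overlaps, for example so that 2nd is tagged as an ordinal and not split into a number and a word. The real library may resolve such conflicts differently.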
You can also try out the tokenizer yourself.