Tokenizers
A tokenizer splits text into smaller units called tokens, such as words or sentences, so that the text can be processed further.
Whitespace
Whitespace refers to characters that represent empty space in text, such as spaces, tabs, or line breaks. These characters are not visible when rendered, but they affect text structure and formatting.
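The simplest way to tokenize is to split on whitespace. The sketch below is a minimal illustration of that idea, not part of any library; the function name `splitOnWhitespace` is made up for this example.

```javascript
// Naive whitespace tokenizer: splits on runs of spaces, tabs, or line breaks.
function splitOnWhitespace(text) {
  // trim() avoids empty strings at the start or end of the result
  const trimmed = text.trim();
  return trimmed === "" ? [] : trimmed.split(/\s+/);
}

const tokens = splitOnWhitespace("Hello,\tworld!\n  How are you?");
console.log(tokens);
// [ 'Hello,', 'world!', 'How', 'are', 'you?' ]
```

Note that punctuation stays attached to the neighboring word ("Hello," and "you?"), which is one reason a pattern-based tokenizer is usually preferred.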
Wink tokenizer
The wink tokenizer splits text into tokens, such as words and sentences, so that the text can be processed more easily. It identifies tokens by pattern matching with regular expressions (regex). The list of regex patterns used by wink-tokenizer can be found in its documentation.
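To make the idea of regex-based tokenizing concrete, here is a minimal sketch. The patterns, the `PATTERNS` table, and the `tokenize` function are simplified assumptions written for this example; they are not the regexes or the code that wink-tokenizer actually uses.

```javascript
// Simplified, hypothetical patterns; these are NOT wink-tokenizer's real regexes.
// Order matters: more specific patterns are tried before more general ones.
// The sticky flag (y) makes each regex match only at lastIndex.
const PATTERNS = [
  { tag: "url", regex: /https?:\/\/[^\s]+/y },
  { tag: "hashtag", regex: /#[A-Za-z_][A-Za-z0-9_]*/y },
  { tag: "number", regex: /\d+(?:\.\d+)?/y },
  { tag: "word", regex: /[A-Za-z\u00C0-\u024F]+/y },
  { tag: "punctuation", regex: /[.,;:!?]/y },
];

function tokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    // Skip whitespace between tokens
    if (/\s/.test(text[pos])) {
      pos += 1;
      continue;
    }
    let matched = false;
    for (const { tag, regex } of PATTERNS) {
      regex.lastIndex = pos; // try to match exactly at the current position
      const m = regex.exec(text);
      if (m) {
        tokens.push({ value: m[0], tag });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Anything unrecognized becomes a single-code-point "unknown" token
      const ch = String.fromCodePoint(text.codePointAt(pos));
      tokens.push({ value: ch, tag: "unknown" });
      pos += ch.length;
    }
  }
  return tokens;
}

console.log(tokenize("Read https://github.com now, costs 2.5 #happy"));
```

Each token is returned as an object with a `value` and a `tag`. Trying the URL pattern before the word pattern is a design choice of this sketch: otherwise "https" would be consumed as a plain word.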
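In summary, it recognizes the following token types by default. The letter in parentheses after each description is the code that stands for that token type in a fingerprint; punctuation and symbol tokens act as their own code: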
currency
currency symbols such as $ or £ (r)
email
for example [email protected] or [email protected] (e)
emoji
any standard Unicode emoji (j), for example:
- ✅ or ❌ -> 1 UTF-16 unit
- 😊 or 😂 or 🎉 -> 2 UTF-16 units
- 🇮🇩 or 🇺🇸 -> 4 UTF-16 units
emoticon
common emoticons such as :-) or :D (c)
hashtag
hashtags such as #happy (h)
number
any integer, decimal number, or fraction, such as 19, 2.718, or 1/4, as well as numerals containing a comma, hyphen, slash, or period, for example 12-12-1924 (n)
ordinal
ordinals like 1st, 12th, or 91st (o)
punctuation
common punctuation such as `?` or `,` (the token itself becomes its fingerprint code)
quoted_phrase
any "quoted text" in the sentence (q). Note: this token type is off by default (its default value is false).
symbol
symbols such as ~, +, &, %, or / (the token itself becomes its fingerprint code)
time
common representation of time such as 4pm or 16:00 (t)
mention
an @mention as used on GitHub or Twitter (m)
url
URL such as https://github.com (u)
word
word such as faster or résumé (w)
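The letter codes in the list above can be joined into a fingerprint string that summarizes the shape of a sentence. The sketch below shows that idea under stated assumptions: the `fingerprint` helper, the sample tokens, and the "?" fallback for unlisted tags are hypothetical and written for this example, not taken from the library.

```javascript
// Codes copied from the list above.
const CODES = {
  currency: "r", email: "e", emoji: "j", emoticon: "c", hashtag: "h",
  number: "n", ordinal: "o", quoted_phrase: "q", time: "t",
  mention: "m", url: "u", word: "w",
};

// Hypothetical helper: joins one code per token into a fingerprint string.
function fingerprint(tokens) {
  return tokens
    .map(({ value, tag }) => {
      // Punctuation and symbol tokens contribute their own text
      if (tag === "punctuation" || tag === "symbol") return value;
      // "?" for tags without a listed code is an assumption of this sketch
      return CODES[tag] ?? "?";
    })
    .join("");
}

// Hand-written sample tokens in { value, tag } form
const sample = [
  { value: "@alice", tag: "mention" },
  { value: "arrives", tag: "word" },
  { value: "at", tag: "word" },
  { value: "4pm", tag: "time" },
  { value: "!", tag: "punctuation" },
];
console.log(fingerprint(sample)); // "mwwt!"
```

A fingerprint like this makes it possible to look for token sequences with an ordinary string search or regex instead of walking the token array.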
To learn more about the regex patterns used, you can view the source code here.
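The UTF-16 unit counts given for emoji matter for regex matching, because JavaScript strings and, by default, JavaScript regexes work on UTF-16 code units rather than on visible characters. The short check below uses only standard JavaScript; the helper name `utf16Units` is made up for this example, and the emoji are written as escapes so the code points are unambiguous.

```javascript
// String.prototype.length counts UTF-16 code units, not visible characters.
function utf16Units(s) {
  return s.length;
}

const samples = [
  ["\u2705", "check mark"],                 // ✅
  ["\u274C", "cross mark"],                 // ❌
  ["\u{1F60A}", "smiling face"],            // 😊
  ["\u{1F389}", "party popper"],            // 🎉
  ["\u{1F1EE}\u{1F1E9}", "flag: ID"],       // 🇮🇩 (two regional indicators)
  ["\u{1F1FA}\u{1F1F8}", "flag: US"],       // 🇺🇸 (two regional indicators)
];

for (const [emoji, name] of samples) {
  // Spreading a string iterates by code point
  console.log(name, "units:", utf16Units(emoji), "code points:", [...emoji].length);
}

// Without the u flag, "." matches one code unit, so a 2-unit emoji does not fit.
console.log(/^.$/.test("\u{1F60A}"));  // false
console.log(/^.$/u.test("\u{1F60A}")); // true
```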
Playground
You can also try the tokenizer in the interactive playground to see how it splits text in real time.