Tokenizers

A tokenizer splits text into smaller units called tokens, such as words and sentences, so that the text can be processed.

Whitespace

Whitespace refers to characters that represent empty space in text, such as spaces, tabs, and line breaks. These characters have no visible glyph when rendered, but they affect the structure and formatting of the text.
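As a small illustration of the paragraph above, the following sketch finds the whitespace characters in a string and splits the string on them. The sample text and variable names are made up for this example, and it uses only standard JavaScript regular expressions, not any tokenizer library.

```javascript
// Sample text containing a space, a tab (\t), and a line break (\n).
const sample = "hello world\tfoo\nbar";

// \s matches any single whitespace character (space, tab, line break, ...).
const whitespaceMatches = sample.match(/\s/g);

// Splitting on runs of whitespace leaves only the visible chunks.
const chunks = sample.split(/\s+/);

console.log(whitespaceMatches.length); // 3
console.log(chunks); // [ 'hello', 'world', 'foo', 'bar' ]
```

Splitting on whitespace alone is only a rough first approximation of tokenizing, since punctuation stays attached to the neighboring word.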

Wink tokenizer

The wink tokenizer (wink-tokenizer) splits text into tokens by matching regular expressions (regex) against it. The list of regex patterns used by wink-tokenizer can be found in its documentation.
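To make the idea of regex-based tokenizing concrete, here is a minimal sketch. The patterns, the `PATTERNS` table, and the `tokenize` function are hypothetical and heavily simplified; they are not the patterns or the API that wink-tokenizer actually uses, so treat this only as an illustration of the general technique.

```javascript
// Hypothetical, simplified patterns; NOT the ones shipped with wink-tokenizer.
// The sticky flag (y) forces each match to start exactly at regex.lastIndex.
// Order matters: the first pattern that matches at a position wins.
const PATTERNS = [
  { tag: "url", regex: /https?:\/\/\S+/y },
  { tag: "number", regex: /\d+(?:\.\d+)?/y },
  { tag: "word", regex: /\p{L}+/uy },
  { tag: "punctuation", regex: /[.,!?;:]/y },
];

function tokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    // Whitespace separates tokens but is not a token itself.
    if (/\s/.test(text[pos])) {
      pos += 1;
      continue;
    }
    let matched = false;
    for (const { tag, regex } of PATTERNS) {
      regex.lastIndex = pos;
      const m = regex.exec(text);
      if (m) {
        tokens.push({ value: m[0], tag });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Nothing matched: emit one code point as an "unknown" token.
      const ch = String.fromCodePoint(text.codePointAt(pos));
      tokens.push({ value: ch, tag: "unknown" });
      pos += ch.length;
    }
  }
  return tokens;
}

const tokens = tokenize("My résumé is at https://github.com now, page 2.5!");
for (const t of tokens) {
  console.log(t.tag, t.value);
}
```

In this sketch the `url` pattern is listed before `word` so that `https://github.com` is kept as a single token instead of being broken up at the colon and slashes.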

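In summary, it recognizes the following token types by default. The letter in parentheses after each description is the code that stands for that token type in a fingerprint; punctuation and symbol tokens appear in the fingerprint as themselves.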

| Name | Description |
| --- | --- |
| currency | currency symbols such as $ or £ (r) |
| email | email addresses (e) |
| emoji | any standard Unicode emoji (j); for example, ✅ or ❌ (1 UTF-16 unit), 😊, 😂, or 🎉 (2 UTF-16 units), 🇮🇩 or 🇺🇸 (4 UTF-16 units) |
| emoticon | common emoticons such as :-) or :D (c) |
| hashtag | hashtags such as #happy (h) |
| number | any integer, decimal number, or fraction such as 19, 2.718, or 1/4, and numerals containing `,`, `-`, `/`, or `.`, for example 12-12-1924 (n) |
| ordinal | ordinals like 1st, 12th, or 91st (o) |
| punctuation | common punctuation such as `?` or `,` (token becomes fingerprint) |
| quoted_phrase | any "quoted text" in the sentence. Note: its default value is false. (q) |
| symbol | for example ~ or + or & or % or / (token becomes fingerprint) |
| time | common representations of time such as 4pm or 16:00 (t) |
| mention | @mention as in GitHub or Twitter (m) |
| url | URL such as https://github.com (u) |
| word | word such as faster or résumé (w) |
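The UTF-16 unit counts given for emoji in the list above can be checked with plain JavaScript, where a string's `length` is measured in UTF-16 code units. This is a small sketch with made-up variable names; the emoji are written as escape sequences so that the exact code points are unambiguous.

```javascript
// String.prototype.length counts UTF-16 code units, not visible characters.
const samples = [
  { label: "check mark", text: "\u2705" },                  // ✅
  { label: "smiling face", text: "\u{1F60A}" },             // 😊
  { label: "flag of Indonesia", text: "\u{1F1EE}\u{1F1E9}" }, // 🇮🇩
];

const report = samples.map(({ label, text }) => ({
  label,
  text,
  utf16Units: text.length,       // number of UTF-16 code units
  codePoints: [...text].length,  // spreading a string iterates by code point
}));

for (const row of report) {
  console.log(`${row.text} ${row.label}: ${row.utf16Units} UTF-16 unit(s), ${row.codePoints} code point(s)`);
}
```

The flag takes 4 units because it is built from two regional indicator code points, each of which needs a surrogate pair in UTF-16. This matters whenever token positions or lengths are counted in string indices.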

To learn more about the regex patterns used, you can view the source code.

Playground

You can also try the tokenizer in the interactive playground to see how it splits text in real time.
