# Tokenizers

## Whitespace

Whitespace refers to characters that represent empty space in text, such as spaces, tabs, or line breaks. These characters are not visible when rendered, but they affect text structure and formatting.
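As a small illustration of the definition above, the following Python sketch picks out the whitespace characters in a string and splits on them. The sample string is made up for this example and is not part of any tokenizer configuration.

```python
# A sample string containing a space, a tab, and a line break.
text = "alpha beta\tgamma\ndelta"

# str.isspace() is True for whitespace characters.
whitespace_chars = [ch for ch in text if ch.isspace()]

# str.split() with no arguments splits on runs of whitespace.
words = text.split()

print(whitespace_chars)  # [' ', '\t', '\n']
print(words)             # ['alpha', 'beta', 'gamma', 'delta']
```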

## Wink tokenizer

wink-tokenizer is a tool that splits text into smaller units called tokens, such as words, numbers, and punctuation, so the text can be processed more easily. It works by matching the text against regular expressions (regex) and tagging each token with its type. The list of regex patterns used by wink-tokenizer can be found in its [documentation](https://winkjs.org/wink-tokenizer/Tokenizer.html#defineConfig).
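To make the regex-based approach concrete, here is a minimal Python sketch of a tokenizer that tags each token by the first pattern that matches. The patterns, the `PATTERNS` list, and the `tokenize` function are hypothetical simplifications written for this page. They are not the patterns or the API of wink-tokenizer, which is a JavaScript library.

```python
import re

# Hypothetical, simplified patterns; NOT the ones wink-tokenizer ships with.
# Order matters: earlier patterns win when several could match at a position.
PATTERNS = [
    ("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("url", r"https?://[^\s]+"),
    ("hashtag", r"#\w+"),
    ("mention", r"@\w+"),
    ("number", r"\d+(?:\.\d+)?"),
    ("word", r"[^\W\d_]+"),
    ("punctuation", r"[.,;:!?]"),
]

# Combine everything into one regex with a named group per token type.
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS))


def tokenize(text):
    """Return a list of {"value", "tag"} dicts; unmatched characters are skipped."""
    return [{"value": m.group(), "tag": m.lastgroup} for m in TOKEN_RE.finditer(text)]


for token in tokenize("Email john@acme.com about #happy 2.718!"):
    print(token["tag"], token["value"])
```

Whitespace matches none of the patterns, so it is dropped and acts only as a separator between tokens.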

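In summary, wink-tokenizer recognizes the following token types. All are enabled by default except `quoted_phrase`. The letter in parentheses after each description is the fingerprint code for that token type; for punctuation and symbols, the token itself serves as the fingerprint.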

<table><thead><tr><th width="188.08203125">Name</th><th>Description</th></tr></thead><tbody><tr><td>currency</td><td>such as <code>$</code> or <code>£</code> symbols (r)</td></tr><tr><td>email</td><td>for example <code>john@acme.com</code> or <code>superman1@gmail.com</code> (e)</td></tr><tr><td>emoji</td><td>any standard Unicode emojis, for example:<br>- ✅ or ❌ -> 1 UTF-16 unit<br>- 😊 or 😂 or 🎉 -> 2 UTF-16 units<br>- 🇮🇩 or 🇺🇸 -> 4 UTF-16 units<br>(j)</td></tr><tr><td>emoticon</td><td>common emoticons such as <code>:-)</code> or <code>:D</code> (c)</td></tr><tr><td>hashtag</td><td>hashtags such as <code>#happy</code> (h)</td></tr><tr><td>number</td><td>any integer, decimal number, or fraction, such as <code>19</code>, <code>2.718</code>, or <code>1/4</code>, and numerals containing <code>,</code> <code>-</code> <code>/</code> <code>.</code>, for example <code>12-12-1924</code> (n)</td></tr><tr><td>ordinal</td><td>ordinals like <code>1st</code>, <code>12th</code>, or <code>91st</code> (o)</td></tr><tr><td>punctuation</td><td>common punctuation such as <code>?</code> or <code>,</code> (token becomes fingerprint)</td></tr><tr><td>quoted_phrase</td><td>any <code>"quoted text"</code> in the sentence. Note: its default value is false. (q)</td></tr><tr><td>symbol</td><td>for example <code>~</code> or <code>+</code> or <code>&#x26;</code> or <code>%</code> or <code>/</code> (token becomes fingerprint)</td></tr><tr><td>time</td><td>common representations of time such as <code>4pm</code> or <code>16:00</code> (t)</td></tr><tr><td>mention</td><td><code>@mention</code> as in GitHub or Twitter (m)</td></tr><tr><td>url</td><td>URL such as <code>https://github.com</code> (u)</td></tr><tr><td>word</td><td>word such as <code>faster</code> or <code>résumé</code> (w)</td></tr></tbody></table>
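The UTF-16 unit counts in the emoji row can be checked with a short Python sketch. The helper name `utf16_units` is made up for this example. It relies on the fact that every UTF-16 code unit occupies two bytes, and the emojis are written as escape sequences so that no hidden variation selectors affect the count.

```python
def utf16_units(text):
    """Count UTF-16 code units: each unit is 2 bytes in UTF-16-LE (no BOM)."""
    return len(text.encode("utf-16-le")) // 2


check_mark = "\u2705"                # ✅, inside the Basic Multilingual Plane
smiling_face = "\U0001F60A"          # 😊, needs a surrogate pair
flag_id = "\U0001F1EE\U0001F1E9"     # 🇮🇩, two regional indicator symbols

print(utf16_units(check_mark))    # 1
print(utf16_units(smiling_face))  # 2
print(utf16_units(flag_id))       # 4
```

These counts matter when token positions are expressed as string offsets in JavaScript, where string length is measured in UTF-16 code units rather than in visible characters.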

To learn more about the regex patterns used, you can view [the source code here](https://github.com/winkjs/wink-tokenizer/blob/master/src/wink-tokenizer.js).

### Playground

You can also try the tokenizer in the [interactive playground](https://wjjrd.csb.app/) to see how it splits text in real time.

