# Tokenizers

## Whitespace

In [computer programming](https://en.wikipedia.org/wiki/Computer_programming), [**whitespace**](https://en.wikipedia.org/wiki/Whitespace_character) is any [character](https://en.wikipedia.org/wiki/Character_\(computing\)) or series of characters that represents horizontal or vertical [space](https://en.wikipedia.org/wiki/Space_\(punctuation\)) in [typography](https://en.wikipedia.org/wiki/Typography). When rendered, a whitespace character does not correspond to a visible mark, but it typically does occupy an area on the page. For example, the common whitespace character U+0020 SPACE (also [ASCII](https://en.wikipedia.org/wiki/ASCII) 32) represents a [blank space](https://en.wikipedia.org/wiki/Space_\(punctuation\)) and is used as a [word divider](https://en.wikipedia.org/wiki/Word_divider) in Western [scripts](https://en.wikipedia.org/wiki/Writing_system).
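As a rough illustration, whitespace-based tokenization can be sketched as splitting the text on runs of whitespace characters. The helper name `splitOnWhitespace` is hypothetical, and this is only a minimal sketch of the general idea, not necessarily how Datasaur implements it.

```javascript
// Hypothetical helper for illustration; not part of Datasaur or any library.
function splitOnWhitespace(text) {
  // \s+ matches runs of spaces, tabs, and line breaks.
  // filter(Boolean) drops the empty strings produced by leading or
  // trailing whitespace.
  return text.split(/\s+/).filter(Boolean);
}

const whitespaceTokens = splitOnWhitespace("  The quick\tbrown\nfox.  ");
console.log(whitespaceTokens); // [ 'The', 'quick', 'brown', 'fox.' ]
```

Note that punctuation stays attached to the neighboring word (`fox.`), because only whitespace acts as a separator in this approach.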

## Wink-tokenizer

By default, Datasaur uses [wink-tokenizer](https://winkjs.org/wink-tokenizer/) to split text into individual tokens, which are grouped into lines (sentences).

*"Wink is a family of open source packages for **Statistical Analysis**, **Natural Language Processing** and **Machine Learning** in NodeJS."* (source: <https://winkjs.org/about.html>)

**How does it work?** It matches the text against a set of regular expressions (regex), one per token type. The full list of regexes used by wink-tokenizer is documented at <https://winkjs.org/wink-tokenizer/Tokenizer.html#defineConfig>. In summary, it recognizes the following token types, all enabled by default unless noted otherwise:

| Name           | Description                                                                                                                                | Fingerprint code |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ---------------- |
| currency       | such as $ or £ symbols                                                                                                                     | r                |
| email          | for example <john@acme.com> or <superman1@gmail.com>                                                                                       | e                |
| emoji          | any standard Unicode emoji, for example:<br>- ✅ or ❌: 1 UTF-16 unit<br>- 😊 or 😂 or 🎉: 2 UTF-16 units<br>- 🇮🇩 or 🇺🇸: 4 UTF-16 units | j                |
| emoticon       | common emoticons such as :-) or :D                                                                                                         | c                |
| hashtag        | hash tags such as #happy or #followme                                                                                                      | h                |
| number         | any integer, decimal number, or fraction such as 19, 2.718 or 1/4, and numerals containing ", - / .", for example 12-12-1924               | n                |
| ordinal        | ordinals like 1st, 2nd, 3rd, 4th or 12th or 91st                                                                                           | o                |
| punctuation    | common punctuation such as ? or ,                                                                                                          | the token itself |
| quoted\_phrase | any "quoted text" in the sentence. Note: its default value is false.                                                                       | q                |
| symbol         | for example \~ or + or & or % or /                                                                                                         | the token itself |
| time           | common representation of time such as 4pm or 16:00 hours                                                                                   | t                |
| mention        | @mention as in GitHub or Twitter                                                                                                           | m                |
| url            | URL such as <https://github.com>                                                                                                           | u                |
| word           | word such as faster or résumé or prévenir                                                                                                  | w                |
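The UTF-16 unit counts listed for emoji can be checked directly in JavaScript, where `String.prototype.length` counts UTF-16 code units rather than visible characters. The snippet below is a small sketch. The variable names are illustrative, and the emoji are written as escape sequences so the code points are unambiguous.

```javascript
// String.prototype.length counts UTF-16 code units, not visible characters.
const emojiSamples = {
  checkMark: "\u2705",                 // ✅ U+2705, inside the Basic Multilingual Plane
  smile: "\u{1F60A}",                  // 😊 U+1F60A, stored as a surrogate pair
  flagIndonesia: "\u{1F1EE}\u{1F1E9}", // 🇮🇩 two regional indicator symbols
};

// Map each sample name to the number of UTF-16 code units it occupies.
const unitCounts = Object.fromEntries(
  Object.entries(emojiSamples).map(([name, emoji]) => [name, emoji.length])
);

console.log(unitCounts); // { checkMark: 1, smile: 2, flagIndonesia: 4 }
```

This difference is worth keeping in mind whenever token positions are derived from JavaScript string indices, since a single visible emoji can span several index positions.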

To see the exact regex patterns, you can view the source code directly at <https://github.com/winkjs/wink-tokenizer/blob/master/src/wink-tokenizer.js>.
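To make the idea of regex-driven tokenization more concrete, here is a minimal, dependency-free sketch. It is not wink-tokenizer's implementation: the patterns are deliberately simplified, they cover only a few of the token types above, and the names `SKETCH_PATTERNS` and `sketchTokenize` are hypothetical. The sketch tries each pattern at the current position, in order from most to least specific, and emits the first match together with its tag.

```javascript
// Illustrative patterns only; these are NOT the regexes wink-tokenizer uses.
// The sticky flag (y) makes each regex match exactly at lastIndex.
// Order matters: more specific patterns come before more general ones.
const SKETCH_PATTERNS = [
  { tag: "email", regex: /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/y },
  { tag: "url", regex: /https?:\/\/[^\s]+/y },
  { tag: "hashtag", regex: /#\w+/y },
  { tag: "mention", regex: /@\w+/y },
  { tag: "number", regex: /\d+(?:[.,\/-]\d+)*/y },
  { tag: "word", regex: /[\p{L}\p{M}]+(?:'[\p{L}\p{M}]+)*/uy },
  { tag: "punctuation", regex: /[.,;:!?'"()\[\]{}]/y },
];

function sketchTokenize(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    // Whitespace separates tokens but is not emitted as a token.
    if (/\s/.test(text[pos])) {
      pos += 1;
      continue;
    }
    let matched = false;
    for (const { tag, regex } of SKETCH_PATTERNS) {
      regex.lastIndex = pos;
      const match = regex.exec(text);
      if (match) {
        tokens.push({ value: match[0], tag });
        pos += match[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // Fallback: emit one whole code point (not half a surrogate pair)
      // and tag it as a symbol.
      const ch = String.fromCodePoint(text.codePointAt(pos));
      tokens.push({ value: ch, tag: "symbol" });
      pos += ch.length;
    }
  }
  return tokens;
}

const demoTokens = sketchTokenize("Email john@acme.com about #happy at 2.718, ok!");
console.log(demoTokens.map((t) => `${t.value}/${t.tag}`).join(" "));
// Email/word john@acme.com/email about/word #happy/hashtag at/word 2.718/number ,/punctuation ok/word !/punctuation
```

Trying the patterns in a fixed order is a design choice of this sketch: it keeps `john@acme.com` from being split into a word, a mention, and punctuation. The real tokenizer's patterns and precedence are best read from its source.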

### **Playground**

You can experiment with the tokenizer in this [playground](https://wjjrd.csb.app/).
