Tokenizers
Whitespace
In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the common whitespace symbol U+0020 SPACE (also ASCII 32) represents a blank space punctuation character in text, used as a word divider in Western scripts.
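As a rough illustration of how whitespace can act as a token boundary, the sketch below splits text on runs of whitespace characters and records where each token starts and ends. The function name `whitespace_tokenize` is hypothetical and is not part of Datasaur or any library; the sketch only assumes that a whitespace-based tokenizer treats every maximal run of non-whitespace characters as one token.

```python
import re


def whitespace_tokenize(text):
    """Split text on whitespace, keeping character offsets for each token.

    Hypothetical helper for illustration only. \\S+ matches each maximal
    run of non-whitespace characters, so spaces, tabs, and newlines all
    act as separators and never appear inside a token.
    """
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


if __name__ == "__main__":
    for token, start, end in whitespace_tokenize("Hello,  world!\n"):
        print(f"{token!r} spans characters {start} to {end}")
```

Note that punctuation stays attached to the neighbouring word here ("Hello," is a single token), which is one reason a pattern-based tokenizer is often preferred.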
Wink-tokenizer
By default, Datasaur uses wink-tokenizer to split text into individual tokens, which are then grouped by line (sentence).
"Wink is a family of open source packages for Statistical Analysis, Natural Language Processing and Machine Learning in NodeJS." (source: https://winkjs.org/about.html)
How does it work? wink-tokenizer splits text by pattern matching with regular expressions (regex). The regexes it uses are listed in detail at https://winkjs.org/wink-tokenizer/Tokenizer.html#defineConfig. In summary, it will parse the following by default:
To learn more details about the regex patterns used, you can directly view the source code at https://github.com/winkjs/wink-tokenizer/blob/master/src/wink-tokenizer.js
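The general idea of regex-based tokenization can be sketched in a few lines. The example below is a simplified Python illustration, not wink-tokenizer itself: the pattern list, the tag names, and the `tokenize` function are assumptions made for this sketch, and the real patterns in the wink-tokenizer source are more extensive. Each token is returned with a value and a tag, a shape chosen here for readability.

```python
import re

# Hypothetical, simplified patterns; NOT the regexes shipped with wink-tokenizer.
# Order matters: when several patterns could match at the same position,
# the one listed first wins.
TOKEN_PATTERNS = [
    ("email", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    ("number", r"\d+(?:\.\d+)?"),
    ("word", r"[^\W\d_]+(?:'[^\W\d_]+)?"),
    ("punctuation", r"[^\w\s]"),
]

# Combine all patterns into one alternation of named groups, so the name of
# the group that matched tells us the token's tag.
MASTER = re.compile(
    "|".join(f"(?P<{tag}>{pattern})" for tag, pattern in TOKEN_PATTERNS)
)


def tokenize(text):
    """Return a list of {"value", "tag"} dicts for every pattern match.

    Whitespace, and any character that matches no pattern, is skipped.
    """
    return [{"value": m.group(), "tag": m.lastgroup} for m in MASTER.finditer(text)]


if __name__ == "__main__":
    for token in tokenize("Email me at jo@example.com, 2 times!"):
        print(token)
```

Listing the more specific patterns first is a deliberate choice in this sketch: the email pattern is tried before the word pattern, so an address is kept as a single token instead of being broken up at the `@` and `.` characters.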
Playground
You can play around with the tokenizer here.