# NLTK

**Supported Labeling Types**: `Span labeling`

NLTK (Natural Language Toolkit) is an open-source Python library for natural language processing (NLP). It provides tools for text preprocessing such as tokenization, stemming, lemmatization, part-of-speech tagging, and more. In the context of our labeling platform, NLTK can be integrated to support various preprocessing tasks that help improve label consistency and model training quality. Its ease of use and rich set of linguistic resources make it a useful option for preparing and analyzing text data before or during the labeling process.

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-a062623c5beca7e8f81bd2b4597c452d46e6f3f5%2FExtension%20-%20ML-assisted%20Labeling%20-%20Span%20labeling%20-%20NLTK%20-%20highlight.png?alt=media" alt="Image of ML Assisted with NLTK"><figcaption><p>ML Assisted with NLTK</p></figcaption></figure>

### Model Details

* NLTK POS-tagging is performed using `nltk.pos_tag`, which internally utilizes the `nltk.PerceptronTagger`. This is a fast and accurate approach for part-of-speech tagging in English.
* The underlying models are trained on the Wall Street Journal section of the Penn Treebank, providing strong performance on formal, edited text common in business documents.
* The tagger assigns grammatical categories to words based on the **UPenn Treebank Tagset**, which includes categories like nouns, verbs, adjectives, adverbs, and more.
* The tagger is bundled in the Datasaur Intelligence container, so it runs consistently without installing additional dependencies.

> Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal. See more implementation details here: <https://explosion.ai/blog/part-of-speech-pos-tagger-in-python>
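
As a minimal sketch of the tagging call described above (not Datasaur's actual integration code), the snippet below wraps `nltk.pos_tag` and maps each tagged token back to character offsets, which is the shape a span label typically needs. The helper names `tag_tokens` and `to_spans` are hypothetical, and the sketch assumes the tokens appear verbatim in the text. Running `tag_tokens` outside the container also assumes the perceptron tagger model has already been fetched with `nltk.download`; the exact resource name depends on the NLTK version.

```python
from typing import List, Tuple

# (start offset, end offset, token, POS tag); an illustrative shape only.
Span = Tuple[int, int, str, str]


def tag_tokens(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag pre-tokenized English text with NLTK's default tagger."""
    # Imported lazily so the offset helper below works without NLTK installed.
    import nltk

    return nltk.pos_tag(tokens)


def to_spans(text: str, tagged: List[Tuple[str, str]]) -> List[Span]:
    """Locate each tagged token in `text` and return character-offset spans."""
    spans: List[Span] = []
    cursor = 0
    for token, tag in tagged:
        start = text.find(token, cursor)
        if start == -1:
            # The tokenizer altered this token; skip it rather than guess.
            continue
        end = start + len(token)
        spans.append((start, end, token, tag))
        cursor = end
    return spans
```

For a sentence split on whitespace, `to_spans(sentence, tag_tokens(sentence.split()))` would return one span per token, each carrying a tag from the tables in the appendix.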

### Usage

* Text preprocessing for consistency and model training improvements.
* Supports syntactic analysis in annotation workflows.
* Tag set: [UPenn Treebank Tag Set](https://docs.datasaur.ai/assisted-labeling/ml-assisted-labeling/nltk#upenn-treebank-tagset)
  * The detailed tag list is available [here](https://docs.datasaur.ai/assisted-labeling/ml-assisted-labeling/nltk#appendix).
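
For syntactic analysis, a workflow often needs only a coarse category rather than the full Treebank tag. The sketch below collapses tags by prefix and filters tagged tokens accordingly. The `COARSE_PREFIXES` mapping and the function names are illustrative assumptions, not part of NLTK or the platform.

```python
from typing import Iterable, List, Set, Tuple

# Illustrative prefix-to-category mapping over Penn Treebank tags.
COARSE_PREFIXES = (
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
)


def coarse_category(tag: str) -> str:
    """Return a coarse category for a Treebank tag, or 'other'."""
    for prefix, category in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return category
    return "other"


def keep_categories(
    tagged: Iterable[Tuple[str, str]], wanted: Set[str]
) -> List[Tuple[str, str]]:
    """Keep only the (token, tag) pairs whose coarse category is wanted."""
    return [(tok, tag) for tok, tag in tagged if coarse_category(tag) in wanted]
```

Matching on prefixes keeps the mapping short, at the cost of sending every tag outside the four groups (for example `RP` or `WRB`) to `other`.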

### References

* <https://www.nltk.org/book/ch05.html>
* <https://www.nltk.org/api/nltk.tag.html>
* <https://www.nltk.org/api/nltk.tag.perceptron.html>

## Appendix

### NLTK Treebank

<table><thead><tr><th width="374">Tag</th><th>Description</th></tr></thead><tbody><tr><td><code>$</code></td><td>dollar e.g. <code>$</code>, <code>-$</code>, <code>--$</code>, <code>A$</code>, <code>C$</code>, <code>HK$</code>, <code>M$</code>, <code>NZ$</code>, <code>S$</code>, <code>U.S.$</code>, <code>US$</code></td></tr><tr><td><code>''</code></td><td>closing quotation mark e.g. <code>'</code>, <code>''</code></td></tr><tr><td><code>(</code></td><td>opening parenthesis e.g. <code>(</code>, <code>[</code>, <code>{</code></td></tr><tr><td><code>)</code></td><td>closing parenthesis e.g. <code>)</code>, <code>]</code>, <code>}</code></td></tr><tr><td><code>,</code></td><td>comma e.g. <code>,</code></td></tr><tr><td><code>--</code></td><td>dash e.g. <code>--</code></td></tr><tr><td><code>.</code></td><td>sentence terminator e.g. <code>.</code>, <code>!</code>, <code>?</code></td></tr><tr><td><code>:</code></td><td>colon or ellipsis e.g. <code>:</code>, <code>;</code>, <code>...</code></td></tr><tr><td><code>``</code></td><td>opening quotation mark e.g. <code>`</code>, <code>``</code></td></tr></tbody></table>

### Treebank Tagset

<table><thead><tr><th width="288">Tag</th><th>Description</th></tr></thead><tbody><tr><td><code>CC</code></td><td>conjunction, coordinating e.g. <code>&#x26;</code>, <code>'n</code>, <code>and</code>, <code>both</code>, <code>but</code>, <code>either</code>, <code>et</code>, <code>for</code>, <code>less</code>, <code>minus</code>, <code>neither</code>, <code>nor</code>, <code>or</code>, <code>plus</code>, <code>so</code>, <code>therefore</code>, <code>times</code>, <code>v.</code>, <code>versus</code>, <code>vs.</code>, <code>whether</code>, <code>yet</code></td></tr><tr><td><code>CD</code></td><td>numeral, cardinal e.g. <code>mid-1890</code>, <code>nine-thirty</code>, <code>forty-two</code>, <code>one-tenth</code>, <code>ten</code>, <code>million</code>, <code>0.5</code>, <code>one</code>, <code>forty-seven</code>, <code>1987</code>, <code>twenty</code>, <code>'79</code>, <code>zero</code>, <code>two</code>, <code>78-degrees</code>, <code>eighty-four</code>, <code>IX</code>, <code>'60s</code>, <code>.025</code>, <code>fifteen</code>, <code>271,124</code>, <code>dozen</code>, <code>quintillion</code>, <code>DM2,000</code>, ...</td></tr><tr><td><code>DT</code></td><td>determiner e.g. <code>all</code>, <code>an</code>, <code>another</code>, <code>any</code>, <code>both</code>, <code>del</code>, <code>each</code>, <code>either</code>, <code>every</code>, <code>half</code>, <code>la</code>, <code>many</code>, <code>much</code>, <code>nary</code>, <code>neither</code>, <code>no</code>, <code>some</code>, <code>such</code>, <code>that</code>, <code>the</code>, <code>them</code>, <code>these</code>, <code>this</code>, <code>those</code></td></tr><tr><td><code>EX</code></td><td>existential there e.g. <code>there</code></td></tr><tr><td><code>FW</code></td><td>foreign word e.g. 
<code>gemeinschaft</code>, <code>hund</code>, <code>ich</code>, <code>jeux</code>, <code>habeas</code>, <code>Haementeria</code>, <code>Herr</code>, <code>K'ang-si</code>, <code>vous</code>, <code>lutihaw</code>, <code>alai</code>, <code>je</code>, <code>jour</code>, <code>objets</code>, <code>salutaris</code>, <code>fille</code>, <code>quibusdam</code>, <code>pas</code>, <code>trop</code>, <code>Monte</code>, <code>terram</code>, <code>fiche</code>, <code>oui</code>, <code>corporis</code>, ...</td></tr><tr><td><code>IN</code></td><td>preposition or conjunction, subordinating e.g. <code>astride</code>, <code>among</code>, <code>uppon</code>, <code>whether</code>, <code>out</code>, <code>inside</code>, <code>pro</code>, <code>despite</code>, <code>on</code>, <code>by</code>, <code>throughout</code>, <code>below</code>, <code>within</code>, <code>for</code>, <code>towards</code>, <code>near</code>, <code>behind</code>, <code>atop</code>, <code>around</code>, <code>if</code>, <code>like</code>, <code>until</code>, <code>below</code>, <code>next</code>, <code>into</code>, <code>if</code>, <code>beside</code>, ...</td></tr><tr><td><code>JJ</code></td><td>adjective or numeral, ordinal e.g. <code>third</code>, <code>ill-mannered</code>, <code>pre-war</code>, <code>regrettable</code>, <code>oiled</code>, <code>calamitous</code>, <code>first</code>, <code>separable</code>, <code>ectoplasmic</code>, <code>battery-powered</code>, <code>participatory</code>, <code>fourth</code>, <code>still-to-be-named</code>, <code>multilingual</code>, <code>multi-disciplinary</code>, ...</td></tr><tr><td><code>JJR</code></td><td>adjective, comparative e.g. 
<code>bleaker</code>, <code>braver</code>, <code>breezier</code>, <code>briefer</code>, <code>brighter</code>, <code>brisker</code>, <code>broader</code>, <code>bumper</code>, <code>busier</code>, <code>calmer</code>, <code>cheaper</code>, <code>choosier</code>, <code>cleaner</code>, <code>clearer</code>, <code>closer</code>, <code>colder</code>, <code>commoner</code>, <code>costlier</code>, <code>cozier</code>, <code>creamier</code>, <code>crunchier</code>, <code>cuter</code>, ...</td></tr><tr><td><code>JJS</code></td><td>adjective, superlative e.g. <code>calmest</code>, <code>cheapest</code>, <code>choicest</code>, <code>classiest</code>, <code>cleanest</code>, <code>clearest</code>, <code>closest</code>, <code>commonest</code>, <code>corniest</code>, <code>costliest</code>, <code>crassest</code>, <code>creepiest</code>, <code>crudest</code>, <code>cutest</code>, <code>darkest</code>, <code>deadliest</code>, <code>dearest</code>, <code>deepest</code>, <code>densest</code>, <code>dinkiest</code>, ...</td></tr><tr><td><code>LS</code></td><td>list item marker e.g. <code>A</code>, <code>A.</code>, <code>B</code>, <code>B.</code>, <code>C</code>, <code>C.</code>, <code>D</code>, <code>E</code>, <code>F</code>, <code>First</code>, <code>G</code>, <code>H</code>, <code>I</code>, <code>J</code>, <code>K</code>, <code>One</code>, <code>SP-44001</code>, <code>SP-44002</code>, <code>SP-44005</code>, <code>SP-44007</code>, <code>Second</code>, <code>Third</code>, <code>Three</code>, <code>Two</code>, <code>*</code>, <code>a</code>, <code>b</code>, <code>c</code>, <code>d</code>, <code>first</code>, <code>five</code>, <code>four</code>, <code>one</code>, <code>six</code>, <code>three</code>, <code>two</code></td></tr><tr><td><code>MD</code></td><td>modal auxiliary e.g. 
<code>can</code>, <code>cannot</code>, <code>could</code>, <code>couldn't</code>, <code>dare</code>, <code>may</code>, <code>might</code>, <code>must</code>, <code>need</code>, <code>ought</code>, <code>shall</code>, <code>should</code>, <code>shouldn't</code>, <code>will</code>, <code>would</code></td></tr><tr><td><code>NN</code></td><td>noun, common, singular or mass e.g. <code>common-carrier</code>, <code>cabbage</code>, <code>knuckle-duster</code>, <code>Casino</code>, <code>afghan</code>, <code>shed</code>, <code>thermostat</code>, <code>investment</code>, <code>slide</code>, <code>humour</code>, <code>falloff</code>, <code>slick</code>, <code>wind</code>, <code>hyena</code>, <code>override</code>, <code>subhumanity</code>, <code>machinist</code>, ...</td></tr><tr><td><code>NNP</code></td><td>noun, proper, singular e.g. <code>Motown</code>, <code>Venneboerger</code>, <code>Czestochwa</code>, <code>Ranzer</code>, <code>Conchita</code>, <code>Trumplane</code>, <code>Christos</code>, <code>Oceanside</code>, <code>Escobar</code>, <code>Kreisler</code>, <code>Sawyer</code>, <code>Cougar</code>, <code>Yvette</code>, <code>Ervin</code>, <code>ODI</code>, <code>Darryl</code>, <code>CTCA</code>, <code>Shannon</code>, <code>A.K.C.</code>, <code>Meltex</code>, <code>Liverpool</code>, ...</td></tr><tr><td><code>NNPS</code></td><td>noun, proper, plural e.g. <code>Americans</code>, <code>Americas</code>, <code>Amharas</code>, <code>Amityvilles</code>, <code>Amusements</code>, <code>Anarcho-Syndicalists</code>, <code>Andalusians</code>, <code>Andes</code>, <code>Andruses</code>, <code>Angels</code>, <code>Animals</code>, <code>Anthony</code>, <code>Antilles</code>, <code>Antiques</code>, <code>Apache</code>, <code>Apaches</code>, <code>Apocrypha</code>, ...</td></tr><tr><td><code>NNS</code></td><td>noun, common, plural e.g. 
<code>undergraduates</code>, <code>scotches</code>, <code>bric-a-brac</code>, <code>products</code>, <code>bodyguards</code>, <code>facets</code>, <code>coasts</code>, <code>divestitures</code>, <code>storehouses</code>, <code>designs</code>, <code>clubs</code>, <code>fragrances</code>, <code>averages</code>, <code>subjectivists</code>, <code>apprehensions</code>, <code>muses</code>, <code>factory-jobs</code>, ...</td></tr><tr><td><code>PDT</code></td><td>pre-determiner e.g. <code>all</code>, <code>both</code>, <code>half</code>, <code>many</code>, <code>quite</code>, <code>such</code>, <code>sure</code>, <code>this</code></td></tr><tr><td><code>POS</code></td><td>genitive marker e.g. <code>'</code>, <code>'s</code></td></tr><tr><td><code>PRP</code></td><td>pronoun, personal e.g. <code>hers</code>, <code>herself</code>, <code>him</code>, <code>himself</code>, <code>hisself</code>, <code>it</code>, <code>itself</code>, <code>me</code>, <code>myself</code>, <code>one</code>, <code>oneself</code>, <code>ours</code>, <code>ourselves</code>, <code>ownself</code>, <code>self</code>, <code>she</code>, <code>thee</code>, <code>theirs</code>, <code>them</code>, <code>themselves</code>, <code>they</code>, <code>thou</code>, <code>thy</code>, <code>us</code></td></tr><tr><td><code>PRP$</code></td><td>pronoun, possessive e.g. <code>her</code>, <code>his</code>, <code>mine</code>, <code>my</code>, <code>our</code>, <code>ours</code>, <code>their</code>, <code>thy</code>, <code>your</code></td></tr><tr><td><code>RB</code></td><td>adverb e.g. <code>occasionally</code>, <code>unabatingly</code>, <code>maddeningly</code>, <code>adventurously</code>, <code>professedly</code>, <code>stirringly</code>, <code>prominently</code>, <code>technologically</code>, <code>magisterially</code>, <code>predominately</code>, <code>swiftly</code>, <code>fiscally</code>, <code>pitilessly</code>, ...</td></tr><tr><td><code>RBR</code></td><td>adverb, comparative e.g. 
<code>further</code>, <code>gloomier</code>, <code>grander</code>, <code>graver</code>, <code>greater</code>, <code>grimmer</code>, <code>harder</code>, <code>harsher</code>, <code>healthier</code>, <code>heavier</code>, <code>higher</code>, <code>however</code>, <code>larger</code>, <code>later</code>, <code>leaner</code>, <code>lengthier</code>, <code>less-perfectly</code>, <code>lesser</code>, <code>lonelier</code>, <code>longer</code>, <code>louder</code>, <code>lower</code>, <code>more</code>, ...</td></tr><tr><td><code>RBS</code></td><td>adverb, superlative e.g. <code>best</code>, <code>biggest</code>, <code>bluntest</code>, <code>earliest</code>, <code>farthest</code>, <code>first</code>, <code>furthest</code>, <code>hardest</code>, <code>heartiest</code>, <code>highest</code>, <code>largest</code>, <code>least</code>, <code>less</code>, <code>most</code>, <code>nearest</code>, <code>second</code>, <code>tightest</code>, <code>worst</code></td></tr><tr><td><code>RP</code></td><td>particle e.g. <code>aboard</code>, <code>about</code>, <code>across</code>, <code>along</code>, <code>apart</code>, <code>around</code>, <code>aside</code>, <code>at</code>, <code>away</code>, <code>back</code>, <code>before</code>, <code>behind</code>, <code>by</code>, <code>crop</code>, <code>down</code>, <code>ever</code>, <code>fast</code>, <code>for</code>, <code>forth</code>, <code>from</code>, <code>go</code>, <code>high</code>, <code>i.e.</code>, <code>in</code>, <code>into</code>, <code>just</code>, <code>later</code>, <code>low</code>, <code>more</code>, <code>off</code>, <code>on</code>, <code>open</code>, <code>out</code>, <code>over</code>, <code>per</code>, <code>pie</code>, <code>raising</code>, <code>start</code>, <code>teeth</code>, <code>that</code>, <code>through</code>, <code>under</code>, <code>unto</code>, <code>up</code>, <code>up-pp</code>, <code>upon</code>, <code>whole</code>, <code>with</code>, <code>you</code></td></tr><tr><td><code>SYM</code></td><td>symbol e.g. 
<code>%</code>, <code>&#x26;</code>, <code>'</code>, <code>''</code>, <code>''.</code>, <code>)</code>, <code>).</code>, <code>*</code>, <code>+</code>, <code>,.</code>, <code>&#x3C;</code>, <code>=</code>, <code>></code>, <code>@</code>, <code>A[fj]</code>, <code>U.S</code>, <code>U.S.S.R</code>, <code>*</code>, <code>**</code>, <code>***</code></td></tr><tr><td><code>TO</code></td><td>"to" as preposition or infinitive marker e.g. <code>to</code></td></tr><tr><td><code>UH</code></td><td>interjection e.g. <code>Goodbye</code>, <code>Goody</code>, <code>Gosh</code>, <code>Wow</code>, <code>Jeepers</code>, <code>Jee-sus</code>, <code>Hubba</code>, <code>Hey</code>, <code>Kee-reist</code>, <code>Oops</code>, <code>amen</code>, <code>huh</code>, <code>howdy</code>, <code>uh</code>, <code>dammit</code>, <code>whammo</code>, <code>shucks</code>, <code>heck</code>, <code>anyways</code>, <code>whodunnit</code>, <code>honey</code>, <code>golly</code>, <code>man</code>, <code>baby</code>, <code>diddle</code>, <code>hush</code>, <code>sonuvabitch</code>, ...</td></tr><tr><td><code>VB</code></td><td>verb, base form e.g. <code>ask</code>, <code>assemble</code>, <code>assess</code>, <code>assign</code>, <code>assume</code>, <code>atone</code>, <code>attention</code>, <code>avoid</code>, <code>bake</code>, <code>balkanize</code>, <code>bank</code>, <code>begin</code>, <code>behold</code>, <code>believe</code>, <code>bend</code>, <code>benefit</code>, <code>bevel</code>, <code>beware</code>, <code>bless</code>, <code>boil</code>, <code>bomb</code>, <code>boost</code>, <code>brace</code>, <code>break</code>, <code>bring</code>, <code>broil</code>, <code>brush</code>, <code>build</code>, ...</td></tr><tr><td><code>VBD</code></td><td>verb, past tense e.g. 
<code>dipped</code>, <code>pleaded</code>, <code>swiped</code>, <code>regummed</code>, <code>soaked</code>, <code>tidied</code>, <code>convened</code>, <code>halted</code>, <code>registered</code>, <code>cushioned</code>, <code>exacted</code>, <code>snubbed</code>, <code>strode</code>, <code>aimed</code>, <code>adopted</code>, <code>belied</code>, <code>figgered</code>, <code>speculated</code>, <code>wore</code>, <code>appreciated</code>, <code>contemplated</code>, ...</td></tr><tr><td><code>VBG</code></td><td>verb, present participle or gerund e.g. <code>telegraphing</code>, <code>stirring</code>, <code>focusing</code>, <code>angering</code>, <code>judging</code>, <code>stalling</code>, <code>lactating</code>, <code>hankerin'</code>, <code>alleging</code>, <code>veering</code>, <code>capping</code>, <code>approaching</code>, <code>traveling</code>, <code>besieging</code>, <code>encrypting</code>, <code>interrupting</code>, <code>erasing</code>, <code>wincing</code>, ...</td></tr><tr><td><code>VBN</code></td><td>verb, past participle e.g. <code>multihulled</code>, <code>dilapidated</code>, <code>aerosolized</code>, <code>chaired</code>, <code>languished</code>, <code>panelized</code>, <code>used</code>, <code>experimented</code>, <code>flourished</code>, <code>imitated</code>, <code>reunifed</code>, <code>factored</code>, <code>condensed</code>, <code>sheared</code>, <code>unsettled</code>, <code>primed</code>, <code>dubbed</code>, <code>desired</code>, ...</td></tr><tr><td><code>VBP</code></td><td>verb, present tense, not 3rd person singular e.g. 
<code>predominate</code>, <code>wrap</code>, <code>resort</code>, <code>sue</code>, <code>twist</code>, <code>spill</code>, <code>cure</code>, <code>lengthen</code>, <code>brush</code>, <code>terminate</code>, <code>appear</code>, <code>tend</code>, <code>stray</code>, <code>glisten</code>, <code>obtain</code>, <code>comprise</code>, <code>detest</code>, <code>tease</code>, <code>attract</code>, <code>emphasize</code>, <code>mold</code>, <code>postpone</code>, <code>sever</code>, <code>return</code>, <code>wag</code>, ...</td></tr><tr><td><code>VBZ</code></td><td>verb, present tense, 3rd person singular e.g. <code>bases</code>, <code>reconstructs</code>, <code>marks</code>, <code>mixes</code>, <code>displeases</code>, <code>seals</code>, <code>carps</code>, <code>weaves</code>, <code>snatches</code>, <code>slumps</code>, <code>stretches</code>, <code>authorizes</code>, <code>smolders</code>, <code>pictures</code>, <code>emerges</code>, <code>stockpiles</code>, <code>seduces</code>, <code>fizzes</code>, <code>uses</code>, <code>bolsters</code>, <code>slaps</code>, <code>speaks</code>, <code>pleads</code>, ...</td></tr><tr><td><code>WDT</code></td><td>WH-determiner e.g. <code>that</code>, <code>what</code>, <code>whatever</code>, <code>which</code>, <code>whichever</code></td></tr><tr><td><code>WP</code></td><td>WH-pronoun e.g. <code>that</code>, <code>what</code>, <code>whatever</code>, <code>whatsoever</code>, <code>which</code>, <code>who</code>, <code>whom</code>, <code>whosoever</code></td></tr><tr><td><code>WP$</code></td><td>WH-pronoun, possessive e.g. <code>whose</code></td></tr><tr><td><code>WRB</code></td><td>WH-adverb e.g. <code>how</code>, <code>however</code>, <code>whence</code>, <code>whenever</code>, <code>where</code>, <code>whereby</code>, <code>whereever</code>, <code>wherein</code>, <code>whereof</code>, <code>why</code></td></tr></tbody></table>

### References

* UPenn Treebank Docs <https://catalog.ldc.upenn.edu/docs/LDC99T42/>
* `python -c "import nltk; nltk.help.upenn_tagset()"`


