NLTK

Last updated 1 month ago

Supported Labeling Types: Span Labeling

NLTK (Natural Language Toolkit) is an open-source Python library for natural language processing (NLP). It provides tools for text preprocessing such as tokenization, stemming, lemmatization, part-of-speech tagging, and more. In the context of our labeling platform, NLTK can be integrated to support various preprocessing tasks that help improve label consistency and model training quality. Its ease of use and rich set of linguistic resources make it a useful option for preparing and analyzing text data before or during the labeling process.

Model Details

  • NLTK POS tagging is performed with nltk.pos_tag, which internally uses nltk.tag.PerceptronTagger, a greedy averaged perceptron tagger for English part-of-speech tagging.

  • The underlying models are trained on the Wall Street Journal section of the Penn Treebank, providing strong performance on formal, edited text common in business documents.

  • The tagger assigns grammatical categories to words based on the UPenn Treebank Tagset, which includes categories like nouns, verbs, adjectives, adverbs, and more.

  • Fully integrated into the Datasaur Intelligence container for consistent, dependency-free operation.
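
As a rough illustration of the tagging step described above, the sketch below calls nltk.pos_tag on whitespace-split tokens and maps the result back to character offsets, which is the shape a span label usually needs. The helper names (tag_text, to_spans) and the span dictionary layout are assumptions made for this example, not Datasaur's internal format. The tagger model has to be downloaded before tag_text will work, and the resource name differs between NLTK releases.

```python
def tag_text(text):
    """Tag whitespace-separated tokens with NLTK's default English tagger."""
    # Imported lazily so the span helper below also works without NLTK installed.
    import nltk

    # Requires the perceptron tagger model, e.g. via
    # nltk.download("averaged_perceptron_tagger"); newer NLTK releases
    # name the resource "averaged_perceptron_tagger_eng".
    return nltk.pos_tag(text.split())


def to_spans(text, tagged):
    """Map (token, tag) pairs back to character offsets in `text`.

    Hypothetical span layout: {"start", "end", "text", "label"}.
    """
    spans, cursor = [], 0
    for token, tag in tagged:
        start = text.find(token, cursor)
        if start == -1:
            continue  # token not found verbatim in the text; skip it
        end = start + len(token)
        spans.append({"start": start, "end": end, "text": token, "label": tag})
        cursor = end
    return spans


if __name__ == "__main__":
    sentence = "Datasaur labels text quickly"
    # Hard-coded tags so the example runs without the model; swap in
    # tag_text(sentence) once the NLTK resources are available.
    example = [("Datasaur", "NNP"), ("labels", "VBZ"), ("text", "NN"), ("quickly", "RB")]
    for span in to_spans(sentence, example):
        print(span)
```

Searching from a moving cursor rather than from the start of the string keeps repeated tokens (for example two occurrences of "the") mapped to their own positions.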

Usage

  • Text preprocessing for consistency and model training improvements.

  • Supports syntactic analysis in annotation workflows.
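
One simple form of syntactic analysis is to collapse the fine-grained tags into coarse word classes and count them per document. The sketch below is only an illustration: the prefix-to-category mapping and the function names are assumptions chosen for this example, and the input is a list of (token, tag) pairs in the shape nltk.pos_tag returns.

```python
from collections import Counter

# Coarse groups keyed by Penn Treebank tag prefix (an illustrative choice).
COARSE_PREFIXES = (
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
)


def coarse_category(tag):
    """Return a coarse word class for a Penn Treebank tag."""
    for prefix, name in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return name
    return "other"


def category_counts(tagged):
    """Count coarse word classes in a list of (token, tag) pairs."""
    return Counter(coarse_category(tag) for _, tag in tagged)


if __name__ == "__main__":
    tagged = [("The", "DT"), ("quick", "JJ"), ("foxes", "NNS"),
              ("ran", "VBD"), ("away", "RB"), ("up", "RP")]
    print(category_counts(tagged))
```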

References

Appendix

NLTK Treebank

$

dollar e.g. $, -$, --$, A$, C$, HK$, M$, NZ$, S$, U.S.$, US$

''

closing quotation mark e.g. ', ''

(

opening parenthesis e.g. (, [, {

)

closing parenthesis e.g. ), ], }

,

comma e.g. ,

--

dash e.g. --

.

sentence terminator e.g. ., !, ?

:

colon or ellipsis e.g. :, ;, ...

``

opening quotation mark e.g. `, ``
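
Because punctuation and symbols receive their own tags, they can be filtered out before word-level labels are created. The following is a minimal sketch under that assumption; the constant and function names are hypothetical, and the input is a list of (token, tag) pairs in the shape nltk.pos_tag returns.

```python
# Tags from the punctuation table above; ")" is included as the
# counterpart of "(" in the Penn Treebank tagset.
NON_WORD_TAGS = frozenset({"$", "''", "(", ")", ",", "--", ".", ":", "``"})


def drop_non_word_tokens(tagged):
    """Keep only (token, tag) pairs whose tag is not punctuation or a symbol."""
    return [(token, tag) for token, tag in tagged if tag not in NON_WORD_TAGS]


if __name__ == "__main__":
    tagged = [("``", "``"), ("Hello", "UH"), (",", ","), ("world", "NN"),
              ("!", "."), ("''", "''")]
    print(drop_non_word_tokens(tagged))
```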

Treebank Tagset

Tag
Description

CC

conjunction, coordinating e.g. &, 'n, and, both, but, either, et, for, less, minus, neither, nor, or, plus, so, therefore, times, v., versus, vs., whether, yet

CD

numeral, cardinal e.g. mid-1890, nine-thirty, forty-two, one-tenth, ten, million, 0.5, one, forty-, seven, 1987, twenty, '79, zero, two, 78-degrees, eighty-four, IX, '60s, .025, fifteen, 271,124, dozen, quintillion, DM2,000, ...

DT

determiner e.g. all, an, another, any, both, del, each, either, every, half, la, many, much, nary, neither, no, some, such, that, the, them, these, this, those

EX

existential there e.g. there

FW

foreign word e.g. gemeinschaft, hund, ich, jeux, habeas, Haementeria, Herr, K'ang-si, vous, lutihaw, alai, je, jour, objets, salutaris, fille, quibusdam, pas, trop, Monte, terram, fiche, oui, corporis, ...

IN

preposition or conjunction, subordinating e.g. astride, among, uppon, whether, out, inside, pro, despite, on, by, throughout, below, within, for, towards, near, behind, atop, around, if, like, until, below, next, into, if, beside, ...

JJ

adjective or numeral, ordinal e.g. third, ill-mannered, pre-war, regrettable, oiled, calamitous, first, separable, ectoplasmic, battery-powered, participatory, fourth, still-to-be-named, multilingual, multi-disciplinary, ...

JJR

adjective, comparative e.g. bleaker, braver, breezier, briefer, brighter, brisker, broader, bumper, busier, calmer, cheaper, choosier, cleaner, clearer, closer, colder, commoner, costlier, cozier, creamier, crunchier, cuter, ...

JJS

adjective, superlative e.g. calmest, cheapest, choicest, classiest, cleanest, clearest, closest, commonest, corniest, costliest, crassest, creepiest, crudest, cutest, darkest, deadliest, dearest, deepest, densest, dinkiest, ...

LS

list item marker e.g. A, A., B, B., C, C., D, E, F, First, G, H, I, J, K, One, SP-44001, SP-44002, SP-44005, SP-44007, Second, Third, Three, Two, *, a, b, c, d, first, five, four, one, six, three, two

MD

modal auxiliary e.g. can, cannot, could, couldn't, dare, may, might, must, need, ought, shall, should, shouldn't, will, would

NN

noun, common, singular or mass e.g. common-carrier, cabbage, knuckle-duster, Casino, afghan, shed, thermostat, investment, slide, humour, falloff, slick, wind, hyena, override, subhumanity, machinist, ...

NNP

noun, proper, singular e.g. Motown, Venneboerger, Czestochwa, Ranzer, Conchita, Trumplane, Christos, Oceanside, Escobar, Kreisler, Sawyer, Cougar, Yvette, Ervin, ODI, Darryl, CTCA, Shannon, A.K.C., Meltex, Liverpool, ...

NNPS

noun, proper, plural e.g. Americans, Americas, Amharas, Amityvilles, Amusements, Anarcho-Syndicalists, Andalusians, Andes, Andruses, Angels, Animals, Anthony, Antilles, Antiques, Apache, Apaches, Apocrypha, ...

NNS

noun, common, plural e.g. undergraduates, scotches, bric-a-brac, products, bodyguards, facets, coasts, divestitures, storehouses, designs, clubs, fragrances, averages, subjectivists, apprehensions, muses, factory-jobs, ...

PDT

pre-determiner e.g. all, both, half, many, quite, such, sure, this

POS

genitive marker e.g. ', 's

PRP

pronoun, personal e.g. hers, herself, him, himself, hisself, it, itself, me, myself, one, oneself, ours, ourselves, ownself, self, she, thee, theirs, them, themselves, they, thou, thy, us

PRP$

pronoun, possessive e.g. her, his, mine, my, our, ours, their, thy, your

RB

adverb e.g. occasionally, unabatingly, maddeningly, adventurously, professedly, stirringly, prominently, technologically, magisterially, predominately, swiftly, fiscally, pitilessly, ...

RBR

adverb, comparative e.g. further, gloomier, grander, graver, greater, grimmer, harder, harsher, healthier, heavier, higher, however, larger, later, leaner, lengthier, less-, perfectly, lesser, lonelier, longer, louder, lower, more, ...

RBS

adverb, superlative e.g. best, biggest, bluntest, earliest, farthest, first, furthest, hardest, heartiest, highest, largest, least, less, most, nearest, second, tightest, worst

RP

particle e.g. aboard, about, across, along, apart, around, aside, at, away, back, before, behind, by, crop, down, ever, fast, for, forth, from, go, high, i.e., in, into, just, later, low, more, off, on, open, out, over, per, pie, raising, start, teeth, that, through, under, unto, up, up-pp, upon, whole, with, you

SYM

symbol e.g. %, &, ', '', ''., ), )., *, +, ,., <, =, >, @, A[fj], U.S, U.S.S.R, *, **, ***

TO

"to" as preposition or infinitive marker e.g. to

UH

interjection e.g. Goodbye, Goody, Gosh, Wow, Jeepers, Jee-sus, Hubba, Hey, Kee-reist, Oops, amen, huh, howdy, uh, dammit, whammo, shucks, heck, anyways, whodunnit, honey, golly, man, baby, diddle, hush, sonuvabitch, ...

VB

verb, base form e.g. ask, assemble, assess, assign, assume, atone, attention, avoid, bake, balkanize, bank, begin, behold, believe, bend, benefit, bevel, beware, bless, boil, bomb, boost, brace, break, bring, broil, brush, build, ...

VBD

verb, past tense e.g. dipped, pleaded, swiped, regummed, soaked, tidied, convened, halted, registered, cushioned, exacted, snubbed, strode, aimed, adopted, belied, figgered, speculated, wore, appreciated, contemplated, ...

VBG

verb, present participle or gerund e.g. telegraphing, stirring, focusing, angering, judging, stalling, lactating, hankerin', alleging, veering, capping, approaching, traveling, besieging, encrypting, interrupting, erasing, wincing, ...

VBN

verb, past participle e.g. multihulled, dilapidated, aerosolized, chaired, languished, panelized, used, experimented, flourished, imitated, reunifed, factored, condensed, sheared, unsettled, primed, dubbed, desired, ...

VBP

verb, present tense, not 3rd person singular e.g. predominate, wrap, resort, sue, twist, spill, cure, lengthen, brush, terminate, appear, tend, stray, glisten, obtain, comprise, detest, tease, attract, emphasize, mold, postpone, sever, return, wag, ...

VBZ

verb, present tense, 3rd person singular e.g. bases, reconstructs, marks, mixes, displeases, seals, carps, weaves, snatches, slumps, stretches, authorizes, smolders, pictures, emerges, stockpiles, seduces, fizzes, uses, bolsters, slaps, speaks, pleads, ...

WDT

WH-determiner e.g. that, what, whatever, which, whichever

WP

WH-pronoun e.g. that, what, whatever, whatsoever, which, who, whom, whosoever

WP$

WH-pronoun, possessive e.g. whose

WRB

WH-adverb e.g. how, however, whence, whenever, where, whereby, whereever, wherein, whereof, why
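
When tagger output is reviewed, it can help to show a readable description next to each tag. The sketch below hard-codes a handful of descriptions copied from the table above; the dictionary name, the describe helper, and the "unknown tag" fallback are assumptions for illustration only, and a real lookup would cover the full tagset.

```python
# A small subset of the tagset table above, for illustration only.
TAG_DESCRIPTIONS = {
    "CC": "conjunction, coordinating",
    "DT": "determiner",
    "JJ": "adjective or numeral, ordinal",
    "NN": "noun, common, singular or mass",
    "NNP": "noun, proper, singular",
    "RB": "adverb",
    "VBG": "verb, present participle or gerund",
    "VBZ": "verb, present tense, 3rd person singular",
}


def describe(tagged, descriptions=TAG_DESCRIPTIONS):
    """Attach a description to each (token, tag) pair."""
    return [
        (token, tag, descriptions.get(tag, "unknown tag"))
        for token, tag in tagged
    ]


if __name__ == "__main__":
    tagged = [("Datasaur", "NNP"), ("labels", "VBZ"), ("text", "NN"), ("to", "TO")]
    for token, tag, description in describe(tagged):
        print(f"{token}\t{tag}\t{description}")
```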

References

  • Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal. Implementation details: https://explosion.ai/blog/part-of-speech-pos-tagger-in-python

  • Tag set: UPenn Treebank Tag Set. To print the full tag list locally, run python -c "import nltk; nltk.help.upenn_tagset()"

  • https://www.nltk.org/book/ch05.html

  • https://www.nltk.org/api/nltk.tag.html

  • https://www.nltk.org/api/nltk.tag.perceptron.html

  • UPenn Treebank Docs: https://catalog.ldc.upenn.edu/docs/LDC99T42/