Project Templates

Exploring Datasaur's pre-built project templates

Last updated 7 months ago

Just like Microsoft Word or PowerPoint have templates, Datasaur has project templates that allow you to quickly get started with pre-built settings. Let's explore each one in turn.

Named Entity Recognition

Named Entity Recognition (NER), also referred to as Named Entity Extraction, is the process of identifying and extracting specific entities in a text. These entities are classified into predefined categories that represent real-world objects, such as people, organizations, and locations.

Named entities are not always single tokens. In the example, The Strand Magazine is a single entity made up of more than one token. Multi-token labeling is common in NER labeling.
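One way to picture a multi-token entity is as a span over token positions. The sketch below is only an illustration: the sentence, the `Span` class, and the `PER`/`LOC` label names are assumptions made for this example, not Datasaur's storage or export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first token in the entity
    end: int    # index one past the last token in the entity
    label: str  # hypothetical label name


def span_text(tokens, span):
    """Join the tokens covered by a span back into readable text."""
    return " ".join(tokens[span.start:span.end])


# Made-up example sentence, split on whitespace.
tokens = "Sherlock Holmes lived in London".split()

spans = [
    Span(0, 2, "PER"),  # multi-token entity: "Sherlock Holmes"
    Span(4, 5, "LOC"),  # single-token entity: "London"
]

for span in spans:
    print(f"{span_text(tokens, span)} -> {span.label}")
```

Storing the end index as exclusive keeps the span length equal to `end - start`, which makes single-token and multi-token entities look the same to downstream code.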

Part of Speech

Part of Speech (POS) tagging is the process of labeling each word in a text with its part of speech based on the context of the sentence. Knowing the role of each word helps train a model to understand the structure and meaning of a sentence.

You are welcome to define your own parts of speech for labeling.

Coreference

Coreference resolution is the task of identifying all expressions that refer to the same entity in a text. This kind of task can be beneficial for many applications, including information extraction, text summarization, question answering, and machine translation.

Coreference resolution usually involves nouns, noun phrases, proper nouns, and pronouns. In the sample text, his is a pronoun that refers to Sherlock Holmes, which is a noun phrase. Coreference resolution helps eliminate ambiguity when interpreting a document. It often requires labeling phrases first, then drawing arrows from one to another.
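The two-step workflow (label phrases, then draw arrows) can be sketched as a small data structure. This is a hedged illustration only: the sentence, the mention IDs, and the `mentions`/`arrows` names are assumptions for this example and do not reflect how Datasaur stores coreference labels.

```python
# Made-up example sentence, split on whitespace.
tokens = "Sherlock Holmes picked up his pipe".split()

# Step 1: label the phrases (mentions) as token ranges [start, end).
mentions = {
    "m1": (0, 2),  # "Sherlock Holmes"
    "m2": (4, 5),  # "his"
}

# Step 2: draw an arrow from a referring mention to the mention it refers to.
arrows = [("m2", "m1")]


def mention_text(mention_id):
    """Return the text covered by a mention."""
    start, end = mentions[mention_id]
    return " ".join(tokens[start:end])


for source, target in arrows:
    print(f"{mention_text(source)} -> {mention_text(target)}")
```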

Dependency

Dependency parsing is the task of labeling relations between words. Each relation consists of a head and a dependent. For example, in the sample sentence, Sherlock is the subject of the verb became, so became is the head and Sherlock is the dependent.
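A head and dependent relation can be written as rows of (id, form, head, relation), loosely modeled on the columns of a CoNLL-U file. The sketch below is an assumption-laden illustration: the sentence is made up, and the relation names follow Universal Dependencies conventions as an example rather than a label set prescribed by Datasaur.

```python
# Each row: (id, form, head, relation). head == 0 marks the root of the sentence.
rows = [
    (1, "Sherlock", 2, "nsubj"),   # Sherlock is the subject of "became"
    (2, "became", 0, "root"),
    (3, "a", 4, "det"),
    (4, "detective", 2, "xcomp"),
]

# Look up a word by its id so that head ids can be resolved to words.
forms = {token_id: form for token_id, form, _, _ in rows}


def describe(row):
    """Render one relation as 'head -[relation]-> dependent'."""
    _, form, head, relation = row
    if head == 0:
        return f"{form} is the root"
    return f"{forms[head]} -[{relation}]-> {form}"


for row in rows:
    print(describe(row))
```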

Document Labeling

Document labeling is the task of classifying and categorizing data. This type of labeling is different from the types discussed above, because the labeler is answering questions about the text, rather than labeling spans of tokens within the text. It can be beneficial for projects such as sentiment analysis or applying metadata to a document.

In Datasaur, document labeling can be done on a per-row basis or on a per-document basis.

You can also label images, .pdf files, and even .gif files in document labeling.

Note: when you create projects via the DOC project template, the following settings apply by default.

  • Any uploaded questions will be set as required.

  • Answer sets can be edited. If the labeler types something in the text box that does not match an existing label, they can click "Add <your answer> as a new answer".

Optical Character Recognition (OCR)

Note: when uploading pairs of OCR documents, please make sure each image file and its corresponding transcription have the same file name apart from the extension. For example, SEC.pdf and SEC.txt.
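Before uploading, the naming rule can be checked with a short script. This is a minimal sketch under stated assumptions: the `find_unpaired` helper and the `IMAGE_EXTENSIONS` list are hypothetical and chosen for illustration, so adjust the extensions to the formats your workspace actually accepts.

```python
from pathlib import Path

# Assumed set of document/image extensions; not an official list.
IMAGE_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def find_unpaired(file_names):
    """Return (images with no .txt, .txt files with no image), compared by base name."""
    images = {
        Path(name).stem
        for name in file_names
        if Path(name).suffix.lower() in IMAGE_EXTENSIONS
    }
    texts = {
        Path(name).stem
        for name in file_names
        if Path(name).suffix.lower() == ".txt"
    }
    return sorted(images - texts), sorted(texts - images)


missing_txt, missing_image = find_unpaired(
    ["SEC.pdf", "SEC.txt", "invoice.png", "notes.txt"]
)
print("Images without a transcription:", missing_txt)
print("Transcriptions without an image:", missing_image)
```

In this made-up file list, SEC.pdf and SEC.txt form a valid pair, while invoice.png and notes.txt are each reported as unpaired.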

Create your own template!

After signing up, every new user sees Datasaur's built-in project templates in their workspace. Users can also create their own templates.

First, create a project with all of its settings configured. Then, click the triple-dots on the project and choose Save as template. You can rename the template as you like, and even upload an avatar.

If you are in a team workspace and need a custom script for project creation, this will save you time!

Note: one standard industry practice for English POS is to follow the parts of speech as defined by the Penn Treebank Project.

Note: for Dependency label sets, common industry best practices include Universal Dependencies or the Stanford typed dependencies Manual.

Optical Character Recognition (OCR) is the task of translating text inside images or scanned documents into machine-readable text data. Common applications of OCR include invoices, receipts, and legal documents. Watch this video on YouTube for instructions on how to create an OCR project. In the video, the user creates an OCR project by uploading the original document with the corresponding transcription as a .txt file. Alternatively, your workspace can integrate with an OCR provider. Datasaur is agnostic toward which OCR provider you believe is best for your documents. Please let us know which OCR provider you have in mind by contacting us at Support@datasaur.ai. Once an OCR provider is integrated, it creates the transcriptions during the Project Creation process (specifically in step 2).

Sample files:

  • Datasaur sample - NER (1).tsv (836B)
  • Datasaur sample - POS.tsv (1KB)
  • Datasaur sample - COREF.txt (525B)
  • Datasaur sample - DEP.conllu (3KB)
  • Datasaur sample - DOC (1).xlsx (5KB)
  • Datasaur sample - DOC (image).zip (1MB)