
Importable Format

Last updated 1 year ago

The Importable format is a JSON format used to import data into a Datasaur project.

An Importable JSON document may contain the following data structures:

  1. type: the Importable type (identified by the value "BOUNDING_BOX")

  2. cells: an array of cells, where each cell is the intersection of a row and a column

    1. content: a sentence in the cell

    2. index: the column index for the cell

    3. line: the row index for the cell

    4. metadata: additional information for a cell

    5. tokens: array of strings to define custom tokenization

  3. labelSets: the label set which is used by the project.

  4. labels: an array of labels

    1. Span label type

      1. id: A unique number

      2. type: Identified with value "SPAN"

      3. startCellLine, startCellIndex, startTokenIndex, startCharIndex: The starting position of the span label. Please refer to this for a thorough explanation of how cells, tokens, and characters are positioned.

      4. endCellLine, endCellIndex, endTokenIndex, endCharIndex: The ending position of the span label. Please refer to this for a thorough explanation of how cells, tokens, and characters are positioned.

      5. layer: The label set index to which this label belongs.

      6. counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.

      7. labelSetItemId: The ID of this label within the label set.

    2. Arrow label type

      1. id: A unique number

      2. type: Identified with value "ARROW"

      3. originId: The id of the label from which this arrow starts.

      4. destinationId: The id of the label at which this arrow ends.

      5. startCellLine, startCellIndex, startTokenIndex, startCharIndex. This is the same as the origin label's.

      6. endCellLine, endCellIndex, endTokenIndex, endCharIndex. This is the same as the destination label's.

      7. layer: The label set index to which this label belongs.

      8. counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.

      9. labelSetItemId: The ID of this label within the label set. Use an empty string if this arrow does not have a label.

    3. Bounding box label type

      1. type: identified with value "BOUNDING_BOX"

      2. startCellLine: line (row) index of the cell where the label starts

      3. startCellIndex: column index of the cell where the label starts

      4. startTokenIndex: index of the token where the label starts

      5. startCharIndex: index of the character where the label starts (relative to the token; it restarts from 0 for each new token)

      6. endCellLine: line (row) index of the cell where the label ends

      7. endCellIndex: column index of the cell where the label ends

      8. endTokenIndex: index of the token where the label ends

      9. endCharIndex: index of the character where the label ends (relative to the token)

      10. layer: the layer where the token is positioned

      11. counter: The index (0-based) of the label that is applied to the current position. If there is only one label at this position, then its value is 0.

      12. pageIndex: index of the page, if the document contains multiple pages

      13. nodeCount: total number of points in the bounding box

      14. x0: x coordinate of top left position of the bounding box

      15. y0: y coordinate of top left position of the bounding box

      16. x1: x coordinate of top right position of the bounding box

      17. y1: y coordinate of top right position of the bounding box

      18. x2: x coordinate of bottom right position of the bounding box

      19. y2: y coordinate of bottom right position of the bounding box

      20. x3: x coordinate of bottom left position of the bounding box

      21. y3: y coordinate of bottom left position of the bounding box

  5. pages: an array of page information

    1. pageIndex: index of the page, if the document contains multiple pages

    2. pageHeight: original page height in pixels

    3. pageWidth: original page width in pixels

Example (with span and arrow label type)

{
  "cells": [
    {
      "content": "The quick brown fox jumps over the lazy dog",
      "tokens": ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
      "line": 0,
      "index": 0,
      "metadata": []
    }
  ],
  "labelSets": [
    {
      "name": "Subjects",
      "index": 0,
      "labelItems": [
        { "id": "FOX_ID", "labelName": "Fox" },
        { "id": "DOG_ID", "labelName": "Dog" }
      ]
    },
    {
      "name": "Verbs",
      "index": 1,
      "labelItems": [
        { "id": "JUMP_ID", "labelName": "Jump" }
      ]
    }
  ],
  "labels": [
    {
      "id": 1,
      "type": "SPAN",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 3,
      "endCharIndex": 2,
      "layer": 0,
      "counter": 0,
      "labelSetItemId": "FOX_ID"
    },
    {
      "id": 2,
      "type": "SPAN",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 6,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 8,
      "endCharIndex": 2,
      "layer": 0,
      "counter": 0,
      "labelSetItemId": "DOG_ID"
    },
    {
      "id": 3,
      "originId": 1,
      "destinationId": 2,
      "type": "ARROW",
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 8,
      "endCharIndex": 2,
      "layer": 1,
      "counter": 0,
      "labelSetItemId": "JUMP_ID"
    }
  ],
  "name": "Example"
}

The above JSON will produce the following output:

Explanation

The above project contains 1 cell and 3 labels (2 span labels and 1 arrow label) from 2 different label sets.

Defining the Cells

Refer to this for a more thorough explanation.

There is only one Cell in the project, and for that Cell we set:

  • content as The quick brown fox jumps over the lazy dog.

  • tokens as ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"], because we use simple whitespace tokenization. You can define your own tokenization method here.

  • line as 0, since the cell is on the first row.

  • index as 0, since the cell is in the leftmost column.

  • metadata as [], since we don't display any metadata.
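The cell definition above can be sketched in Python. This is a minimal illustration, not part of the Datasaur API: the helper name make_cell is hypothetical, and it assumes the same simple whitespace tokenization used in the example.

```python
import json


def make_cell(content, line, index=0):
    """Build one Importable cell (hypothetical helper, whitespace tokenization)."""
    return {
        "content": content,
        # str.split() with no arguments splits on runs of whitespace.
        "tokens": content.split(),
        "line": line,    # row index of the cell
        "index": index,  # column index of the cell
        "metadata": [],  # no metadata is displayed in this example
    }


cell = make_cell("The quick brown fox jumps over the lazy dog", line=0)
print(json.dumps(cell, indent=2))
```

If your project needs a different tokenization, replace content.split() with your own tokenizer; the labels' token indices must then refer to that token list.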

Defining the Label Sets

Defining the Labels

The first label (Fox):

  • starts from the first character of the second token in the first cell, hence we set:

    • startCellLine as 0, since the first cell's Line is 0.

    • startCellIndex as 0, since the first cell's Index is 0.

    • startTokenIndex as 1, since the label starts from the second token.

    • startCharIndex as 0, since the label starts from the first character of the token.

  • ends at the third character of the fourth token in the first cell, hence we set:

    • endCellLine as 0

    • endCellIndex as 0

    • endTokenIndex as 3

    • endCharIndex as 2

  • comes from the first Label Set, hence we set:

    • layer as 0

    • labelSetItemId as FOX_ID

The JSON for the second label (Dog) follows the same pattern as the first label (Fox).
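As a rough sketch, the position fields of both span labels can be derived from token indices. The helper make_span_label is hypothetical; it assumes a label that starts at the first character of its first token and ends at the last character of its last token, with endCharIndex counted from 0 within that token, as in the example.

```python
def make_span_label(label_id, cell, start_token, end_token, layer, label_set_item_id):
    """Build a SPAN label covering whole tokens (hypothetical helper)."""
    tokens = cell["tokens"]
    return {
        "id": label_id,
        "type": "SPAN",
        "startCellLine": cell["line"],
        "startCellIndex": cell["index"],
        "startTokenIndex": start_token,
        "startCharIndex": 0,  # first character of the start token
        "endCellLine": cell["line"],
        "endCellIndex": cell["index"],
        "endTokenIndex": end_token,
        # 0-based index of the last character of the end token
        "endCharIndex": len(tokens[end_token]) - 1,
        "layer": layer,  # index of the label set
        "counter": 0,    # only one label at this position
        "labelSetItemId": label_set_item_id,
    }


cell = {
    "content": "The quick brown fox jumps over the lazy dog",
    "tokens": ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    "line": 0,
    "index": 0,
    "metadata": [],
}
fox = make_span_label(1, cell, start_token=1, end_token=3, layer=0, label_set_item_id="FOX_ID")
dog = make_span_label(2, cell, start_token=6, end_token=8, layer=0, label_set_item_id="DOG_ID")
```

Both calls yield the same values as the first two entries of the labels array in the example.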

The third label (Jump):

  • It originates from the first label (Fox) and ends at the second label (Dog), hence we set:

    • originId as 1, because the first label has id 1.

    • destinationId as 2, because the second label has id 2.

  • startCellLine, startCellIndex, startTokenIndex, startCharIndex are the same as the first label's (Fox).

  • endCellLine, endCellIndex, endTokenIndex, endCharIndex are the same as the second label's (Dog).

  • comes from the second Label Set, hence we set:

    • layer as 1

    • labelSetItemId as JUMP_ID
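The arrow rules above (start position copied from the origin label, end position copied from the destination label) can be sketched as follows. The helper make_arrow_label is hypothetical, and the two span labels are abbreviated copies of the ones in the example.

```python
POSITION_KEYS = ("CellLine", "CellIndex", "TokenIndex", "CharIndex")


def make_arrow_label(label_id, origin, destination, layer, label_set_item_id=""):
    """Build an ARROW label between two existing labels (hypothetical helper)."""
    arrow = {
        "id": label_id,
        "originId": origin["id"],
        "destinationId": destination["id"],
        "type": "ARROW",
    }
    for key in POSITION_KEYS:
        # Start position mirrors the origin label's start,
        # end position mirrors the destination label's end.
        arrow["start" + key] = origin["start" + key]
        arrow["end" + key] = destination["end" + key]
    arrow["layer"] = layer
    arrow["counter"] = 0
    # An empty string means the arrow has no label.
    arrow["labelSetItemId"] = label_set_item_id
    return arrow


fox = {"id": 1, "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 1, "startCharIndex": 0,
       "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 3, "endCharIndex": 2}
dog = {"id": 2, "startCellLine": 0, "startCellIndex": 0, "startTokenIndex": 6, "startCharIndex": 0,
       "endCellLine": 0, "endCellIndex": 0, "endTokenIndex": 8, "endCharIndex": 2}

jump = make_arrow_label(3, fox, dog, layer=1, label_set_item_id="JUMP_ID")
```

The result matches the third entry of the labels array in the example, including layer 1 for the Verbs label set.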

Example (with bounding box label type)

{
  "cells": [
    {
      "content": "SHIHLIN TAIWAN",
      "index": 0,
      "line": 0,
      "metadata": [],
      "tokens": [
        "SHIHLIN",
        "TAIWAN"
      ]
    },
    {
      "content": "STREET SNACKS",
      "index": 0,
      "line": 1,
      "metadata": [],
      "tokens": [
        "STREET",
        "SNACKS"
      ]
    }
  ],
  "labelSets": [],
  "labels": [
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 0,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 0,
      "endCharIndex": 6,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 130,
      "y0": 154,
      "x1": 255,
      "y1": 154,
      "x2": 255,
      "y2": 186,
      "x3": 130,
      "y3": 186
    },
    {
      "startCellLine": 0,
      "startCellIndex": 0,
      "startTokenIndex": 1,
      "startCharIndex": 0,
      "endCellLine": 0,
      "endCellIndex": 0,
      "endTokenIndex": 1,
      "endCharIndex": 5,
      "layer": 0,
      "counter": 0,
      "pageIndex": 0,
      "type": "BOUNDING_BOX",
      "nodeCount": 4,
      "x0": 261,
      "y0": 154,
      "x1": 375,
      "y1": 154,
      "x2": 375,
      "y2": 186,
      "x3": 261,
      "y3": 186
    }
  ],
  "name": "receipt.jpg",
  "pages": [
    {
      "pageIndex": 0,
      "pageHeight": 619,
      "pageWidth": 551
    }
  ],
  "type": "BOUNDING_BOX"
}
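The corner coordinates in the example follow a fixed order: top left (x0, y0), top right (x1, y1), bottom right (x2, y2), bottom left (x3, y3). A minimal sketch for producing them from a rectangle is shown here. The helper make_bounding_box_label and its left/top/right/bottom parameters are hypothetical, and it assumes an axis-aligned box around a single whole token.

```python
def make_bounding_box_label(cell, token_index, left, top, right, bottom, page_index=0):
    """Build a BOUNDING_BOX label for one token (hypothetical helper)."""
    token = cell["tokens"][token_index]
    return {
        "startCellLine": cell["line"],
        "startCellIndex": cell["index"],
        "startTokenIndex": token_index,
        "startCharIndex": 0,
        "endCellLine": cell["line"],
        "endCellIndex": cell["index"],
        "endTokenIndex": token_index,
        "endCharIndex": len(token) - 1,  # last character of the token, 0-based
        "layer": 0,
        "counter": 0,
        "pageIndex": page_index,
        "type": "BOUNDING_BOX",
        "nodeCount": 4,
        "x0": left, "y0": top,       # top left
        "x1": right, "y1": top,      # top right
        "x2": right, "y2": bottom,   # bottom right
        "x3": left, "y3": bottom,    # bottom left
    }


cell = {"content": "SHIHLIN TAIWAN", "index": 0, "line": 0, "metadata": [],
        "tokens": ["SHIHLIN", "TAIWAN"]}
shihlin = make_bounding_box_label(cell, 0, left=130, top=154, right=255, bottom=186)
taiwan = make_bounding_box_label(cell, 1, left=261, top=154, right=375, bottom=186)
```

These two calls reproduce the two labels in the example above.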

In the span and arrow example, there are 2 label sets: Subjects (index 0) and Verbs (index 1). Please refer to this for a thorough explanation of Label Set Index. Label Items within each Label Set must have a unique id; for example, the Subjects Label Set has FOX_ID and DOG_ID.
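Before importing, it may help to check these constraints locally. The sketch below is not a Datasaur tool: check_label_set_references is a hypothetical helper that assumes a label's layer refers to the index of a label set, as described in the field list, and that an empty or missing labelSetItemId means the label has no label item.

```python
def check_label_set_references(importable):
    """Return a list of problems found in labelSets and labels (hypothetical helper)."""
    problems = []
    ids_by_layer = {}
    for label_set in importable.get("labelSets", []):
        ids = [item["id"] for item in label_set["labelItems"]]
        if len(ids) != len(set(ids)):
            problems.append("duplicate label item id in label set " + label_set["name"])
        ids_by_layer[label_set["index"]] = set(ids)

    for label in importable.get("labels", []):
        item_id = label.get("labelSetItemId", "")
        if item_id == "":
            continue  # e.g. an arrow without a label, or a bounding box
        if item_id not in ids_by_layer.get(label["layer"], set()):
            problems.append(
                "label %s: %s not found in label set index %s"
                % (label.get("id"), item_id, label["layer"])
            )
    return problems


doc = {
    "labelSets": [
        {"name": "Subjects", "index": 0,
         "labelItems": [{"id": "FOX_ID", "labelName": "Fox"}, {"id": "DOG_ID", "labelName": "Dog"}]},
        {"name": "Verbs", "index": 1,
         "labelItems": [{"id": "JUMP_ID", "labelName": "Jump"}]},
    ],
    "labels": [
        {"id": 1, "type": "SPAN", "layer": 0, "labelSetItemId": "FOX_ID"},
        {"id": 3, "type": "ARROW", "layer": 1, "labelSetItemId": "JUMP_ID"},
    ],
}
ok_problems = check_label_set_references(doc)

# The same arrow placed on layer 0 is reported, because JUMP_ID belongs to Verbs (index 1).
doc_bad = dict(doc, labels=[{"id": 3, "type": "ARROW", "layer": 0, "labelSetItemId": "JUMP_ID"}])
bad_problems = check_label_set_references(doc_bad)
```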
