Import Transformer

Last updated 6 months ago

With Import Transformer, you can import almost anything into Datasaur. Currently, only files with a .csv, .txt, or .json extension are accepted.

Your new import transformer will have this template:

/**
 * This function should be written as this template and correctly implements ImportFunction interface.
 */
(fileContent: string): SimpleDocument => {
  /// Implement import function here
  return {
    cells: [],
    labels: [],
  };
};

The Import Transformer is a function that takes fileContent, the file's contents decoded as a UTF-8 string, and returns a SimpleDocument that Datasaur understands.
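As a rough illustration of that contract, the sketch below turns a plain text file into one cell per non-empty line. The Cell and SimpleDocument interfaces and the name importPlainText are assumptions made so the snippet runs on its own; inside Datasaur the types are provided for you, and the script is written as an anonymous arrow function as in the template.

```typescript
// Hypothetical local types that mirror the fields described on this page.
// Inside Datasaur the real SimpleDocument type is already available.
interface Cell {
  line: number;
  index: number;
  content: string;
  tokens: string[];
}

interface SimpleDocument {
  cells: Cell[];
  labels: unknown[];
}

// One cell per non-empty line, tokenized on whitespace.
// importPlainText is an illustrative name, not part of Datasaur.
const importPlainText = (fileContent: string): SimpleDocument => {
  const cells: Cell[] = fileContent
    .split(/\r?\n/) // accept CRLF or LF line endings
    .filter((text) => text.trim().length > 0)
    .map((text, line) => ({
      line, // zero-based row, counted after empty lines are dropped
      index: 0, // token-based projects use a single column
      content: text,
      tokens: text.trim().split(/\s+/),
    }));
  return { cells, labels: [] };
};

const doc = importPlainText("Hello world\n\nSecond line here\n");
console.log(doc.cells.map((c) => `${c.line}: ${c.tokens.join("|")}`));
```

Splitting on whitespace is only one possible tokenization; the tokens array can be produced however your project needs.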

SimpleDocument is an object representation of a Document in Datasaur. It is a combined type that supports both token-based labeling and row-based labeling. Below is the structure of SimpleDocument:

  • cells: an array of cells. A Datasaur document is stored in a tabular structure, and each cell represents a single cell in that table. For token-based projects, only a single-column table is supported at the moment. Each row/line of the document must have the same number of columns.

    • line: A zero-based number indicating the row

    • index: A zero-based number indicating the column.

      For token-based projects, this value can only be set to 0.

    • content: The original content of a cell

    • tokens: A tokenized version of the content. This field is used for token-based projects only.

    • metadata: an optional array of key-value data to be stored per cell. Its structure and configuration options are as follows:

      • key: Identifier for the metadata item, represented as a string. Example: "author"

      • value: Content or data of the metadata item, represented as a string. Example: "John Doe"

      • type: Optional field indicating the type of the value, given as a MIME type.

        • Default: text/plain

        • Supported type:

          • text/plain: Displays metadata as plain text.

          • text/html: Displays metadata as HTML.

          • image/*: Displays metadata as an image. The supported image formats depend on browser support.

          • audio/*: Displays metadata as an audio player. The supported audio formats depend on browser support.

      • pinned: Boolean that specifies if the metadata should be visible at the top of each cell. Metadata that isn’t pinned will still be viewable through the Metadata Extension panel.

      • config: Customizes appearance for text/plain types.

        • color: The text color of the metadata, as a string. Accepts any HTML color code or name.

        • backgroundColor: The background color of the metadata, as a string. Accepts any HTML color code or name.

        • borderColor: The border color of the metadata, as a string. Accepts any HTML color code or name.

  • labels: an array of labels

    • common fields

      • id: a unique number to identify the label. To be referred by the arrow labels.

      • startCellLine: starting line position

      • startCellIndex: starting column position

      • startTokenIndex: starting token index position, relative to cell

      • startCharIndex: starting character index position, relative to token

      • endCellLine: ending line position

      • endCellIndex: ending column position

      • endTokenIndex: ending token index position, relative to cell

      • endCharIndex: ending character index position, relative to token

      • type: type of the label. Accepts one of these values: "SPAN", "ARROW", "BOUNDING_BOX", "TIMESTAMP"

    • specific fields by its type:

      • "SPAN" or "ARROW"

        • labelSetIndex: replaces layer. Configures how the labelset items are grouped

        • labelName: replaces labelSetItemId. The text provided here will be displayed in web UI

      • "ARROW"

        • originId: id of a span label as the arrow's origin.

        • destinationId: id of a span label as the arrow's destination.

      • "BOUNDING_BOX"

        • pageIndex: page information for multiple page files, such as PDF and TIFF. Set field to 0 for common image formats, such as JPG, PNG, BMP, etc.

        • nodeCount: number of nodes. This field exists for future polygon support; for now only 4 nodes forming a rectangle are supported.

        • x0: the first node's x value in screen coordinate system.

        • y0: the first node's y value in screen coordinate system.

        • x1: the second node's x value in screen coordinate system.

        • y1: the second node's y value in screen coordinate system.

        • x2: the third node's x value in screen coordinate system.

        • y2: the third node's y value in screen coordinate system.

        • x3: the fourth node's x value in screen coordinate system.

        • y3: the fourth node's y value in screen coordinate system.

      • "TIMESTAMP"

        • startTimestampMillis: the starting timestamp in milliseconds.

        • endTimestampMillis: the ending timestamp in milliseconds.
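To make the SPAN fields above more concrete, here is a minimal sketch of a helper that labels a run of whole tokens inside one cell. The SpanLabel interface and the spanOverTokens helper are hypothetical names introduced for illustration; the sketch also assumes that endCharIndex is inclusive (the last character of the last token) and that the first label set (labelSetIndex 0) is the target.

```typescript
// Hypothetical local type; field names follow the list above.
interface SpanLabel {
  id: number;
  type: "SPAN";
  startCellLine: number;
  startCellIndex: number;
  startTokenIndex: number;
  startCharIndex: number;
  endCellLine: number;
  endCellIndex: number;
  endTokenIndex: number;
  endCharIndex: number;
  labelSetIndex: number;
  labelName: string;
}

// Builds a SPAN label covering tokens startToken..endToken (inclusive)
// of a single cell. spanOverTokens is an illustrative helper, not a Datasaur API.
function spanOverTokens(
  id: number,
  cell: { line: number; index: number; tokens: string[] },
  startToken: number,
  endToken: number,
  labelName: string,
): SpanLabel {
  if (startToken < 0 || endToken >= cell.tokens.length || startToken > endToken) {
    throw new RangeError("token range is outside the cell");
  }
  const lastToken = cell.tokens[endToken] ?? "";
  return {
    id,
    type: "SPAN",
    startCellLine: cell.line,
    startCellIndex: cell.index,
    startTokenIndex: startToken,
    startCharIndex: 0, // first character of the first token
    endCellLine: cell.line,
    endCellIndex: cell.index,
    endTokenIndex: endToken,
    endCharIndex: lastToken.length - 1, // assumed inclusive: last character of the last token
    labelSetIndex: 0, // assumption: use the first label set
    labelName,
  };
}

const cell = { line: 0, index: 0, tokens: ["Sherlock", "Holmes", "arrived"] };
const label = spanOverTokens(1, cell, 0, 1, "Person's name");
console.log(label.endTokenIndex, label.endCharIndex); // prints 1 5
```

Keeping the id unique per label matters once ARROW labels are involved, since originId and destinationId refer to span labels by id.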

Sample Case

We want to label a subtitle file in .srt format and show each subtitle's timestamp as metadata. The file transformer for this case is shown below.

/**
 * This function should be written as this template and correctly implements ImportFunction interface.
 */
(fileContent: string): SimpleDocument => {
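    /// Implement import function here
    // Subtitle blocks are separated by a blank line; accept CRLF or LF line endings.
    const lines = fileContent.split(/\r?\n\r?\n/);
    let currLine: number = 0;
    const cells: Cell[] = [];
    lines.forEach((line) => {
      // Each block is: sequence number, timestamp, then one or more subtitle lines.
      const [, timestamp, ...subtitles] = line.split(/\r?\n/);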
      subtitles.forEach((subtitle) => {
        cells.push({
          index: 0,
          line: currLine,
          content: subtitle,
          tokens: subtitle.split(' '),
          metadata: [
            {key: "timestamp", value: timestamp, pinned: true, config: { color: "#3399cc", backgroundColor: "", borderColor: "#cc3399"}}
          ]
        });
        currLine += 1;
      });
    });

    const labels: SpanAndArrowLabel[] = [];
    let labelId = 0;

    // Label the first two tokens on the second line as "Example label"
    const secondTokenOnSecondLine = cells[1].tokens[1];
    labels.push({
      id: ++labelId,
      type: "SPAN",
      startCellLine: 1,
      startCellIndex: 0,
      startTokenIndex: 0,
      startCharIndex: 0,
      endCellLine: 1,
      endCellIndex: 0,
      endTokenIndex: 1,
      endCharIndex: secondTokenOnSecondLine.length - 1,
      labelSetIndex: 0,
      labelName: "Example label"
    })

    // Label each occurrence of "Sherlock" as "Person's name".
    const sherlock = "sherlock";
    cells.forEach(cell => {
      cell.tokens.forEach((token, tokenIndex) => {
        if (token.toLowerCase() === sherlock) {
          labels.push({
            id: ++labelId, 
            type: "SPAN",
            startCellLine: cell.line,
            startCellIndex: cell.index,
            startTokenIndex: tokenIndex,
            startCharIndex: 0,
            endCellLine: cell.line,
            endCellIndex: cell.index,
            endTokenIndex: tokenIndex,
            endCharIndex: token.length - 1,
            labelSetIndex: 0,
            labelName: "Person's name",
          })
        }
      })
    })

    return {
      cells,
      labels,
    };
  };
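To see what the cell-building part of this script produces, the sketch below runs the same logic on a small hand-written subtitle string. The sampleSrt text, the srtToCells name, and the local MetadataItem and Cell interfaces are assumptions for illustration only; it also accepts both CRLF and LF line endings.

```typescript
// Hypothetical local types, reduced to the fields used here.
interface MetadataItem {
  key: string;
  value: string;
  pinned?: boolean;
}

interface Cell {
  line: number;
  index: number;
  content: string;
  tokens: string[];
  metadata: MetadataItem[];
}

// A made-up two-entry subtitle file with CRLF line endings.
const sampleSrt = [
  "1",
  "00:00:01,000 --> 00:00:04,000",
  "My name is Sherlock Holmes.",
  "",
  "2",
  "00:00:05,000 --> 00:00:07,500",
  "And this is",
  "Doctor Watson.",
].join("\r\n");

// Same idea as the script above: one cell per subtitle line,
// with the block's timestamp attached as pinned metadata.
function srtToCells(fileContent: string): Cell[] {
  const cells: Cell[] = [];
  fileContent.split(/\r?\n\r?\n/).forEach((block) => {
    // Skip the sequence number, keep the timestamp and the subtitle lines.
    const [, timestamp = "", ...subtitles] = block.split(/\r?\n/);
    subtitles.forEach((subtitle) => {
      cells.push({
        line: cells.length, // next zero-based row
        index: 0,
        content: subtitle,
        tokens: subtitle.split(" "),
        metadata: [{ key: "timestamp", value: timestamp, pinned: true }],
      });
    });
  });
  return cells;
}

const cells = srtToCells(sampleSrt);
for (const c of cells) {
  console.log(c.line, c.metadata.map((m) => m.value).join(), c.content);
}
```

A subtitle entry that spans two lines becomes two cells that share the same timestamp metadata, which is why the second entry yields two rows.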

Let's try this out!

First, rename the .srt file by appending the .txt extension, because only .csv, .txt, and .json files are accepted. You can use the sample file listed at the end of this page.

Click File Transformer, then copy and paste the script above.

After uploading the file, choose the Subtitle script from the Import File Transformer dropdown. Finish the project creation and launch the project.

Your project is ready!

Notes

  • You need to add the Metadata extension to the project.

  • If you want the metadata to be readable in the text editor, set pinned: true.

  • Use HTML color codes or names for the text color, border color, and background color.
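The notes on pinning and colors can be sketched as follows. MetadataConfig, MetadataItem, and the pinnedText helper are hypothetical names used only for this example, and the key names and color values are arbitrary.

```typescript
// Hypothetical local types that follow the metadata fields described on this page.
interface MetadataConfig {
  color?: string;
  backgroundColor?: string;
  borderColor?: string;
}

interface MetadataItem {
  key: string;
  value: string;
  type?: string;
  pinned?: boolean;
  config?: MetadataConfig;
}

// Builds a pinned plain-text metadata item with optional colors.
// pinnedText is an illustrative helper, not a Datasaur API.
function pinnedText(key: string, value: string, config: MetadataConfig = {}): MetadataItem {
  return { key, value, type: "text/plain", pinned: true, config };
}

const metadata: MetadataItem[] = [
  // Shown at the top of the cell; colors may be hex codes or color names.
  pinnedText("timestamp", "00:00:01,000 --> 00:00:04,000", {
    color: "#3399cc",
    borderColor: "rebeccapurple",
  }),
  // Not pinned: still viewable through the Metadata extension panel.
  { key: "source", value: "Sherlock Holmes (2009)" },
];

const pinnedKeys = metadata.filter((m) => m.pinned === true).map((m) => m.key);
console.log(pinnedKeys);
```

Pinning only the metadata that labelers need at a glance keeps each cell compact while leaving the rest available in the extension panel.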

If you have any questions, please reach out to support@datasaur.ai.

Sample file: Datasaur sample - Import transformer.txt (106KB), Sherlock Holmes (2009)