
Creating a Project

Last updated 5 months ago

After signing in, you will be automatically directed to your personal workspace. If you find yourself in your personal workspace, switch to your team workspace: click your avatar in the top right, choose “Switch Workspace”, and then select your team workspace. You will be brought to the Projects page of the workspace, where the Admin of the team can create projects. On this page you can see the Project shortcuts and the list of Projects that you are working on.

You can create a Project by clicking the Create project button, which lets you create any type of project. You can also create a project by selecting one of the Project Template shortcuts, which come with pre-selected settings for a specific use case. In this article, we will walk through creating a new project.

Ready to make your first project? We are going to create a Token-Based project together (span-based labeling). This type of project supports use cases such as Named Entity Recognition, Part-of-Speech tagging, and more. Any workflow that requires labeling specific words or phrases can be done with Datasaur’s Token-Based project. If you would like a tutorial on creating a project for Row-based projects (textual classification), Audio, OCR, Bounding Box, or Document-Based Projects, please watch their corresponding YouTube videos. Once you have selected “Create Project”, you will find yourself in the Project Creation Wizard.

Project Creation Wizard

The Project Creation Wizard is a tool for creating custom Projects. It has five basic steps: Upload, Preview, Labeler's tasks, Assignment, and Project settings.

You can see a list of the file formats that Datasaur natively supports for each project type by expanding the 'Supported File Types' section.

As an example, we will create a Span Labeling project by uploading several .txt files.

When uploading multiple files, ensure they are all in the same file format.
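A quick local check before uploading can catch mixed formats or oversized files early. The sketch below is only an illustration: `check_upload_batch` is a hypothetical helper, not part of Datasaur, and it assumes the 50 MB limit noted for Step 1 means 50 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Assumption: the documented "50 MB" limit is interpreted as 50 MiB.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def check_upload_batch(paths):
    """Return a list of problems found; an empty list means the batch looks fine."""
    problems = []

    # All files in one upload should share the same format (extension).
    extensions = {Path(p).suffix.lower() for p in paths}
    if len(extensions) > 1:
        problems.append(f"mixed file formats: {sorted(extensions)}")

    # Flag any existing file that is larger than the size limit.
    for p in paths:
        path = Path(p)
        if path.exists() and path.stat().st_size > MAX_FILE_SIZE_BYTES:
            problems.append(f"{path.name} exceeds the 50 MB limit")

    return problems
```

Calling `check_upload_batch(["notes.txt", "labels.csv"])` reports the mixed formats, while a batch of `.txt` files under the limit returns an empty list.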

Add Project Tags

You can also add one or more project tags by selecting from the available tags or creating a new one.

In this step (Step 2: Preview), we decide two options for our data: the line separator and the tokenizer.

Line Separator

The Line Separator decides how the rows in the labeling interface are split. There are two native options available:

  1. New Line creates a new row for every line in your original data.

  2. Dot (.) creates a new row after each “.” in your data.
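The two line separator options can be pictured with a small sketch. This is not Datasaur's implementation: `split_rows` and its `separator` values are hypothetical names, and the Dot mode here is deliberately naive (it also splits on the dot in a number such as "3.14" and leaves embedded newlines alone, since the text above does not say how those cases are handled).

```python
def split_rows(text, separator="new_line"):
    """Split raw text into rows, mimicking the two options described above."""
    if separator == "new_line":
        # One row per line of the original data.
        return text.splitlines()

    if separator == "dot":
        # Start a new row after every "." character.
        rows = []
        current = ""
        for ch in text:
            current += ch
            if ch == ".":
                rows.append(current.strip())
                current = ""
        # Keep any trailing text that does not end with a dot.
        if current.strip():
            rows.append(current.strip())
        return rows

    raise ValueError(f"unknown separator: {separator!r}")


print(split_rows("First line\nSecond line", "new_line"))
print(split_rows("Hello world. How are you. Fine", "dot"))
```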

Tokenizer

Datasaur offers two options for your tokenizer: whitespace and wink.
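To get a feel for the difference, the sketch below contrasts a plain whitespace split with a split that separates punctuation from letters. The second function is only a rough stand-in for the wink tokenizer: it separates every punctuation mark, whereas wink is described as separating only certain ones, so treat it as an approximation. Both function names are hypothetical.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()


def punctuation_aware_tokenize(text):
    """Rough approximation of a wink-style split (assumption, not the real
    wink tokenizer): runs of word characters and single punctuation marks
    become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Hello, world!"
print(whitespace_tokenize(sentence))         # ['Hello,', 'world!']
print(punctuation_aware_tokenize(sentence))  # ['Hello', ',', 'world', '!']
```

The choice matters for span labeling: with the whitespace option a label on "Hello," includes the comma, while a punctuation-separating tokenizer lets you label "Hello" on its own.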

For Step 3 (Labeler's tasks): since we uploaded .txt files earlier, the available task types are Span and Document Labeling. In this case, we will choose Span Labeling, so we need to provide the labels to be used later in the project.

There are three different ways to upload or create our labels.

  1. Create Labels in the UI: Select 'Create your own' to begin typing in your labels manually. You can also choose the color for each of your labels.

  2. Upload Labels from File: Select the white space to upload a CSV of your labels from your local drive. The CSV format is very simple: your first label goes in cell A1, and your subsequent labels go down the A column (A2, A3, A4, A5, etc.).

  3. Upload Labels from Your Team’s Saved Library: Your team workspace has a page called Label Management, where you can create, edit, and delete label sets. This lets your team save all of its label sets, so you do not have to re-upload or re-create a label set each time you create a new project.
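For the CSV option, the file is just one label per row in the first column. A minimal sketch of producing such a file is shown here; the helper name `labels_to_csv`, the label names, and the output filename are made up for illustration.

```python
import csv
import io


def labels_to_csv(labels):
    """Return CSV text with one label per row in the first column (A1, A2, ...)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label in labels:
        writer.writerow([label])
    return buffer.getvalue()


# Example labels are placeholders; use your own label names.
csv_text = labels_to_csv(["PERSON", "ORGANIZATION", "LOCATION"])
print(csv_text)

# To save it for upload (hypothetical filename):
# with open("labels.csv", "w", newline="") as f:
#     f.write(csv_text)
```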

Configuring Span Labeling Settings

At the bottom of the page, you'll see a section called 'Span Labeling' where you will be able to configure several things.

Limit selection to a span of 1 token: each selection is restricted to a single token. This is useful when you need every token in the document labeled.

Spans should have at most one label: assigning multiple labels to a single token or span of tokens will not be allowed.

Allow arrows to be drawn between labels: allows you to draw arrows from one label to another to annotate relationships between words. For example, this is useful for showing that an adjective is related to a noun, or a pronoun is referring to a person.

Default text selection: select whether labelers will perform a token or character selection for their labels. For example, labelers can apply labels to whole words at once (token), or they can label the individual letters within a word (character). Note: Some languages may require you to change the selection to character selection, e.g. Mandarin, Korean, or Thai.
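The first two settings above act as constraints on the spans labelers create. The sketch below shows how such constraints could be checked on exported data. It is purely illustrative: the `(start_token, end_token, label)` tuple format and the `find_violations` helper are assumptions, not Datasaur's data model, and the one-label check only compares spans with identical boundaries.

```python
from collections import defaultdict


def find_violations(spans, single_token_only=False, one_label_per_span=False):
    """Check spans given as (start_token, end_token, label), end inclusive."""
    violations = []
    labels_by_span = defaultdict(set)

    for start, end, label in spans:
        # "Limit selection to a span of 1 token"
        if single_token_only and end != start:
            violations.append(f"span {start}-{end} covers more than one token")
        labels_by_span[(start, end)].add(label)

    # "Spans should have at most one label"
    if one_label_per_span:
        for (start, end), labels in labels_by_span.items():
            if len(labels) > 1:
                violations.append(
                    f"span {start}-{end} has multiple labels: {sorted(labels)}"
                )

    return violations


example = [(0, 0, "PER"), (1, 2, "ORG"), (1, 2, "LOC")]
print(find_violations(example, one_label_per_span=True))
```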

In this step (Step 4: Assignment), we choose who will be a labeler and who will be a reviewer. When assigning personnel, you will have three roles available: Labeler, Labeler & Reviewer, and Reviewer.

Admins will have only two choices: Labeler & Reviewer, and Reviewer. An admin will always have, at a minimum, access to Reviewer Mode for any project.

Peer Review Consensus

Here we set how many labelers need to agree on a label for it to be automatically accepted by Datasaur. The Peer review consensus slider sets this threshold. For highly sensitive projects where there is no room for error, you may want to require unanimity from all assigned labelers. For less sensitive projects where efficiency and cost are more important than accuracy, a majority vote may be sufficient. Any label that does not meet the threshold will need to be manually reviewed by you, the project creator / reviewer.

If you check No consensus, all of your labelers’ labels will be treated as conflicting labels.
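The threshold logic can be sketched in a few lines. This is an illustration of the idea only, not Datasaur's actual algorithm: `resolve_consensus` is a hypothetical function, and the rule that two labels both reaching the threshold count as a conflict is an assumption.

```python
from collections import Counter


def resolve_consensus(labels_by_labeler, threshold):
    """Return the auto-accepted label for one span, or None if it needs review.

    labels_by_labeler: dict mapping labeler name -> label applied to the span.
    threshold: minimum number of agreeing labelers; None models "No consensus".
    """
    if threshold is None or not labels_by_labeler:
        # "No consensus": every label is treated as conflicting.
        return None

    counts = Counter(labels_by_labeler.values())
    winners = [label for label, votes in counts.items() if votes >= threshold]

    # Assumption: more than one label reaching the threshold is still a conflict.
    return winners[0] if len(winners) == 1 else None


votes = {"alice": "PER", "bob": "PER", "carol": "ORG"}
print(resolve_consensus(votes, threshold=2))  # PER (majority is enough)
print(resolve_consensus(votes, threshold=3))  # None (unanimity required)
```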

In this step (Step 5: Configuring project settings), we choose some final, advanced admin settings for the project.

Keep in mind that most of these choices are intended for advanced requirements.

Labeling Settings

Label set modification allows your labelers to add, edit, or remove labels in the project through the labeling interface.

Text modification enables your labelers to edit the text of the dataset.

Mask Personally Identifiable Information (PII) allows the admin to decide whether sensitive information should be covered by asterisks or random characters. It also allows the admin to decide what type of information should be masked (for example: addresses, social security numbers, company names, etc.).
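To illustrate the two masking styles, here is a minimal sketch that masks one kind of PII. How Datasaur detects PII is not described here, so the regular expression (a US-style social security number pattern), the `mask_pii` function, and its `style` values are all assumptions made for the example.

```python
import random
import re
import string

# Assumption: a simple US-style SSN pattern stands in for real PII detection.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_pii(text, pattern=SSN_PATTERN, style="asterisk", rng=None):
    """Replace each match with asterisks or random characters of equal length."""
    rng = rng or random.Random()
    alphabet = string.ascii_letters + string.digits

    def replace(match):
        value = match.group(0)
        if style == "asterisk":
            return "*" * len(value)
        # Any other style: random characters of the same length.
        return "".join(rng.choice(alphabet) for _ in value)

    return pattern.sub(replace, text)


print(mask_pii("SSN: 123-45-6789"))
print(mask_pii("SSN: 123-45-6789", style="random"))
```

Both styles keep the masked text the same length as the original, so token and character offsets of existing labels stay valid in this sketch.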

Allow marking unapplied label classes as N/A will present all the labels that were not applied in the project and allow the labelers to mark them as not applicable (N/A).

Reviewing Settings

Show labeler names in Review Mode: if you would like to reduce the chance of bias, you can choose not to show labeler names in Reviewer Mode.

Show rejected labels in Review Mode allows reviewers to see all labels that they have rejected.

Show labels from inactive label set in Review Mode: if your project has multiple label sets, this enables the reviewers to see the labels from every label set all at once.

Show original sentences in Review Mode means that the reviewers will see all the original sentences compared to any edits the labelers have made.

Set notification for labeler's project completion: by default, reviewers will be notified that the project is ready when all of the labelers have marked their work as complete. This slider allows you to manually set how many completions will trigger the email notification.

At this point, we have finished configuring the project and can click the 'Launch Project' button to create the project. Happy Labeling!

Step 1: Upload your files (Video Tutorial)

Files can be uploaded in three ways: by dragging and dropping, browsing files from your hard drive, or fetching files through external object storage. Note: the maximum file size allowed is 50 MB. If you are interested in creating a project via the API, you can find the documentation here.

If you forget to select the project tags at this step, you can also add them through the project management page. Simply follow the steps outlined here.

Step 2: Preview the uploaded file (Video Tutorial)

Step 3: Labeler's tasks (Video Tutorial)

In this step, you must choose which task type you would like to work on. A detailed explanation of each task type can be found here.

Step 4: Assignment (Video Tutorial)

Allow dynamic review assignment lets you automatically assign a team member as a reviewer when the labelers have conflicts in a project. Detailed information can be found here.

Step 5: Configuring project settings (Video Tutorial)

Tokenizer options:

  1. Wink separates certain punctuation marks from the adjacent letters.

  2. Whitespace is akin to natural language: punctuation stays joined to the adjacent letters.