
Data Programming

An assisted-labeling feature that helps you generate labels using rules

Last updated 10 months ago

Introduction

Datasaur's Data Programming extension offers an advanced solution for processing large datasets. By leveraging a set of algorithms and heuristics, it automates data labeling, a task typically done manually. This is particularly beneficial for handling huge data volumes, significantly improving labeling efficiency and accuracy. The automation allows users to focus on more critical aspects of data analysis and model training, making it a key component in building high-quality ML models. You can adjust the labeling functions, define your own rules, and create patterns that help you label data quickly.

Key Features

The extension provides four key capabilities, described in detail later on this page: Labeling Functions, Labeling Function Analysis, Inter-Annotator Agreement for Labeling Functions, and the Multiple Label Template.

Supported Libraries

If you require any additional libraries, please reach out to us by contacting support@datasaur.ai.

Name            Version
pandas          1.4.4 and later
textblob        0.17.1 and later
nltk            3.7 and later
spacy           3.4.1 and later
scipy           1.9.1
numpy           1.23.3
transformers    4.28.1
requests        2.28.1 and later
datasets        2.7.0
openai          0.27.0
stanza          1.5.0 and later
spacy-fastlang  1.0.1 and later
lxml            4.9.2

Quick Start Guide

How to use Data Programming in General

  1. Add the Data Programming extension from the Manage Extension menu.

  2. The Data Programming Extension will appear on your right. Let's break down what we have here:

    • Target Question/Label Set: Choose the questions or label set that you want to target for Data Programming usage.

    • Labeling Functions: This button will take you to the Data Programming pop-up, covering Labeling Function Settings, Labeling Function Analysis, and Inter-Annotator Agreement.

    • Predict Labels: After creating your Labeling Functions, you can start predicting the answers or labels using those functions.

  3. You can create Labeling Functions by clicking the "Labeling Functions" button. This opens the Labeling Function Settings, where you can add your Labeling Functions. It also provides a code template for a Labeling Function based on your label set. Note: pay attention to the comment we've included there; start editing your logic where we write (Start editing here!). The code above that comment should not be edited.

  4. Close the Labeling Functions editor and click Predict labels in the Data Programming extension. An "Applying labeling function" loading bar will appear while Data Programming predicts the labels.
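As a sketch of what a finished row-based labeling function might look like, the example below applies a simple keyword rule. The LABELS dictionary and the keywords are illustrative assumptions; Datasaur generates the actual template from your project's label set.

```python
import re

# Hypothetical label set; Datasaur generates the real LABELS dictionary
# from your project's label set.
LABELS = {"POSITIVE": "POSITIVE", "NEGATIVE": "NEGATIVE"}

def labeling_function(sample):
    # Row-based labeling functions return a boolean: True if the rule
    # matches the row, False otherwise.
    text = " ".join(str(v) for v in sample.values())
    return bool(re.search(r"\b(great|excellent|love)\b", text, re.IGNORECASE))
```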


Edit and adjust Labeling Functions

  1. First, click the + Add button to create a Labeling Function; the Labeling Function Editor will generate a template for you.

  2. To rename a Labeling Function, click the pencil icon next to it, type the new name, then click the ✔️ button, or cancel by clicking the X button.

  3. Labeling functions can be removed one at a time or several at once. Select one or more labeling functions via the checkboxes and click the Delete button.

  4. A confirmation pop-up will ask you to confirm the deletion. After you click OK, the selected labeling functions are deleted.

  5. Use the toggle inline with each labeling function to activate or deactivate it for prediction.

Build Labeling Functions in detail

  • By default, labeling_function assigns only one label, which is defined on this line:

    # Assign target label based on LABELS dictionary
    @apply_label(label=LABELS['labelA'])
  • By default, labeling_function processes text from the columns in one row.

    ## please check <link section 'text' in gitbook> for more info
    text = list(sample.values())[0]

    If you need to process only one column, use:

    text = sample[<COLUMN_NAME>]

    If you need to process certain columns, use:

    text = ' '.join([sample[<COLUMN_NAME_A>], sample[<COLUMN_NAME_B>]])
  • A Labeling Function returns a boolean for row-based projects and a match_list for span-based projects.

    match_list is a list of matched token indices (format: [start_index, end_index]).

    For example:

    >>> text = "Russian and American Alien Spaceship Claims Raise Eyebrows"
    >>> target_token = ["Russian", "American"]
    >>> match_list = [[0,7],[12,20]]

    match_list is the list of target_token positions relative to text.

    Special notes:

    If you use regex in your logic, you can build match_list with re.finditer (each Match object's start() and end() give the span):

    # TARGET can be a keyword or a regex pattern.
    >>> match_list = [[m.start(), m.end()] for target in TARGET_KEYWORDS for m in re.finditer(target, text)]

    or

    >>> date = re.compile(r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")
    >>> PATTERNS = [date]
    >>> match_list = [[m.start(), m.end()] for pattern in PATTERNS for m in re.finditer(pattern, text)]
  • Labeling function remover (a no-op function that never assigns a label):

    • Row Labeling

      def labeling_function(sample):
          return False

    • Span Labeling

      def labeling_function(sample):
          match_list = []
          return match_list
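Putting the pieces above together, a span-based labeling function might look like the following sketch. TARGET_KEYWORDS and the single-column handling are illustrative assumptions, not Datasaur's generated template.

```python
import re

def labeling_function(sample):
    # Take the first column's text, as in the default template.
    text = list(sample.values())[0]
    # Illustrative keyword patterns; word boundaries avoid partial matches.
    TARGET_KEYWORDS = [r"\bRussian\b", r"\bAmerican\b"]
    # Span-based functions return match_list: [start, end] character
    # offsets of every match found in the text.
    match_list = [
        [m.start(), m.end()]
        for pattern in TARGET_KEYWORDS
        for m in re.finditer(pattern, text)
    ]
    return match_list
```

On the example sentence used earlier, this returns [[0, 7], [12, 20]], matching the match_list format described above.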

Labeling Functions: The Data Programming extension uses various heuristics or rules known as Labeling Functions. While these functions might not be highly accurate individually, collectively they provide better predictions than random selection. Examples of Labeling Functions can be found on the Example of Labeling Functions page. Labeling functions are written in Python using the provided template, and a set of libraries is supported by default.

Labeling Function Analysis: View the results of labeling functions, including coverage, overlaps, and conflicts, and improve performance by training the label model. Supports both span-based and row-based data. For more details, see the Labeling Function Analysis page.

Inter-Annotator Agreement for Labeling Functions: Calculate the performance of labeling functions and reviewed answers using Inter-Annotator Agreement. For more details, see the Inter-Annotator Agreement for Data Programming page.

Multiple Label Template: If you turn this on, it will create a labeling function template that can predict more than one label. By default, the Labeling Function logic predicts only one label from all defined labels. However, Datasaur also provides a multi-label labeling function template if you need logic that covers more than one label. The templates are available in the attached files below.
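As a hedged sketch of what multi-label logic can look like, the function below returns a label name instead of a fixed boolean; the LABELS and KEYWORDS dictionaries are illustrative assumptions, not Datasaur's actual template.

```python
# Illustrative label set and keyword rules; Datasaur generates the real
# multi-label template from your project's label set.
LABELS = {"SPORTS": "SPORTS", "POLITICS": "POLITICS"}

KEYWORDS = {
    "SPORTS": ["match", "goal", "tournament"],
    "POLITICS": ["election", "senate", "policy"],
}

def labeling_function(sample):
    # Check each label's keyword list against the row text and return
    # the first label that matches, or None to abstain.
    text = " ".join(str(v) for v in sample.values()).lower()
    for label, words in KEYWORDS.items():
        if any(word in text for word in words):
            return LABELS[label]
    return None
```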

Template files:
  • Default Labeling Function for Row (1).txt (583 B)
  • Default Labeling Function for Span (1).txt (653 B)
  • Multiple Labeling Function for Row.txt (775 B)
  • Multiple Labeling Function for Span.txt (978 B)