Data Programming

Programmatically label your data with fully customizable labeling functions


Overview

Datasaur's Data Programming feature simplifies processing large datasets by automating the labeling process with user-defined rules. With Data Programming, users define annotation rules as Python heuristics, which are then executed in the background to return predicted labels. These heuristics are called Labeling Functions. The feature is designed for Subject Matter Experts who understand the context and patterns of their datasets and are comfortable expressing those patterns in Python.
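
For orientation, a single-label labeling function generally has the shape sketched below, mirroring the examples later on this page. LABELS and the @target_label decorator are provided by the Data Programming template; the 'positive' label name and the keyword check are placeholders only:

@target_label(label=LABELS['positive'])
def label_function(sample) -> bool:
    # `sample` maps column names to cell text; take the first column.
    text = list(sample.values())[0]

    # Return True to assign the target label to this row, False to skip it.
    return 'refund' in text.lower()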

Use case

Data Programming is suitable for labeling datasets that follow certain patterns, such as a unified format or specific recurring keywords. Here are some examples of use cases you can tackle with Data Programming:

Sales Report Satisfaction Scores

In this example, labeling functions are applied to classify satisfaction scores in a sales report. Each row in the dataset contains text describing overall satisfaction, Net Promoter Score (NPS), product quality, and customer support feedback in a recurring format.

Labeling Function: To categorize satisfaction scores, a labeling function scans for keywords like "satisfaction" and predefined score ranges.

  • For instance, if a satisfaction score is 4 out of 5 or higher and the NPS has increased, the entry is labeled as "success." Conversely, lower scores or negative feedback result in an "unsuccess" label.

This can be implemented in a labeling function as follows:

@target_label(label=LABELS['success'])
def label_function(sample) -> bool:
    text = list(sample.values())[0]

    # Implement logic here
    if "Overall Satisfaction" in text:
        satisfaction_percentage = int(text.split(": ")[1].split("%")[0])
        if satisfaction_percentage >= 80:
            return True
    return False
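
The "unsuccess" case mentioned above can be covered by a second labeling function. The sketch below is one possible counterpart, assuming the same "Overall Satisfaction: NN%" text format and the same 80% threshold:

@target_label(label=LABELS['unsuccess'])
def label_function(sample) -> bool:
    text = list(sample.values())[0]

    # Label rows whose satisfaction percentage falls below the threshold.
    if "Overall Satisfaction" in text:
        satisfaction_percentage = int(text.split(": ")[1].split("%")[0])
        if satisfaction_percentage < 80:
            return True
    return False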

Slang and Abbreviation Replacement

In many datasets, especially those derived from social media or other informal communication channels, slang and abbreviations are frequently encountered. These can be replaced with their full meanings.

Labeling Function: To replace slang and abbreviations with their full meanings, a labeling function can utilize regex patterns to identify and replace these terms using matching keywords.

  • For instance, common slang such as "cuz" can be identified and replaced with "because," and abbreviations like "ngl" can be replaced with their full forms like "not going to lie".

The logic is as follows:

First, define the slang terms and abbreviations to match, keyed by the label (the full meaning) each one should map to.

# Implement your logic here
DICT_KEYWORDS = {
    LABELS['i\'m not lying'] : ['no cap'],
    LABELS['a feeling (may be good or bad)'] : ['vibe'],
    LABELS['about'] : ['abt', 'bout'],
    LABELS['already'] : ['alr'],
    LABELS['alright'] : ['ight'],
    LABELS['and'] : ['n'],
    LABELS['as fuck'] : ['asf'],
    LABELS['ass'] : ['ahh'],
    LABELS['at the moment'] : ['atm'],
}

Then scan the text for slang words and abbreviations, collect the matches, and assign the appropriate labels based on the context and specific prefixes in the text.

for label, target_keywords in DICT_KEYWORDS.items():
    # please refer to https://docs.python.org/3/library/re.html#re.finditer for more info
    match_list = []
    for target in target_keywords:
        # r'\b...\b' matches the slang term as a whole word, not as part of a word.
        # For example, it keeps 'u', 'n', and 'w' inside the word "unanswered" from being
        # detected as slang, while 'u' in "I bet u wont come" is still matched.
        # '+' allows a repeated trailing letter; so far we have not seen repeated
        # single-character slang (u/n/w), so those use the plain pattern.

        if len(target) == 1:
            pattern = r'\b{}\b'.format(target)
        else:
            pattern = r'\b{}+\b'.format(target)

        match_list.append(re.finditer(pattern, text, re.IGNORECASE))

    if label == LABELS['something'] and match_list:
        # "sum" may mean "something" or "someone"; disambiguate using the preceding word.
        prefix_something = ['a', 'the', 'say', 'saying', 'said', 'think', 'thinking', 'thought']

        if any(pref + ' sum' in text for pref in prefix_something):
            PREDICTED_LABEL_COLLECTION[LABELS['something']] = match_list
        else:
            PREDICTED_LABEL_COLLECTION[LABELS['someone']] = match_list
    else:
        PREDICTED_LABEL_COLLECTION[label] = match_list

return PREDICTED_LABEL_COLLECTION
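
Put together, the keyword dictionary and the matching loop live inside a single labeling function that uses the multiple label template. The sketch below shows one way the pieces could be assembled; the dictionary is abbreviated and the "something"/"someone" prefix handling is omitted, and LABELS, re, and @target_label are provided by the template:

@target_label()
def label_function(sample):
    PREDICTED_LABEL_COLLECTION = {}

    text = list(sample.values())[0]

    # Abbreviated; extend with the full slang dictionary defined above.
    DICT_KEYWORDS = {
        LABELS['about']: ['abt', 'bout'],
        LABELS['already']: ['alr'],
    }

    for label, target_keywords in DICT_KEYWORDS.items():
        match_list = []
        for target in target_keywords:
            # Word boundaries keep single letters inside other words from matching.
            if len(target) == 1:
                pattern = r'\b{}\b'.format(target)
            else:
                pattern = r'\b{}+\b'.format(target)
            match_list.append(re.finditer(pattern, text, re.IGNORECASE))
        PREDICTED_LABEL_COLLECTION[label] = match_list

    return PREDICTED_LABEL_COLLECTION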

Extracting predefined keywords from text

Labeling functions can also label specific keywords, identifying and extracting relevant information to simplify the data extraction process. For example, within a text, the labeling function can be configured to label only "LOC" for terms such as "New York", "New Jersey", and "Connecticut", and "Person" only for the keywords "Meteorologists" and "Residents". This customization is achieved by modifying the DICT_KEYWORDS dictionary to meet your precise requirements.

Here are the steps to change the labeling function:

  1. Toggle the "Multiple Label Template" on:

    • Enable the multiple label template to allow for the assignment of multiple labels within the function.

  2. Keep @target_label() empty:

    • Leave the @target_label() decorator empty because multiple labels are being targeted, and specifying a single target label is not necessary.

  3. Change the DICT_KEYWORDS:

    • Define a dictionary DICT_KEYWORDS to specify the keywords associated with each label:

      • LABELS['loc']: Contains location keywords like "New York," "New Jersey," and "Connecticut."

      • LABELS['person']: Contains person-related keywords like "Meteorologists" and "Residents."

      • LABELS['time']: Contains time-related keywords like "morning" and "tomorrow."

  4. Implement the for-loop logic at the end:

    • Iterate through each label and its associated keywords in DICT_KEYWORDS.

    • For each keyword, use re.finditer to find all occurrences in the text (ignoring case).

    • Collect all matches in a list called match_list.

    • Store the match_list in the PREDICTED_LABEL_COLLECTION dictionary under the corresponding label.

    • Finally, return PREDICTED_LABEL_COLLECTION.

The complete labeling function:

@target_label()
def label_function(sample):
  PREDICTED_LABEL_COLLECTION = {}

  text = list(sample.values())[0]

  # Define keywords for each label
  DICT_KEYWORDS = {
    LABELS['loc'] : ['New York', 'New Jersey', 'Connecticut'],
    LABELS['person'] : ['Meteorologists', 'Residents'],
    LABELS['time'] : ['morning', 'tomorrow']
  }

  # Iterate through each label and its associated keywords
  for label, target_keywords in DICT_KEYWORDS.items():
    # Find all occurrences of each keyword in the text
    match_list = [re.finditer(target, text, re.IGNORECASE) for target in target_keywords]

    # Store the matches in the predicted label collection
    PREDICTED_LABEL_COLLECTION[label] = match_list

  return PREDICTED_LABEL_COLLECTION
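
If you want to sanity-check the matching logic outside Datasaur before pasting it into the editor, you can stub out the names the template normally provides. The snippet below is only a local approximation for experimentation; inside Data Programming, LABELS, @target_label, and re are supplied for you:

import re

# Local stand-ins for template-provided names, for experimentation only.
LABELS = {'loc': 'LOC', 'person': 'PERSON', 'time': 'TIME'}

def target_label(**kwargs):
    def wrap(fn):
        return fn
    return wrap

@target_label()
def label_function(sample):
    PREDICTED_LABEL_COLLECTION = {}
    text = list(sample.values())[0]
    DICT_KEYWORDS = {
        LABELS['loc']: ['New York', 'New Jersey', 'Connecticut'],
        LABELS['person']: ['Meteorologists', 'Residents'],
        LABELS['time']: ['morning', 'tomorrow'],
    }
    for label, target_keywords in DICT_KEYWORDS.items():
        match_list = [re.finditer(target, text, re.IGNORECASE) for target in target_keywords]
        PREDICTED_LABEL_COLLECTION[label] = match_list
    return PREDICTED_LABEL_COLLECTION

# Print the character spans matched for each label in a sample sentence.
sample = {'text': 'Meteorologists expect snow in New York tomorrow morning.'}
for label, match_lists in label_function(sample).items():
    spans = [m.span() for matches in match_lists for m in matches]
    print(label, spans)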

You can start writing your labeling functions using our template and experiment with them. Remember, Labeling Functions must be written in Python, and there are some libraries that we support by default.

After you are done setting up your Labeling Functions, click "Predict Labels"; Data Programming will then execute your labeling functions and give you a labeling function analysis for your insights.

And if you have multiple Labeling Functions, you will also be able to see the Inter-Annotator Agreement based on your Labeling Functions.

For further details, please visit the Assisted Labeling - Data Programming page.
