Data Programming

Programmatically label your data with fully customizable labeling functions

Overview

Datasaur's Data Programming feature simplifies processing large datasets by automating the data labeling process with user-defined rules. With Data Programming, users define annotation rules as Python heuristics, which are then compiled in the background to return predicted labels. These heuristics are called Labeling Functions. This feature is designed for Subject Matter Experts who understand the context and patterns of their datasets and are familiar with Python.

You can start writing your Labeling Functions from our template and experiment with it. Labeling Functions must be written in Python, and a set of libraries is supported by default.

Once your Labeling Functions are set up, click "Predict Labels". Data Programming then executes your Labeling Functions and provides a labeling function analysis for your insights.

If you have multiple Labeling Functions, you can also view the Inter-Annotator Agreement based on your Labeling Functions.

Use case

Data Programming is suitable for labeling datasets that follow clear patterns, such as a uniform format or specific keywords. Here are examples of use cases for Data Programming:

Sales Report Satisfaction Scores

In this example, labeling functions are applied to classify satisfaction scores in a sales report. Each row in the dataset contains text describing overall satisfaction, Net Promoter Score (NPS), product quality, and customer support feedback in a recurring format.

Labeling Function: To categorize satisfaction scores, a labeling function scans for keywords like "satisfaction" and predefined score ranges.

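For instance, if a satisfaction score is 4 out of 5 or higher and the NPS has increased, the entry is labeled as "success". Conversely, lower scores or negative feedback result in an "unsuccess" label. The sample function that follows implements the overall-satisfaction part of this rule, using 80% (the equivalent of 4 out of 5) as the threshold.

This can be implemented in a labeling function as follows: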

@target_label(label=LABELS['success'])
def label_function(sample) -> bool:
    text = list(sample.values())[0]

    # Implement logic here
    if "Overall Satisfaction" in text:
        satisfaction_percentage = int(text.split(": ")[1].split("%")[0])
        if satisfaction_percentage >= 80:
            return True
    return False
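
The split-based parse above assumes that the first ": " in the row is immediately followed by the satisfaction percentage. As a minimal sketch for experimenting outside Datasaur, the same threshold check can be written with a regular expression so that rows without a parsable percentage simply return False. The helper name `is_success`, the pattern, and the sample rows are hypothetical, and the `@target_label` decorator and `LABELS` mapping supplied by the Datasaur template are deliberately left out.

```python
import re

# Hypothetical pattern: finds "Overall Satisfaction: 85%" anywhere in the row.
SATISFACTION_PATTERN = re.compile(r"Overall Satisfaction:\s*(\d+)\s*%")


def is_success(text: str, threshold: int = 80) -> bool:
    """Return True when the row reports a satisfaction percentage >= threshold."""
    match = SATISFACTION_PATTERN.search(text)
    if match is None:
        # No parsable percentage in this row: do not assign the label.
        return False
    return int(match.group(1)) >= threshold


# Hypothetical sample rows, only for local experimentation.
rows = [
    "Overall Satisfaction: 85%, NPS: +5",
    "Overall Satisfaction: 72%, NPS: -2",
    "Product quality feedback only",
]
results = [is_success(row) for row in rows]
print(results)
```

Once the logic behaves as expected locally, the body of `is_success` can be moved into the template's `label_function`.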

Slang and Abbreviation Replacement

In many datasets, especially those derived from social media or informal communication channels, slang and abbreviations are frequently encountered. These can be replaced with their full meanings.

Labeling Function: To replace slang and abbreviations with their full meanings, a labeling function can utilize regex patterns to identify and replace these terms using matching keywords.

For instance, common slang such as "cuz" can be identified and replaced with "because," and abbreviations like "ngl" can be replaced with their full forms like "not going to lie".

The logic is as follows:

First, define the slang terms and abbreviations to detect, keyed by the label (the full meaning) that replaces them.

# Implement your logic here
DICT_KEYWORDS = {
    LABELS['i\'m not lying'] : ['no cap'],
    LABELS['a feeling (may be good or bad)'] : ['vibe'],
    LABELS['about'] : ['abt', 'bout'],
    LABELS['already'] : ['alr'],
    LABELS['alright'] : ['ight'],
    LABELS['and'] : ['n'],
    LABELS['as fuck'] : ['asf'],
    LABELS['ass'] : ['ahh'],
    LABELS['at the moment'] : ['atm'],
}

Then scan the text for slang words or abbreviations, collect the matches, and assign the appropriate labels based on the context and specific prefixes in the text.

for label, target_keywords in DICT_KEYWORDS.items():
    # please refer to https://docs.python.org/3/library/re.html#re.finditer for more info
    match_list = []
    for target in target_keywords:
        # '\b...\b' matches the slang term as a whole word, not as part of a word.
        # For example, it keeps 'u', 'n', and 'w' in the word "unanswered" from being detected as slang,
        # while 'u' in the text 'I bet u wont come' is still detected.
        # '+' lets the last letter repeat (trailing letters); it is skipped for single-character slang (u/n/w).

        if len(target)==1:
            pattern = r'\b{}\b'.format(target)
        else:
            pattern = r'\b{}+\b'.format(target)
        
        match_list.append(re.finditer(pattern, text, re.IGNORECASE))

    if label==LABELS['something'] and match_list:
        prefix_something = ['a', 'the', 'say', 'saying', 'said', 'think', 'thinking', 'thought']
        
        if any(pref+' '+'sum' in text for pref in prefix_something):
            PREDICTED_LABEL_COLLECTION[LABELS['something']] = match_list
        else:
            PREDICTED_LABEL_COLLECTION[LABELS['someone']] = match_list
    else:
        PREDICTED_LABEL_COLLECTION[label] = match_list

return PREDICTED_LABEL_COLLECTION
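
To see what the word-boundary patterns do outside Datasaur, here is a minimal standalone sketch. It assumes plain string labels in place of the `LABELS` mapping; the dictionary `SLANG_KEYWORDS`, the helper `find_slang_spans`, and the sample sentence are hypothetical. It also uses `re.escape`, which is an addition not present in the excerpt above, and it returns character spans rather than the iterator lists that the excerpt stores in `PREDICTED_LABEL_COLLECTION`.

```python
import re

# Hypothetical stand-in for DICT_KEYWORDS: full meaning -> slang variants.
SLANG_KEYWORDS = {
    "about": ["abt", "bout"],
    "already": ["alr"],
    "and": ["n"],
    "at the moment": ["atm"],
}


def find_slang_spans(text: str) -> dict:
    """Map each label to the (start, end, matched_text) spans found in text."""
    spans_by_label = {}
    for label, keywords in SLANG_KEYWORDS.items():
        spans = []
        for keyword in keywords:
            # \b...\b matches the keyword only as a whole word, so the 'n'
            # inside "unanswered" is not reported as slang for "and".
            pattern = r"\b{}\b".format(re.escape(keyword))
            for m in re.finditer(pattern, text, re.IGNORECASE):
                spans.append((m.start(), m.end(), m.group()))
        if spans:
            spans_by_label[label] = sorted(spans)
    return spans_by_label


sample = "Talk abt it n leave the unanswered mail, I am busy atm"
found = find_slang_spans(sample)
for label, spans in found.items():
    print(label, spans)
```

Labels with no match are omitted from the result, which makes it easy to check that a single-character keyword such as "n" fires only where it stands alone.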

Extracting predefined keywords from text

Labeling functions can label specific keywords, identifying and extracting relevant information to simplify the data extraction process. For example, within a text, the labeling function can be configured to apply the "LOC" label only to terms such as "New York", "New Jersey", and "Connecticut", and the "Person" label only to the keywords "Meteorologists" and "Residents". This customization is achieved by modifying the DICT_KEYWORDS dictionary to meet your requirements.

Follow these steps to change the labeling function:

Toggle the "Multiple Label Template" on:
- Enable the multiple label template to allow for the assignment of multiple labels within the function.
Keep @target_label() empty:
- Leave the @target_label() decorator empty because multiple labels are being targeted, and specifying a single target label is not necessary.
Change the DICT_KEYWORDS:
- Define a dictionary DICT_KEYWORDS to specify the keywords associated with each label:
  - LABELS['loc']: Contains location keywords like "New York," "New Jersey," and "Connecticut."
  - LABELS['person']: Contains person-related keywords like "Meteorologists" and "Residents."
  - LABELS['time']: Contains time-related keywords like "morning" and "tomorrow."
Implement the for-loop logic at the end:
- Iterate through each label and its associated keywords in DICT_KEYWORDS.
- For each keyword, use re.finditer to find all occurrences in the text (ignoring case).
- Collect all matches in a list called match_list.
- Store the match_list in the PREDICTED_LABEL_COLLECTION dictionary under the corresponding label.
- Finally, return PREDICTED_LABEL_COLLECTION.

Labeling functions:

@target_label()
def label_function(sample):
  PREDICTED_LABEL_COLLECTION = {}

  text = list(sample.values())[0]

  # Define keywords for each label
  DICT_KEYWORDS = {
    LABELS['loc'] : ['New York', 'New Jersey', 'Connecticut'],
    LABELS['person'] : ['Meteorologists', 'Residents'],
    LABELS['time'] : ['morning', 'tomorrow']
  }

  # Iterate through each label and its associated keywords
  for label, target_keywords in DICT_KEYWORDS.items():
    # Find all occurrences of each keyword in the text
    match_list = [re.finditer(target, text, re.IGNORECASE) for target in target_keywords]

    # Store the matches in the predicted label collection
    PREDICTED_LABEL_COLLECTION[label] = match_list

  return PREDICTED_LABEL_COLLECTION
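
Note that `re.finditer` returns a lazy iterator, so `match_list` above holds one iterator per keyword rather than the matches themselves. As a minimal sketch for checking keyword lists locally before pasting them into Datasaur, the following standalone function materializes the matches as character spans. It assumes plain string labels in place of `LABELS`; the function name `extract_keyword_spans` and the sample sentence are hypothetical, and the sketch makes no claim about how Datasaur consumes the returned collection.

```python
import re

# Plain-string stand-ins for LABELS['loc'], LABELS['person'], LABELS['time'].
DICT_KEYWORDS = {
    "loc": ["New York", "New Jersey", "Connecticut"],
    "person": ["Meteorologists", "Residents"],
    "time": ["morning", "tomorrow"],
}


def extract_keyword_spans(text: str) -> dict:
    """Return {label: [(start, end, matched_text), ...]} for every keyword hit."""
    collection = {}
    for label, keywords in DICT_KEYWORDS.items():
        spans = []
        for keyword in keywords:
            # re.escape keeps keywords literal if they ever contain regex
            # metacharacters such as '.' or '+'.
            for m in re.finditer(re.escape(keyword), text, re.IGNORECASE):
                spans.append((m.start(), m.end(), m.group()))
        # Sort by start offset so hits appear in reading order.
        collection[label] = sorted(spans)
    return collection


sample = (
    "Meteorologists warn residents of New York and New Jersey "
    "about storms tomorrow morning."
)
spans = extract_keyword_spans(sample)
for label, hits in spans.items():
    print(label, [hit[2] for hit in hits])
```

Because matching ignores case, "residents" in the sample is found even though the keyword is written as "Residents".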

For further details, please visit the Assisted Labeling - Data Programming page.
