# Data Programming

## Overview

The **Data programming** feature helps automate labeling for large datasets using user-defined rules. You write these rules in Python as **labeling functions**, which predict labels automatically based on patterns in the data.

<figure><img src="/files/85adDDDGCUhY9yQvtgjL" alt=""><figcaption></figcaption></figure>

You can start by using the provided template to create and test labeling functions. Labeling functions must be written in Python, and several libraries are supported by default.

<figure><img src="/files/qPDhSF5foZ9F19HlApLD" alt=""><figcaption></figcaption></figure>

After configuring your labeling functions, click **Predict labels** to run them and generate predicted labels. The system will also provide [labeling function analysis](/assisted-labeling/data-programming/labeling-function-analysis.md) to help you evaluate performance.

<figure><img src="/files/3vxdZ1M01cNiQSyDq7ss" alt=""><figcaption></figcaption></figure>

If multiple labeling functions are used, you can also view the [inter-annotator agreement](/assisted-labeling/data-programming/inter-annotator-agreement-for-data-programming.md) between them.

<figure><img src="/files/YEnBrPpJ93lkTWgxe1xr" alt=""><figcaption></figcaption></figure>

## Use cases

**Data programming** works best for datasets with consistent patterns, formats, or keywords.

Examples include:

* Structured reports.
* Repeated text formats.
* Keyword-based classification.
* Rule-based categorization.

### Sales report satisfaction scores

<figure><img src="/files/4yBwkybMFQnHnkRBwgmd" alt=""><figcaption></figcaption></figure>

In this example, labeling functions are used to classify customer satisfaction results in sales reports.

Each row contains recurring information such as:

* Overall satisfaction.
* Net Promoter Score (NPS).
* Product quality feedback.
* Customer support feedback.

#### Create the labeling function

A labeling function can scan for keywords such as “satisfaction” and compare the reported score against a predefined threshold. For example:

* Satisfaction scores of 4/5 (80%) or higher with an increased NPS can be labeled as **success**.
* Lower scores or negative feedback can be labeled as **unsuccess**.

This logic can be implemented directly in a labeling function. The example below checks only the satisfaction percentage:

```python
@target_label(label=LABELS['success'])
def label_function(sample) -> bool:
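    # Use the first column of the row as the text to inspect
    text = list(sample.values())[0]

    # Expects text such as "Overall Satisfaction: 85%"
    if "Overall Satisfaction" in text:
        try:
            satisfaction_percentage = int(text.split(": ")[1].split("%")[0])
        except (IndexError, ValueError):
            # The score is missing or is not a whole-number percentage
            return False
        if satisfaction_percentage >= 80:
            return True
    return False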
```
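To try this rule outside Datasaur before pasting it into the template, a minimal local sketch might look like the following. The `LABELS` dictionary and the `target_label` decorator here are hypothetical stand-ins for the names the template provides, the sample rows are made up, and the row format (`Overall Satisfaction: 85%`) is an assumption. This variant uses a regular expression instead of string splitting so that rows without a percentage simply return `False`.

```python
import re

# Hypothetical stand-ins for names that the Datasaur template provides.
LABELS = {"success": 0, "unsuccess": 1}


def target_label(label=None):
    """No-op decorator so the function can run outside Datasaur."""
    def decorator(func):
        func.target_label = label
        return func
    return decorator


@target_label(label=LABELS["success"])
def label_function(sample) -> bool:
    text = list(sample.values())[0]
    # Look for "Overall Satisfaction: <number>%" anywhere in the row.
    match = re.search(r"Overall Satisfaction:\s*(\d+)\s*%", text)
    return match is not None and int(match.group(1)) >= 80


# Made-up rows for a quick check.
rows = [
    {"text": "Overall Satisfaction: 85%"},
    {"text": "Overall Satisfaction: 72%"},
    {"text": "NPS increased by 4 points"},
]
results = [label_function(row) for row in rows]
print(results)
```

Only the first row meets the 80% threshold, so it is the only one the function marks as **success**.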

### Slang and abbreviation replacement

<figure><img src="/files/5pPoGiCaEDFNJrK0opYS" alt=""><figcaption></figcaption></figure>

Many datasets, especially from social media or informal conversations, contain slang words and abbreviations. These can be automatically detected and replaced with their full meanings.

A labeling function can use regex patterns to identify slang or abbreviations and replace them with predefined full forms. For example:

* `cuz` → `because`
* `ngl` → `not going to lie`

#### Create the labeling function

First, define the slang words or abbreviations and their replacements.

```python
# Implement your logic here
DICT_KEYWORDS = {
    LABELS['i\'m not lying'] : ['no cap'],
    LABELS['a feeling (may be good or bad)'] : ['vibe'],
    LABELS['about'] : ['abt', 'bout'],
    LABELS['already'] : ['alr'],
    LABELS['alright'] : ['ight'],
    LABELS['and'] : ['n'],
    LABELS['as fuck'] : ['asf'],
    LABELS['ass'] : ['ahh'],
    LABELS['at the moment'] : ['atm'],
}
```

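Then scan the text for each slang word or abbreviation, collect the matches, and store them under the matching label. Some terms are ambiguous: in the example below, `sum` is labeled **something** when it follows a prefix such as `a`, `the`, or `say`, and **someone** otherwise.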

```python
for label, target_keywords in DICT_KEYWORDS.items():
    # See https://docs.python.org/3/library/re.html#re.finditer for details
    match_list = []
    for target in target_keywords:
        # r'\b...\b' matches the slang term as a whole word, not as part of a word.
        # For example, it keeps 'u', 'n', and 'w' inside "unanswered" from being
        # detected, while 'u' in "I bet u wont come" is still detected.
        # '+' lets the last letter repeat (trailing letters); it is skipped for
        # single-character slang (u/n/w), where no trailing letters have been seen.

        if len(target) == 1:
            pattern = r'\b{}\b'.format(target)
        else:
            pattern = r'\b{}+\b'.format(target)

        match_list.append(re.finditer(pattern, text, re.IGNORECASE))

    if label==LABELS['something'] and match_list:
        prefix_something = ['a', 'the', 'say', 'saying', 'said', 'think', 'thinking', 'thought']
        
        if any(pref+' '+'sum' in text for pref in prefix_something):
            PREDICTED_LABEL_COLLECTION[LABELS['something']] = match_list
        else:
            PREDICTED_LABEL_COLLECTION[LABELS['someone']] = match_list
    else:
        PREDICTED_LABEL_COLLECTION[label] = match_list

return PREDICTED_LABEL_COLLECTION

```
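The effect of the word-boundary pattern can be checked on its own with plain Python. The helper below is a hypothetical sketch (the name `find_slang` is not part of Datasaur); it builds the same two patterns and adds `re.escape` as a precaution for keywords that contain regex metacharacters.

```python
import re


def find_slang(text, target):
    """Return (start, end, matched_text) for each whole-word hit of target."""
    # re.escape guards against keywords containing regex metacharacters.
    escaped = re.escape(target)
    if len(target) == 1:
        pattern = r"\b{}\b".format(escaped)
    else:
        # '+' lets the final letter repeat, so 'abt' also matches 'abttt'.
        pattern = r"\b{}+\b".format(escaped)
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(pattern, text, re.IGNORECASE)
    ]


print(find_slang("I bet u wont come", "u"))          # [(6, 7, 'u')]
print(find_slang("The call went unanswered", "u"))   # []
print(find_slang("What is this abttt", "abt"))       # [(13, 18, 'abttt')]
```

The second call returns no matches because the `u` in `unanswered` is followed by another word character, so the closing `\b` cannot match.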

### Extracting predefined keywords from text

<figure><img src="/files/KL1YGWg1YSiNSragy8xI" alt=""><figcaption></figcaption></figure>

Labeling functions can automatically identify and label specific keywords in text to simplify data extraction. For example:

* Apply the **LOC** label to terms such as `New York`, `New Jersey`, and `Connecticut`
* Apply the **Person** label to terms such as `Meteorologists` and `Residents`

This behavior can be customized by modifying the `DICT_KEYWORDS` dictionary.

#### Create the labeling function

1. Enable the **Multiple-label template** option to allow multiple labels in a single labeling function.
2. Leave `@target_label()` empty, since the function handles multiple labels.
3. Define `DICT_KEYWORDS`. For example:
   * `LABELS['loc']` → `New York`, `New Jersey`, and `Connecticut`
   * `LABELS['person']` → `Meteorologists` and `Residents`
   * `LABELS['time']` → `morning` and `tomorrow`
4. Add the loop logic. The labeling function should:
   * Iterate through each label and its associated keywords in `DICT_KEYWORDS`.
   * For each keyword, use `re.finditer` to find all occurrences in the text (ignoring case).
   * Collect all matches in a list called `match_list`.
   * Store the `match_list` in the `PREDICTED_LABEL_COLLECTION` dictionary under the corresponding label.
   * Finally, return `PREDICTED_LABEL_COLLECTION`.

**Labeling function:**

```python
@target_label()
def label_function(sample):
  PREDICTED_LABEL_COLLECTION = {}

  text = list(sample.values())[0]

  # Define keywords for each label
  DICT_KEYWORDS = {
    LABELS['loc'] : ['New York', 'New Jersey', 'Connecticut'],
    LABELS['person'] : ['Meteorologists', 'Residents'],
    LABELS['time'] : ['morning', 'tomorrow']
  }

  # Iterate through each label and its associated keywords
  for label, target_keywords in DICT_KEYWORDS.items():
    # Find all occurrences of each keyword in the text
    match_list = [re.finditer(target, text, re.IGNORECASE) for target in target_keywords]

    # Store the matches in the predicted label collection
    PREDICTED_LABEL_COLLECTION[label] = match_list

  return PREDICTED_LABEL_COLLECTION
```
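Each entry in `match_list` is an iterator of match objects, not a list of strings. To inspect what a set of keywords would match before running the function in Datasaur, a small standalone sketch like the one below can help. The sample sentence is made up, and how Datasaur itself consumes the iterators is not covered here.

```python
import re

# Made-up sample text for a local check.
text = "Residents of New York and New Jersey expect rain tomorrow morning."
keywords = ["New York", "New Jersey", "Connecticut"]

# One iterator per keyword, mirroring the shape of match_list above.
# re.escape is a precaution for keywords containing regex metacharacters.
match_list = [re.finditer(re.escape(k), text, re.IGNORECASE) for k in keywords]

# Flatten to (start, end, matched_text). Each iterator can be consumed only once.
spans = [
    (m.start(), m.end(), m.group())
    for matches in match_list
    for m in matches
]
print(spans)
```

Keywords that do not occur in the text, such as `Connecticut` here, contribute an empty iterator and therefore no spans.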

{% hint style="info" %}
For further details, see [Assisted Labeling - Data Programming](/assisted-labeling/data-programming.md).
{% endhint %}

