# Data Programming

## Introduction

The **Data programming** extension helps automate data labeling for large datasets using rules and heuristics. It reduces manual effort and improves labeling efficiency. You can define labeling functions, create rules, and apply patterns to label data more systematically.

<figure><img src="/files/5odFgKm1ONc4KMy2sUur" alt=""><figcaption></figcaption></figure>

## Key features

* [Labeling functions](/assisted-labeling/data-programming/example-of-labeling-functions.md)**:** Use rule-based functions to generate labels. Individual functions may be imperfect, but combined they improve overall prediction quality. Labeling functions can be written in Python using the provided template.
* [Labeling function analysis](/assisted-labeling/data-programming/labeling-function-analysis.md)**:** View metrics such as coverage, overlaps, and conflicts. Use these insights to improve performance and train the label model. Supports both span labeling and row labeling projects.
* [Inter-annotator agreement for labeling functions](/assisted-labeling/data-programming/inter-annotator-agreement-for-data-programming.md)**:** Measure agreement between labeling functions and reviewed answers to evaluate performance.
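
To make the analysis metrics more concrete, here is a minimal sketch of how coverage, overlaps, and conflicts are commonly defined in data programming. The `votes` table, the `ABSTAIN` marker, and the `analyze` helper are hypothetical and only for illustration; the formulas Datasaur uses may differ.

```python
# Hypothetical vote table: one row per data sample, one column per labeling function.
# ABSTAIN means the function produced no label for that sample.
ABSTAIN = None
votes = [
    ["spam", "spam", ABSTAIN],
    ["spam", "ham", ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
    [ABSTAIN, "ham", "ham"],
]


def analyze(votes, lf_index):
    """Return (coverage, overlap, conflict) fractions for one labeling function."""
    total = len(votes)
    covered = overlapped = conflicted = 0
    for row in votes:
        own = row[lf_index]
        if own is ABSTAIN:
            continue
        covered += 1  # this function labeled the sample
        others = [v for i, v in enumerate(row) if i != lf_index and v is not ABSTAIN]
        if others:
            overlapped += 1  # at least one other function also labeled it
        if any(v != own for v in others):
            conflicted += 1  # at least one other function disagreed
    return covered / total, overlapped / total, conflicted / total


for index in range(3):
    print(index, analyze(votes, index))
```

In this toy table, the second function covers the most samples, while the third never disagrees with another function.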

## Supported libraries

{% hint style="info" %}
If you need additional libraries, contact us at **<support@datasaur.ai>**.
{% endhint %}

| Name           | Version          |
| -------------- | ---------------- |
| pandas         | 1.4.4 and later  |
| textblob       | 0.17.1 and later |
| nltk           | 3.7 and later    |
| spacy          | 3.4.1 and later  |
| scipy          | 1.9.1            |
| numpy          | 1.23.3           |
| transformers   | 4.28.1           |
| requests       | 2.28.1 and later |
| datasets       | 2.7.0            |
| openai         | 0.27.0           |
| stanza         | 1.5.0 and later  |
| spacy-fastlang | 1.0.1 and later  |
| lxml           | 4.9.2            |

## Enable Data programming

### Quick guide

1. Open the **Manage extensions** dialog and enable the **Data programming** extension.

   <figure><img src="/files/JaxpgZx59TPXbr2oJ9em" alt=""><figcaption></figcaption></figure>
2. Select the question (for row labeling projects) or label set (for span labeling projects) to apply labeling functions to.
3. Enable **Multi-label template** to use a template that supports predicting multiple labels. By default, labeling functions predict a single label. Find the sample templates [here](#labeling-function-templates).

   <figure><img src="/files/zDMV3fZl0HT4cpnPS9ps" alt=""><figcaption></figcaption></figure>
4. Create labeling functions by clicking the **Labeling functions** button. Use the provided Python template based on your label set.

   <figure><img src="/files/QnG7mhVPhYqI4f6hnF3b" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
Edit only the section marked `Start editing here`. Do not modify the surrounding code.
{% endhint %}

5. Close the **Manage functions** dialog and click **Predict labels**.

### Labeling function templates

{% file src="/files/KVAYJov1PqnXoPD7S7OQ" %}

{% file src="/files/EbKgIvuzClh5K3iJEIip" %}

{% file src="/files/SKV49Gl2bT7id0OsDJSX" %}

{% file src="/files/X9RpPSdmmx1uaRLBNZFk" %}

## Manage labeling functions

* To create a new labeling function, click **Add**. A template will be generated in the code editor.

  <figure><img src="/files/QnG7mhVPhYqI4f6hnF3b" alt=""><figcaption></figcaption></figure>
* To rename a labeling function, hover over the name and click the edit icon. Type the new name, then click the check icon to save.

  <figure><img src="/files/VQgfIh6dQ58BkCY9i6QG" alt=""><figcaption></figcaption></figure>
* To delete labeling functions, select one or more labeling functions using the checkboxes, then click **Delete**. Confirm the action in the dialog.

  <figure><img src="/files/audAdqklymctEFr7efqd" alt=""><figcaption></figcaption></figure>
* To enable or disable labeling functions, use the toggle next to each labeling function to include or exclude it from prediction.

  <figure><img src="/files/vuMdMGCr1tmi08yMk0w3" alt=""><figcaption></figcaption></figure>

## Build labeling functions

### Default label

By default, a labeling function assigns one label, defined in the following line:

```python
# Assign the target label from the LABELS dictionary
@apply_label(label=LABELS['labelA'])
```
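
The sketch below illustrates the idea of binding a label to a function with a decorator. It is not Datasaur's implementation: the stand-in `apply_label`, the `LABELS` values, and the `ABSTAIN` value are all assumptions so that the example runs outside the template, where the real definitions are already provided.

```python
# Stand-in definitions so the sketch runs on its own; in Datasaur, LABELS and
# apply_label come from the provided template and may behave differently.
LABELS = {"labelA": 0, "labelB": 1}  # hypothetical label names and values
ABSTAIN = -1  # hypothetical "no label" value


def apply_label(label):
    """Hypothetical decorator: map a truthy result to `label`, otherwise ABSTAIN."""
    def decorator(func):
        def wrapper(sample):
            return label if func(sample) else ABSTAIN
        return wrapper
    return decorator


@apply_label(label=LABELS["labelA"])
def labeling_function(sample):
    # Row labeling logic: return True when the rule matches.
    text = list(sample.values())[0]
    return "refund" in text.lower()


print(labeling_function({"text": "I want a refund"}))
print(labeling_function({"text": "Great service"}))
```

To target a different label, change only the key passed to `LABELS[...]` in the decorator line.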

### Input text

By default, the template reads the value of the first column in a row:

```python
# Read the value of the first column in the row
text = list(sample.values())[0]
```

To use a specific single column:

```python
text = sample['<COLUMN_NAME>']
```

To use specific multiple columns:

```python
text = ' '.join([sample['<COLUMN_NAME_A>'], sample['<COLUMN_NAME_B>']])
```
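
The column-selection snippets above can be wrapped in a small helper. This is a minimal sketch that assumes `sample` is a dictionary mapping column names to cell values; `build_text` and the example column names are hypothetical and not part of the template.

```python
def build_text(sample, columns=None, separator=" "):
    """Join the chosen columns of a row into one string.

    With columns=None, only the first column is used, mirroring
    `list(sample.values())[0]` from the template.
    """
    if columns is None:
        columns = list(sample.keys())[:1]
    parts = []
    for name in columns:
        value = sample.get(name)
        if value is not None:  # skip missing or empty cells
            parts.append(str(value))
    return separator.join(parts)


sample = {"title": "Order delayed", "body": "Where is my package?", "rating": None}
print(build_text(sample))
print(build_text(sample, ["title", "body"]))
print(build_text(sample, ["title", "rating"]))
```

Skipping `None` cells avoids a `TypeError` from `' '.join(...)` when a selected column is empty for some rows.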

### Output format

* **Row labeling:** Returns a boolean value.
* **Span labeling:** Returns a `match_list`, a list of the character index ranges of matched tokens, each in the format `[start_index, end_index]`.

  In this example:

  ```python
  >>> text = "Russian and American Alien Spaceship Claims Raise Eyebrows"
  >>> target_token = ["Russian", "American"]
  >>> match_list = [[0,7],[12,20]]
  ```

  `match_list` lists the position of each `target_token` in `text`, given as character offsets where `end_index` is exclusive.
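
The two output formats can be reproduced with a short sketch. The helper names `find_spans` and `contains_any` are hypothetical; the sketch only shows how the values in the example above can be derived with Python's standard `re` module.

```python
import re


def find_spans(text, target_tokens):
    """Return [start_index, end_index] character ranges for every target token."""
    match_list = []
    for token in target_tokens:
        # re.escape treats the token as literal text rather than a regex pattern.
        for match in re.finditer(re.escape(token), text):
            match_list.append([match.start(), match.end()])
    return sorted(match_list)


def contains_any(text, target_tokens):
    """Row-labeling style result: True when any target token occurs in the text."""
    return any(token in text for token in target_tokens)


text = "Russian and American Alien Spaceship Claims Raise Eyebrows"
target_token = ["Russian", "American"]
print(find_spans(text, target_token))    # [[0, 7], [12, 20]]
print(contains_any(text, target_token))  # True
```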

### Regex

If you use regex in your logic, you can build `match_list` with `re.finditer`:

```python
import re

# Each target can be a keyword or a regex pattern.
match_list = [re.finditer(target, text) for target in TARGET_KEYWORDS]

# Or, with a precompiled pattern:
date = re.compile(r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")
PATTERNS = [date]
match_list = [re.finditer(pattern, text) for pattern in PATTERNS]
```
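
To check what a pattern matches before using it in a labeling function, the iterators can be flattened into `[start_index, end_index]` pairs. This is a sketch for local inspection only: the sample `text` is made up, the date pattern is written with single-escaped `\d` and months `01` to `12`, and whether the template expects iterators or index pairs is not confirmed here.

```python
import re

# Year 1900-2099, month 01-12, day 01-31, separated by "-", " ", "/" or ".".
date = re.compile(r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")
PATTERNS = [date]

text = "Shipped 2023-04-15, delivered 2023/04/18."  # made-up sample text

# One iterator of match objects per pattern, as in the snippet above.
match_list = [re.finditer(pattern, text) for pattern in PATTERNS]

# Flatten the iterators into [start_index, end_index] pairs for inspection.
spans = [[m.start(), m.end()] for matches in match_list for m in matches]
print(spans)
print([text[start:end] for start, end in spans])
```

Note that each `re.finditer` iterator can be consumed only once, so build `match_list` again after inspecting it this way.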

### Labeling function remover

* Row labeling

  ```python
  def labeling_function(sample):
      return False
  ```
* Span labeling

  ```python
  def labeling_function(sample):
      match_list = []
      return match_list
  ```

