# Data Programming

## Introduction

Datasaur's **Data programming** extension automates data labeling, a task typically done manually, by applying a set of algorithms and heuristics. This is particularly useful for large data volumes, where it can improve labeling efficiency and accuracy. The automation lets users focus on more critical aspects of data analysis and model training, making it a key component in building high-quality ML models. You can adjust the labeling functions, define your own rules, and create patterns to help you label the data.

<figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-bde30989a0fff006f2944eafaa12f0010f8b9aeb%2FExtension%20-%20Data%20programming%20-%20cover.png?alt=media" alt=""><figcaption></figcaption></figure>

## Key Features

* **Labeling functions:** The **Data programming** extension uses various heuristics or rules known as labeling functions. While these functions might not be highly accurate individually, collectively they provide better predictions than random selection. A series of examples of labeling functions can be found [here](https://docs.datasaur.ai/assisted-labeling/data-programming/example-of-labeling-functions). Labeling functions can be written using the provided template in Python, and some [libraries](#supported-libraries) are supported by default.
* **Labeling function analysis**: View results of labeling functions, including coverage, overlaps, and conflicts. Improve performance by training the label model. Supports both span-based and row-based data. For more details, click [here](https://docs.datasaur.ai/assisted-labeling/data-programming/labeling-function-analysis).
* **Inter-annotator agreement for labeling functions:** Calculate the performance of labeling functions and reviewed answers using inter-annotator agreement. For more details, click [here](https://docs.datasaur.ai/assisted-labeling/data-programming/inter-annotator-agreement-for-data-programming).
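
To make the analysis terms concrete, the sketch below computes coverage, overlaps, and conflicts for a small set of row-based labeling function outputs, then combines the outputs with a simple majority vote. It assumes the conventional data programming definitions of these metrics, which may differ from the exact formulas Datasaur uses. The vote matrix, the `ABSTAIN` marker, and all function names are hypothetical, and majority vote stands in for the label model only as an illustration.

```python
from collections import Counter

ABSTAIN = None  # hypothetical marker: the labeling function did not fire on this row

# Hypothetical votes: one inner list per row, one entry per labeling function.
votes = [
    ["spam", "spam", ABSTAIN],
    ["ham", ABSTAIN, ABSTAIN],
    ["spam", "ham", "spam"],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]


def analyze(votes, lf_index):
    """Return (coverage, overlap, conflict) fractions for one labeling function."""
    covered = overlapped = conflicted = 0
    for row in votes:
        own = row[lf_index]
        if own is ABSTAIN:
            continue
        covered += 1
        # Votes cast on the same row by the other labeling functions.
        others = [v for i, v in enumerate(row) if i != lf_index and v is not ABSTAIN]
        if others:
            overlapped += 1
        if any(v != own for v in others):
            conflicted += 1
    total = len(votes)
    return covered / total, overlapped / total, conflicted / total


def majority_vote(row):
    """Pick the most common non-abstaining vote, or ABSTAIN if nobody voted."""
    counts = Counter(v for v in row if v is not ABSTAIN)
    return counts.most_common(1)[0][0] if counts else ABSTAIN


for i in range(3):
    print(f"LF {i}: coverage/overlap/conflict = {analyze(votes, i)}")
print([majority_vote(row) for row in votes])
```

A labeling function with low coverage or a high conflict rate is usually the first candidate for revision.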

## Supported Libraries

{% hint style="info" %}
If you require any additional libraries, please reach out to us by contacting <support@datasaur.ai>.
{% endhint %}

| Name           | Version          |
| -------------- | ---------------- |
| pandas         | 1.4.4 and later  |
| textblob       | 0.17.1 and later |
| nltk           | 3.7 and later    |
| spacy          | 3.4.1 and later  |
| scipy          | 1.9.1            |
| numpy          | 1.23.3           |
| transformers   | 4.28.1           |
| requests       | 2.28.1 and later |
| datasets       | 2.7.0            |
| openai         | 0.27.0           |
| stanza         | 1.5.0 and later  |
| spacy-fastlang | 1.0.1 and later  |
| lxml           | 4.9.2            |
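
If you prototype labeling functions locally before pasting them into the editor, it can help to compare your environment against this table. The sketch below is one way to list installed versions using only the Python standard library. `SUPPORTED` and `installed_versions` are hypothetical names, and the check reflects your local machine, not Datasaur's runtime.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names copied from the table above.
SUPPORTED = [
    "pandas", "textblob", "nltk", "spacy", "scipy", "numpy", "transformers",
    "requests", "datasets", "openai", "stanza", "spacy-fastlang", "lxml",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(SUPPORTED).items():
    print(f"{name:15} {ver or 'not installed'}")
```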

## Quick Start Guide

### How to use Data Programming in General

1. Add the **Data programming** extension from the **Manage extensions** dialog.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-374e899187897bc2fe2348652b51769f0fc330a2%2FExtension%20-%20Manage%20extensions%20-%20Data%20programming.png?alt=media" alt=""><figcaption></figcaption></figure>
2. The **Data programming** extension will appear on your right. Let's break down what we have here:
   * **Target question/Label set**: Choose the questions or label set that you want to target for **Data programming** usage.
   * **Multiple-label template**: Turn this on to generate a labeling function template that can predict multiple labels. By default, a labeling function predicts only 1 label from the defined labels; the multilabel template is for logic that needs to cover more than 1 label. Please find the template [here](#labeling-function-template).
   * **Labeling functions**: This button will take you to the **Manage functions** dialog, covering the labeling function settings, labeling function analysis, and inter-annotator agreement.
   * **Predict labels**: After creating your labeling functions, you can start predicting the answers or labels using those functions.

     <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-63bdfeceeffe911690db100de765431b5ef98383%2FExtension%20-%20Data%20programming%20-%20highlight%20-%20initials.png?alt=media" alt=""><figcaption><p>Data Programming Extension</p></figcaption></figure>
3. You can create labeling functions by clicking the **Labeling functions** button. By default it will open the **Labeling function settings** tab, where you can add your labeling functions. It also provides you with a code template for a labeling function based on your label set.\
   **Note**: Pay attention to the comment included in the template. Start editing your logic where it says `(Start editing here!)`; the code above that line should not be edited.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0710d4a48c9a89ca8799a797a192fd4f42535de5%2FExtension%20-%20Data%20programming%20-%20manage%20LF%20-%20filled.png?alt=media" alt=""><figcaption><p>Data Programming Code Editor</p></figcaption></figure>
4. Close the **Manage functions** dialog and click **Predict labels** in the **Data programming** extension.

#### Labeling Function Template

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-636a757870895dbb35de5cb32b0bfa64d75c9347%2FDefault%20Labeling%20Function%20for%20Row%20(1).txt?alt=media&token=f1683d1a-d73d-4f4b-bd48-51308a1bb069>" %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-62b3a91e98fd1f7d629d4cb67f7b84a20691f218%2FDefault%20Labeling%20Function%20for%20Span%20(1).txt?alt=media&token=97b00e92-a91e-42de-ac92-56defce02c23>" %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-078390e8dffdf51710bd2a3eb578feb0c60036bc%2FMultiple%20Labeling%20Function%20for%20Row.txt?alt=media&token=f17df66a-f21f-462b-b0fe-12f3d35aa57a>" %}

{% file src="<https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-f95b8c52a8d4727c8f0f6d693669a79ac6618f15%2FMultiple%20Labeling%20Function%20for%20Span.txt?alt=media&token=e13f3fac-f71a-4202-8169-8eb2b78f92be>" %}

### Edit and adjust Labeling Functions

1. First, click the **Add** button to create a labeling function. A template will be generated in the code editor.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0710d4a48c9a89ca8799a797a192fd4f42535de5%2FExtension%20-%20Data%20programming%20-%20manage%20LF%20-%20filled.png?alt=media" alt=""><figcaption><p>Data Programming Pop Up Window</p></figcaption></figure>
2. To rename a labeling function, hover over its name and click the edit button. Type the new name, then click the save button to keep it or the cancel button to discard it.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-11f0352c5747c46c0f319e6d8ffbfe310f8b4ade%2FExtension%20-%20Data%20programming%20-%20manage%20LF%20-%20edit%20mode.png?alt=media" alt=""><figcaption></figcaption></figure>
3. You can remove labeling functions **one by one** or **several at once**. Select one or more labeling functions via the checkbox and click **Delete**.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-f13c974e40733efee9f480d69038e232f084f085%2FExtension%20-%20Data%20programming%20-%20manage%20LF%20-%20delete%20multiple.png?alt=media" alt=""><figcaption><p>Delete multiple</p></figcaption></figure>
4. A confirmation dialog will appear. After you click **OK**, the selected labeling functions will be deleted.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0d6c630f4153755daa48e6ed118b7c2d34bdb5ea%2FExtension%20-%20Data%20programming%20-%20manage%20LF%20-%20delete%20confirmation.png?alt=media" alt=""><figcaption></figcaption></figure>
5. Use the toggle next to each labeling function to activate or deactivate it for prediction.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-0dd402ef132b8f9d1a619bdb141788c058df470c%2FExtension%20-%20Data%20programming%20-%20manage%20LF%20-%20highlight%20toggle.png?alt=media" alt=""><figcaption></figcaption></figure>

## Build Labeling Functions in detail

* By default, `labeling_function` assigns only 1 label, which is defined on this line:

  ```python
  #Assign target label based on LABELS dictionary
  @apply_label(label=LABELS['labelA'])
  ```
* By default, `labeling_function` processes text that contains all columns in one row.

  ```python
  ## please check <link section 'text' in gitbook> for more info
  text = list(sample.values())[0]
  ```

  To process only one column, use:

  ```python
  text = sample[<COLUMN_NAME>]
  ```

  To process a specific set of columns, use:

  ```python
  text = ' '.join([sample[<COLUMN_NAME_A>], sample[<COLUMN_NAME_B>]])
  ```
* A labeling function returns a boolean for row-based labeling and a `match_list` for span-based labeling.

  `match_list` is a list of matched token index pairs (**format: \[start\_index, end\_index]**).

  For example:

  ```python
  >>> text = "Russian and American Alien Spaceship Claims Raise Eyebrows"
  >>> target_token = ["Russian", "American"]
  >>> match_list = [[0,7],[12,20]]
  ```

  `match_list` is the list of `target_token` positions within `text`.

  \
  **Special Notes:**

  If you use regular expressions in your logic, you can build `match_list` with `re.finditer`:

  ```python
  import re

  # Each target can be a keyword or a regex pattern.
  # `text` comes from the labeling function body; TARGET_KEYWORDS is your own list.
  match_list = [re.finditer(target, text) for target in TARGET_KEYWORDS]

  # Or, with precompiled patterns (here, dates such as 2023-01-31):
  date = re.compile(r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")
  PATTERNS = [date]
  match_list = [re.finditer(pattern, text) for pattern in PATTERNS]
  ```
* Labeling function remover
  * Row Labeling

    ```python
    def labeling_function(sample):
        return False
    ```
  * Span Labeling

    ```python
    def labeling_function(sample):
        match_list = []
        return match_list
    ```
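
The points above can be tried outside Datasaur with a plain-Python sketch. It leaves out the template's decorator and `LABELS` dictionary, which only exist inside the editor, and assumes `sample` is a dictionary mapping column names to text. The column names, keywords, and helper names are hypothetical. The span helper returns `[start_index, end_index]` pairs in the format described above.

```python
import re

# Hypothetical row: column names mapped to cell text.
sample = {
    "headline": "Russian and American Alien Spaceship Claims Raise Eyebrows",
    "source": "Daily Example",
}

TARGET_KEYWORDS = ["Russian", "American"]  # hypothetical keywords


def select_text(sample, columns=None):
    """Join the chosen columns, or fall back to the first column's value."""
    if columns is None:
        return list(sample.values())[0]
    return " ".join(sample[name] for name in columns)


def row_labeling_function(sample):
    """Row-based: return True when any keyword appears in the text."""
    text = select_text(sample, ["headline"])
    return any(re.search(re.escape(k), text) for k in TARGET_KEYWORDS)


def span_labeling_function(sample):
    """Span-based: return [start_index, end_index] pairs for every keyword hit."""
    text = select_text(sample, ["headline"])
    match_list = []
    for keyword in TARGET_KEYWORDS:
        for match in re.finditer(re.escape(keyword), text):
            match_list.append([match.start(), match.end()])
    return sorted(match_list)


print(row_labeling_function(sample))   # True
print(span_labeling_function(sample))  # [[0, 7], [12, 20]]
```

`re.escape` keeps keywords literal; drop it when the targets are meant to be regex patterns.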
