Data Programming
Programmatically label your data with fully customizable labeling functions
Datasaur's Data Programming feature simplifies the processing of large datasets by automating data labeling with user-defined rules. With Data Programming, users define annotation rules as Python heuristics, which are compiled in the background to return predicted labels. These heuristics are called Labeling Functions. The feature is designed for Subject Matter Experts who understand the context and patterns of their datasets and are comfortable writing Python heuristics.
You can start writing your Labeling Functions from our template and experiment with them. Labeling Functions must be written in Python, and several libraries are supported by default.
Once your Labeling Functions are set up, click "Predict Labels". Data Programming then executes your Labeling Functions and provides a labeling function analysis for further insight.
If you have multiple Labeling Functions, you can also view the Inter-Annotator Agreement computed across them.
Data Programming is suitable for labeling datasets that follow certain patterns, such as a unified format or specific keywords. Here are some example use cases for Data Programming:
In the example shown in the image, the labeling functions have been applied to classify satisfaction scores in a sales report. Each row in the dataset contains text describing overall satisfaction, Net Promoter Score (NPS), product quality, and customer support feedback in a recurring format.
Labeling Function: To categorize satisfaction scores, a labeling function scans for keywords like "satisfaction" and predefined score ranges.
For instance, if a satisfaction score is 4 out of 5 or higher and the NPS has increased, the entry is labeled as "success." Conversely, lower scores or negative feedback result in an "unsuccess" label.
This can be expressed in a labeling function in the following way:
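The rule described above can be sketched in plain Python as follows. This is a minimal illustration, not Datasaur's template: the function name label_satisfaction, the two regex patterns, and the choice to return None when no score is found are all assumptions, and the patterns would need adjusting to the actual wording of your report.

```python
import re

# Hypothetical patterns: they assume the score is written as "<number> out of 5"
# in the same sentence as the word "satisfaction", and that an NPS increase is
# described with a word such as "increased", "rose", or "improved".
SCORE_PATTERN = re.compile(
    r"satisfaction[^.]*?(\d+(?:\.\d+)?)\s*out of\s*5", re.IGNORECASE
)
NPS_UP_PATTERN = re.compile(
    r"(?:NPS|Net Promoter Score)[^.]*?\b(?:increased|rose|improved)\b",
    re.IGNORECASE,
)


def label_satisfaction(text):
    """Return "success", "unsuccess", or None when no score is found."""
    score_match = SCORE_PATTERN.search(text)
    if score_match is None:
        return None  # assumption: abstain when the row has no recognizable score

    score = float(score_match.group(1))
    nps_increased = NPS_UP_PATTERN.search(text) is not None

    # "success" requires both a score of at least 4 out of 5 and an NPS increase.
    if score >= 4 and nps_increased:
        return "success"
    return "unsuccess"
```

Inside Datasaur, the same logic would live in the labeling function template rather than in a standalone function.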
In many datasets, especially those derived from social media or informal communication channels, slang and abbreviations are frequently encountered, as in the image above. These can be replaced with their full meanings.
Labeling Function: To replace slang and abbreviations with their full meanings, a labeling function can utilize regex patterns to identify and replace these terms using matching keywords.
For instance, common slang such as "cuz" can be identified and replaced with "because," and abbreviations like "ngl" can be replaced with their full forms like "not going to lie".
The logic is as follows:
First, define the potential slang and abbreviations that need to be replaced with certain keywords.
Then scan the text for slang words or abbreviations, collect the matches, and assign appropriate labels based on the context and specific prefixes in the text.
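The two steps above can be sketched as below. This is an illustrative example only: the SLANG_EXPANSIONS dictionary and the find_slang function are hypothetical names, and the tuple returned for each match is an assumed shape rather than the format Datasaur expects. Context- and prefix-based rules are left out because they depend on your dataset.

```python
import re

# Step 1: define the slang and abbreviations and their full meanings.
# (Hypothetical dictionary; extend it with the terms found in your data.)
SLANG_EXPANSIONS = {
    "cuz": "because",
    "ngl": "not going to lie",
}


def find_slang(text):
    """Step 2: scan the text and collect every slang match with its expansion."""
    matches = []
    for slang, expansion in SLANG_EXPANSIONS.items():
        # Word boundaries keep "cuz" from matching inside a longer word.
        pattern = r"\b" + re.escape(slang) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group(), expansion))
    # Sort by position so matches appear in reading order.
    return sorted(matches)
```

Each collected tuple holds the start offset, end offset, matched text, and suggested replacement, which is enough to label the span or to rewrite the text.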
Labeling functions can label specific keywords, identifying and extracting relevant information to simplify data extraction. For example, within a text, the labeling function can be configured to apply "LOC" only to terms such as "New York", "New Jersey", and "Connecticut", and "Person" only to the keywords "Meteorologists" and "Residents". This customization is achieved by modifying the DICT_KEYWORDS dictionary to meet precise requirements.
Here are the steps to change the labeling function:
Toggle the "Multiple Label Template" on:
Enable the multiple label template to allow for the assignment of multiple labels within the function.
Keep @target_label() empty:
Leave the @target_label() decorator empty because multiple labels are being targeted, and specifying a single target label is not necessary.
Change the DICT_KEYWORDS:
Define a dictionary DICT_KEYWORDS to specify the keywords associated with each label:
LABELS['loc']: Contains location keywords like "New York," "New Jersey," and "Connecticut."
LABELS['person']: Contains person-related keywords like "Meteorologists" and "Residents."
LABELS['time']: Contains time-related keywords like "morning" and "tomorrow."
Implement the for-loop logic at the end:
Iterate through each label and its associated keywords in DICT_KEYWORDS.
For each keyword, use re.finditer to find all occurrences in the text (ignoring case).
Collect all matches in a list called match_list.
Store the match_list in the PREDICTED_LABEL_COLLECTION dictionary under the corresponding label.
Finally, return PREDICTED_LABEL_COLLECTION.
Labeling functions:
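A minimal, self-contained sketch of the steps above is shown here. It is not the exact Datasaur template: the LABELS mapping is an assumed stand-in for the one the template provides, the function takes a plain string, and the empty @target_label() decorator is omitted because it is only available inside Datasaur.

```python
import re

# Assumed stand-in for the template's LABELS mapping; in Datasaur the values
# come from your project's label set.
LABELS = {"loc": "LOC", "person": "Person", "time": "Time"}

# Keywords associated with each label.
DICT_KEYWORDS = {
    LABELS["loc"]: ["New York", "New Jersey", "Connecticut"],
    LABELS["person"]: ["Meteorologists", "Residents"],
    LABELS["time"]: ["morning", "tomorrow"],
}


# In the Datasaur template this function would carry an empty @target_label()
# decorator; it is left out here so the sketch runs on its own.
def label_function(text):
    PREDICTED_LABEL_COLLECTION = {}
    for label, keywords in DICT_KEYWORDS.items():
        match_list = []
        for keyword in keywords:
            # re.escape treats the keyword literally; IGNORECASE ignores case.
            match_list.extend(
                re.finditer(re.escape(keyword), text, flags=re.IGNORECASE)
            )
        PREDICTED_LABEL_COLLECTION[label] = match_list
    return PREDICTED_LABEL_COLLECTION
```

Each value in the returned dictionary is a list of re.Match objects, so the matched text and character offsets are available through group(), start(), and end().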
For further details, please visit the Assisted Labeling - Data Programming page.