Evaluation Metrics

Easily evaluate your labeling process

To access this page, click the triple dots on a specific project, select "View Project Analytics," and navigate to the "Evaluation Metrics" tab.

Note that evaluation metrics are available only after a project is completed. During calculation, labels from Reviewer Mode serve as the ground truth, and each labeler is evaluated against that Reviewer Mode version.

Currently, evaluation metrics are available for:

  1. Row Labeling projects with dropdown, hierarchical dropdown, Yes/No, or single-choice questions that accept a single answer (dropdown questions with multiple answers are excluded).

  2. Span Labeling projects.

How to Use

Datasaur acknowledges that evaluation metrics are commonly used to assess models rather than labeling work. This distinguishes them from IAA, whose primary purpose is to gauge the level of agreement among labelers. As previously mentioned, the system compares Reviewer Mode (the ground truth) against each individual labeler. However, you can still assess a model using one of the following approaches:

  1. The most straightforward approach is to use the Labeling Agent, which automatically calculates the evaluation metrics right after you complete the project in Reviewer Mode.

  2. Integrate the model's inference results through ML-Assisted Labeling, invoked by a specific labeler who represents the model. The system functions as usual, but evaluating Reviewer Mode against the labeler who provides the model's answers effectively evaluates the model itself. This approach can be used to evaluate multiple ML models.

    1. Create the project as usual with no consensus requirement, so that reviewers can enter the correct answer without resolving conflicts. To simplify the work, assign one person who can act as both Labeler and Reviewer. If you want to assess more than one ML model, assign the labelers accordingly.

    2. Open the project; you will be in Labeler Mode by default. Run ML-Assisted Labeling, which represents the model's inference.

    3. Change to Reviewer Mode and finish the labeling to provide the ground truth.

    4. Complete the project, triggering the Evaluation Metrics calculation.

  3. Integrate the model's inference results as pre-labeled data in the Datasaur Schema format. As with the previous approach, the pre-labeled answers are evaluated against Reviewer Mode. This approach only works for evaluating a single ML model.

    1. Create the project with the pre-labeled input. To simplify the work, assign one person who can act as both Labeler and Reviewer.

    2. Open the project; you will be in Labeler Mode by default.

    3. Change to Reviewer Mode and finish the labeling to provide the ground truth.

    4. Complete the project, triggering the Evaluation Metrics calculation.

Metrics

Evaluation metrics are calculated using the standard algorithm, which begins by constructing the confusion matrix. After the metrics are computed, they are macro-averaged to produce the final scores.
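
As a rough sketch of that flow (not Datasaur's internal implementation), the same steps can be reproduced with scikit-learn, treating the Reviewer Mode labels as the ground truth and a labeler's answers as the predictions; the label values below are hypothetical:

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical label pairs: Reviewer Mode is the ground truth, the labeler is evaluated.
reviewer_labels = ["PER", "ORG", "PER", "LOC", "ORG"]
labeler_labels  = ["PER", "PER", "PER", "LOC", "ORG"]

classes = sorted(set(reviewer_labels) | set(labeler_labels))
matrix = confusion_matrix(reviewer_labels, labeler_labels, labels=classes)

# Macro-averaging: compute precision, recall, and F1 per class,
# then take the unweighted mean across classes.
precision, recall, f1, _ = precision_recall_fscore_support(
    reviewer_labels, labeler_labels, labels=classes, average="macro", zero_division=0
)

print(matrix)
print(precision, recall, f1)  # for this toy data: ~0.889, ~0.833, ~0.822
```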

It is important to highlight how empty label pairs are treated in this process. Below are the definitions and rules for each labeling type:

  1. Row Labeling — Represents an optional question left unanswered in both Labeler Mode and Reviewer Mode.

    These cases are included in the calculation because an empty response may be the correct answer for certain optional questions. Excluding them would ignore valid labeling outcomes and distort the calculation.

  2. Span Labeling — Represents a span without a label in both Labeler Mode and Reviewer Mode.

    These cases are excluded from the calculation because the number of possible unlabeled spans can be extremely large. Including them would not be meaningful and would significantly skew the calculation.
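
The sketch below illustrates these rules with hypothetical data (it is not Datasaur's implementation): unanswered optional questions are kept as an explicit empty class for Row Labeling, while pairs that are unlabeled on both sides are dropped for Span Labeling.

```python
# Hypothetical (reviewer, labeler) pairs; None marks an empty answer or missing label.
row_pairs  = [("Yes", "Yes"), (None, None), ("No", "Yes")]
span_pairs = [("PER", "PER"), (None, None), ("ORG", "PER")]

# Row Labeling: keep empty pairs by mapping None to an explicit "empty" class.
row_eval = [(r or "empty", l or "empty") for r, l in row_pairs]

# Span Labeling: exclude pairs that are empty in both Labeler Mode and Reviewer Mode.
span_eval = [(r, l) for r, l in span_pairs if not (r is None and l is None)]

print(row_eval)   # [('Yes', 'Yes'), ('empty', 'empty'), ('No', 'Yes')]
print(span_eval)  # [('PER', 'PER'), ('ORG', 'PER')]
```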

Accuracy

  • The proportion of correctly labeled instances among the total labels.

  • Calculation = total correct labels divided by total labels.
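
For example, with hypothetical answers from one labeler:

```python
reviewer = ["Yes", "No", "Yes", "Yes"]   # ground truth from Reviewer Mode
labeler  = ["Yes", "No", "No",  "Yes"]   # one labeler's answers

accuracy = sum(r == l for r, l in zip(reviewer, labeler)) / len(reviewer)
print(accuracy)  # 3 correct out of 4 labels -> 0.75
```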

Precision

  • The ratio of correctly labeled positive instances to the total instances predicted as positive.

  • Calculation = True Positives / (True Positives + False Positives).

  • "Of all the instances predicted as positive, how many were actually positive?"

  • Real-world example: maximize precision for email spam detection, because we do not want a perfectly normal email to be incorrectly classified as spam; this essentially minimizes false positives.
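
A quick calculation with hypothetical counts for the spam example:

```python
true_positives  = 8   # spam emails correctly flagged as spam
false_positives = 2   # normal emails incorrectly flagged as spam

precision = true_positives / (true_positives + false_positives)
print(precision)  # 8 / (8 + 2) = 0.8
```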

Recall

  • The ratio of correctly labeled positive instances to the total actual positive instances.

  • Calculation = True Positives / (True Positives + False Negatives).

  • "Of all the actual positive instances, how many were correctly identified?"

  • Real-world example: maximize recall for medical diagnostic tools, because the system cannot afford to label a cancerous case as non-cancerous; this essentially minimizes false negatives.
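
With hypothetical counts for the diagnostic example:

```python
true_positives  = 8   # cancerous cases correctly identified
false_negatives = 4   # cancerous cases missed by the diagnosis

recall = true_positives / (true_positives + false_negatives)
print(recall)  # 8 / (8 + 4) is roughly 0.667
```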

F1 Score

  • The harmonic mean of precision and recall.

  • Calculation = 2 * ((Precision * Recall) / (Precision + Recall)).

  • Like the other metrics, it ranges from 0 to 1. For this metric specifically, a score of 1 is reached only when both precision and recall are perfect.
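
Continuing the hypothetical values from the examples above:

```python
precision, recall = 0.8, 0.667  # hypothetical values from the earlier examples

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # approximately 0.727
```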

Confusion Matrix

A tabular representation that provides a detailed breakdown of labels or answers. Its structure differs slightly depending on the labeling type:

  1. Row Labeling:

    1. Each matrix represents an eligible question (i.e., a question type supported for evaluation metrics, as described above).

    2. Both the rows and columns correspond to the available options for each question.

    3. An additional empty answer row and column are included to account for unanswered optional questions, as described above.

  2. Span Labeling:

    1. Each matrix represents a specific label set.

    2. Both the rows and columns correspond to the label classes applied within that set.

    3. An additional empty label row and column are included. The empty label pair is displayed as N/A, consistent with the explanation provided in the Metrics section.
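
The sketch below shows how such a matrix could be assembled for one label set; the label pairs are hypothetical, and None stands for the N/A empty label:

```python
from collections import Counter

# Hypothetical (reviewer, labeler) span label pairs within one label set;
# None means no label was applied on that side (shown as N/A in the matrix).
# Pairs that are empty on both sides would already have been excluded.
pairs = [("PER", "PER"), ("ORG", "PER"), ("PER", None), (None, "ORG")]

classes = ["PER", "ORG", None]   # label classes plus the N/A row/column
counts = Counter(pairs)

# Rows = Reviewer Mode (ground truth), columns = Labeler Mode.
for row in classes:
    cells = [counts[(row, col)] for col in classes]
    print(row or "N/A", cells)

# Output:
# PER [1, 0, 1]
# ORG [1, 0, 0]
# N/A [0, 1, 0]
```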

Filter

There are three types of filters for metric calculation, with the third depending on the labeling type; the default averages all data:

  1. By documents

  2. By labelers

  3. By questions (only for Row Labeling)

  4. By label sets (only for Span Labeling)
