# Evaluation Metrics

To view evaluation metrics for a project:

1. Go to the **Projects** page.
2. Click the three-dot menu on a specific project and select **View project analytics**.
3. Go to the **Evaluation metrics** tab.

{% hint style="info" %}
Evaluation metrics are available only after a **project is completed**. During calculation, labels from **reviewer mode** are treated as the ground truth, and each labeler is evaluated against this version.
{% endhint %}

Currently, evaluation metrics are available for:

* **Row labeling** projects with dropdown, hierarchical dropdown, True/False, or single choice questions. Only questions that accept a single answer are supported; dropdown questions that allow multiple answers are excluded.
* **Span labeling** projects.

## How to use

Evaluation metrics are typically used to assess models rather than human labeling quality. This differs from inter-annotator agreement (IAA), which measures agreement between labelers. In this system, reviewer mode acts as the ground truth and is compared against each labeler.

You can evaluate a model using one of the following approaches:

* Use a labeling agent.
* Use the **ML-assisted labeling** extension.
* Use pre-labeled data (Datasaur schema).

### Labeling agent

The [labeling agent](/agent/labeling-agent.md) automatically calculates evaluation metrics after the project is completed by a reviewer.

### ML-assisted labeling

Use [ML-assisted labeling](/advanced/extensions/ml-assisted-labeling.md) to integrate model inference results. A designated labeler **runs the model**, and the responses represent the model’s predictions. These responses are then compared to the reviewer’s answers for evaluation. This approach supports evaluating multiple models.

To evaluate models with the **ML-assisted labeling** extension:

1. Create a project with no consensus to allow reviewers to set the correct answers directly.
2. Assign one person as both labeler and reviewer (optional, for simplicity). To evaluate multiple models, assign additional labelers accordingly.
3. Open the project and make sure you are in labeler mode.
4. Open the **ML-assisted labeling** extension and call the model. Accept all the suggestions.
5. Switch to reviewer mode and finalize the ground truth.
6. Mark the project as complete. This will trigger the evaluation metrics calculation.

### Pre-labeled data

This approach evaluates a single model using pre-labeled data in the [Datasaur Schema format](/compatibility-and-updates/supported-formats.md). The pre-labeled data is then compared to the reviewer’s answers for evaluation.

To evaluate a model using pre-labeled data:

1. Create a project with pre-labeled data in Datasaur schema format.
2. Assign one person as both labeler and reviewer (optional, for simplicity).
3. Open the project.
4. Switch to reviewer mode and finalize the ground truth.
5. Mark the project as complete. This will trigger the evaluation metrics calculation.

## Metrics

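Evaluation metrics are calculated using standard algorithms. The process begins with constructing a confusion matrix from the labeler's answers and the reviewer's ground truth. The resulting scores are then macro-averaged, meaning each class contributes equally regardless of how often it occurs, to produce the final scores.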
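
The confusion matrix and macro-averaging steps can be illustrated with a minimal Python sketch. This is not Datasaur's implementation: the function name `macro_scores`, the sample labels, and the convention of scoring an undefined ratio (zero denominator) as 0 are all assumptions made for illustration.

```python
from collections import Counter


def macro_scores(truth, predicted):
    """Return macro-averaged precision, recall, and F1 for two label lists."""
    # Sparse confusion matrix: (ground truth, prediction) -> count.
    matrix = Counter(zip(truth, predicted))
    classes = sorted(set(truth) | set(predicted))
    if not classes:
        raise ValueError("at least one label pair is required")

    per_class = []
    for c in classes:
        tp = matrix[(c, c)]
        fp = sum(n for (t, p), n in matrix.items() if p == c and t != c)
        fn = sum(n for (t, p), n in matrix.items() if t == c and p != c)
        # Assumption: an undefined ratio (zero denominator) is scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))

    # Macro average: every class has the same weight.
    n = len(classes)
    return {
        "precision": sum(s[0] for s in per_class) / n,
        "recall": sum(s[1] for s in per_class) / n,
        "f1": sum(s[2] for s in per_class) / n,
    }


reviewer = ["pos", "pos", "neg", "neg", "neu"]  # ground truth (reviewer mode)
labeler = ["pos", "neg", "neg", "neg", "pos"]   # answers being evaluated
scores = macro_scores(reviewer, labeler)
print(scores)
```

In this sample the `neu` class is never predicted correctly, so it pulls every macro-averaged score down even though it appears only once.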

### Handling empty labels

1. In row labeling projects, empty answers (unanswered optional questions) are included in the calculation, as they may represent valid outcomes.
2. In span labeling projects, empty spans are excluded due to the large number of possible unlabeled spans, which would skew results.
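
The difference between the two project types can be sketched in Python. This is only an illustration under stated assumptions: the helper names are hypothetical, empty answers are modeled as `None`, and spans are matched by exact `(start, end)` position, which may not be how Datasaur aligns spans.

```python
def pair_row_answers(reviewer_answers, labeler_answers):
    """Row labeling: every row is kept, including rows where an answer is empty."""
    return list(zip(reviewer_answers, labeler_answers))


def pair_spans(reviewer_spans, labeler_spans):
    """Span labeling: only positions labeled by at least one side are paired.

    Positions that nobody labeled never enter the result, so the
    (empty, empty) pair cannot occur.
    """
    positions = sorted(set(reviewer_spans) | set(labeler_spans))
    return [(reviewer_spans.get(pos), labeler_spans.get(pos)) for pos in positions]


# Row labeling: None marks an unanswered optional question.
rows = pair_row_answers(["yes", None, "no"], ["yes", None, None])

# Span labeling: keys are hypothetical (start, end) character offsets.
spans = pair_spans(
    {(0, 5): "PER", (10, 15): "ORG"},
    {(0, 5): "PER", (20, 25): "LOC"},
)

print(rows)
print(spans)
```

The row pairs include `(None, None)` as a valid, correct outcome, while the span pairs only ever contain an empty value on one side.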

### Accuracy

* The proportion of correctly labeled instances among the total labels.
* Calculation = total correct labels divided by total labels.
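
As a small worked example, the calculation can be written in Python as follows. The function name and the sample answers are assumptions made for illustration.

```python
def accuracy(truth, predicted):
    """Total correct labels divided by total labels."""
    if len(truth) != len(predicted):
        raise ValueError("both lists must have the same length")
    if not truth:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)


reviewer = ["A", "B", "A", "C"]  # ground truth
labeler = ["A", "B", "C", "C"]   # one mistake on the third item
result = accuracy(reviewer, labeler)
print(result)  # 0.75
```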

### Precision

* The ratio of correctly labeled positive instances to the total instances predicted as positive.
* Calculation = True Positives / (True Positives + False Positives).
* *"Of all the instances predicted as positive, how many were actually positive?"*
* Real-world example: spam detection, where false positives should be minimized so that a perfectly normal email is not incorrectly classified as spam.
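
The spam scenario can be sketched for a single positive class in Python. The function name, the sample data, and returning 0 when nothing is predicted as positive are assumptions made for illustration.

```python
def precision(truth, predicted, positive):
    """True Positives / (True Positives + False Positives) for one class."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    # Assumption: if nothing was predicted as positive, report 0.
    return tp / (tp + fp) if tp + fp else 0.0


reviewer = ["spam", "ham", "spam", "ham", "ham"]  # ground truth
model = ["spam", "spam", "spam", "ham", "ham"]    # one normal email flagged as spam
spam_precision = precision(reviewer, model, positive="spam")
print(spam_precision)
```

Three emails were predicted as spam and two of them really were, so the precision for `spam` is 2/3.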

### Recall

* The ratio of correctly labeled positive instances to the total actual positive instances.
* Calculation = True Positives / (True Positives + False Negatives).
* *"Of all the actual positive instances, how many were correctly identified?"*
* Real-world example: medical diagnosis, where false negatives must be minimized because the system cannot afford to label a cancerous case as non-cancerous.
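
The diagnosis scenario can be sketched for a single positive class in Python. The function name, the sample data, and returning 0 when there are no actual positives are assumptions made for illustration.

```python
def recall(truth, predicted, positive):
    """True Positives / (True Positives + False Negatives) for one class."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumption: if there are no actual positives, report 0.
    return tp / (tp + fn) if tp + fn else 0.0


reviewer = ["cancer", "cancer", "benign", "cancer", "benign"]  # ground truth
model = ["cancer", "benign", "benign", "cancer", "benign"]     # one missed case
cancer_recall = recall(reviewer, model, positive="cancer")
print(cancer_recall)
```

Of the three actual `cancer` cases, the model found two, so the recall for `cancer` is 2/3.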

### F1 score

* The harmonic mean of precision and recall.
* Calculation = 2 \* ((Precision \* Recall) / (Precision + Recall)).
* A score of 1 indicates perfect precision and perfect recall. The score drops toward 0 when either of the two is low.
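
A short Python sketch shows how the harmonic mean penalizes an imbalance between the two inputs more than a plain average would. The function name and the convention of returning 0 when both inputs are 0 are assumptions made for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # Assumption: treat the undefined 0/0 case as 0.
    return 2 * (precision * recall) / (precision + recall)


balanced = f1_score(1.0, 1.0)    # perfect precision and recall
imbalanced = f1_score(1.0, 0.5)  # perfect precision, half of the positives missed
print(balanced, imbalanced)
```

With precision 1.0 and recall 0.5, the arithmetic mean would be 0.75, while the F1 score is about 0.67.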

### Confusion matrix

A confusion matrix provides a detailed breakdown of predictions versus ground truth.

* Row labeling:
  * The matrix represents each eligible question.
  * Both the rows and columns correspond to the available options for each question.
  * An additional *empty answer* row and column are included to account for unanswered optional questions, as described above.
* Span labeling:
  * The matrix represents a specific label set.
  * Both the rows and columns correspond to the label classes applied within that set.
  * An additional *empty label* row and column are included. The cell where both the row and the column are the empty label is displayed as **N/A**, because empty spans are excluded from the calculation, as described in Handling empty labels.
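
For a row labeling question, the matrix layout can be sketched in Python as follows. This is an illustration only: the `EMPTY` display name, the function name, and the choice of rows for ground truth and columns for the evaluated answers are assumptions, and the product may orient the axes differently.

```python
EMPTY = "(empty answer)"  # hypothetical display name for an unanswered question


def build_matrix(options, truth, predicted):
    """Return (labels, matrix) where matrix[i][j] counts truth i vs. prediction j."""
    labels = list(options) + [EMPTY]  # extra row and column for empty answers
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(truth, predicted):
        row = index[EMPTY if t is None else t]
        col = index[EMPTY if p is None else p]
        matrix[row][col] += 1
    return labels, matrix


labels, matrix = build_matrix(
    ["Yes", "No"],
    truth=["Yes", "No", None, "Yes"],      # reviewer answers, None = unanswered
    predicted=["Yes", None, None, "No"],   # labeler answers
)
for label, row in zip(labels, matrix):
    print(f"{label:>15}", row)
```

Counts on the diagonal are agreements with the ground truth; every other cell shows which option was confused with which.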

## Filters

By default, metrics are calculated across all data. You can narrow the calculation by filtering on:

1. Documents.
2. Labelers.
3. Questions (row labeling only).
4. Label sets (span labeling only).
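
Conceptually, a filter narrows the set of label pairs before the metrics are computed. The Python sketch below illustrates this idea for row labeling; the record fields (`document`, `labeler`, `question`, `truth`, `predicted`) and function names are hypothetical and do not reflect Datasaur's data model or API.

```python
def filter_records(records, documents=None, labelers=None, questions=None):
    """Keep records matching every filter that is set; None means no filter."""
    def keep(record):
        return ((documents is None or record["document"] in documents)
                and (labelers is None or record["labeler"] in labelers)
                and (questions is None or record["question"] in questions))
    return [r for r in records if keep(r)]


def accuracy(records):
    """Share of records where the evaluated answer matches the ground truth."""
    return sum(r["truth"] == r["predicted"] for r in records) / len(records)


records = [
    {"document": "a.csv", "labeler": "model-1", "question": "Sentiment",
     "truth": "pos", "predicted": "pos"},
    {"document": "a.csv", "labeler": "model-2", "question": "Sentiment",
     "truth": "pos", "predicted": "neg"},
    {"document": "b.csv", "labeler": "model-1", "question": "Sentiment",
     "truth": "neg", "predicted": "neg"},
    {"document": "b.csv", "labeler": "model-1", "question": "Topic",
     "truth": "news", "predicted": "blog"},
]

overall = accuracy(records)                                        # all data
model_1 = accuracy(filter_records(records, labelers={"model-1"}))  # one labeler
print(overall, model_1)
```

Filtering by labeler is what makes a side-by-side comparison of several models possible when each model is represented by its own labeler.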


