# Automated Evaluation

## Overview

The automated evaluation feature addresses the challenges of evaluating completions manually: the process is time-consuming, labor-intensive, and prone to human error, which leads to inconsistent results. Automating it helps you save time, improve accuracy, and keep evaluations consistent.

### Prerequisites

To use automated evaluation, you need to complete some prerequisites based on what you want to evaluate:

**To evaluate an existing model in Datasaur:**

1. Ensure the model is deployed from the Sandbox or saved to the library.
2. Prepare a ground truth dataset in a CSV file with **two** columns: `prompt` and `expected completion`; see the sketch after the sample file below.

{% file src="/files/LZSJSCQYIlMIBNy0JoDY" %}
Sample file when using an LLM application
{% endfile %}
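For illustration, a two-column ground truth file could be produced with a short script like the one below. This is a minimal sketch: the prompts, expected completions, and the `ground_truth.csv` file name are made-up placeholders; only the column headers follow the required format.

```python
import csv

# Illustrative rows only; replace with your own prompts and reference answers.
rows = [
    {"prompt": "What is the capital of France?",
     "expected completion": "The capital of France is Paris."},
    {"prompt": "Summarize the water cycle in one sentence.",
     "expected completion": "Water evaporates, condenses into clouds, and falls back as precipitation."},
]

with open("ground_truth.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "expected completion"])
    writer.writeheader()   # writes the required `prompt,expected completion` header
    writer.writerows(rows)
```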

**To evaluate pre-generated completions (CSV file):**

1. Prepare a ground truth dataset in a CSV file with **three** columns: `prompt`, `completion`, and `expected completion`; see the sketch after the sample file below.

{% file src="/files/4ajfQF4DUVVobMXwVR4A" %}
Sample file when using a pre-generated dataset
{% endfile %}
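Similarly, a quick sanity check that a pre-generated dataset contains the three required columns might look like this sketch (the `pregenerated_dataset.csv` file name is an assumption):

```python
import csv

REQUIRED = {"prompt", "completion", "expected completion"}

with open("pregenerated_dataset.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    row_count = sum(1 for _ in reader)  # count data rows, excluding the header

print(f"OK: {row_count} rows with columns {sorted(REQUIRED)}")
```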

### Getting started

1. Navigate to the **Evaluation** page under the LLM Labs menu.
2. Click the **Create evaluation project** button, choose the **Automated evaluation** project type, then click **Continue**.

   <figure><img src="/files/CESwPmudc4oEmZkYXy4p" alt=""><figcaption></figcaption></figure>
3. Configure your evaluation project. Automated evaluation supports two input types:
   1. **Model from Sandbox**

      * Select the model that you want to evaluate. If you can’t find your model in the list, go to the [Sandbox](https://docs.datasaur.ai/llm-projects/sandbox) where your model was created, and [deploy](https://docs.datasaur.ai/llm-projects/sandbox#deploying-the-llm) it or save it to the library. Only deployed or saved models can be evaluated.
      * Upload the ground truth dataset in CSV format containing **two** columns: `prompt` and `expected completion`.

      <figure><img src="/files/w93nqErszbr2VjRip28J" alt=""><figcaption></figcaption></figure>
   2. **Pre-generated completions**

      * Upload the pre-generated completions combined with the ground truth dataset in CSV format with **three** columns: `prompt`, `completion`, and `expected completion`.

      <figure><img src="/files/iPC8GvcQX6BiITfMdPYO" alt=""><figcaption></figcaption></figure>
4. **Manage evaluation**: Select the metric, provider, and evaluator model you want to use for the evaluation.

   <figure><img src="/files/ZDe7Hr9pqGcZWitlE3J6" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
Currently, we support LangChain, Ragas, and DeepEval as evaluation providers. You can find the list of supported metrics in the [Evaluators](https://docs.datasaur.ai/llm-projects/evaluation/automated-evaluation#evaluators) section.
{% endhint %}

### Analyze the evaluation results

After the evaluation process is completed, you can analyze the results:

1. **For models:**

   * **Generation cost and processing time**: View the total cost and time taken to generate completions.
   * **Average score**: See the overall performance score given by the evaluator.
   * **Detailed results**: For each prompt, you can examine:
     * The quality of the generated completion
     * Processing time
     * Individual score

   <figure><img src="/files/pbgJuBFP9iKhKTFBYGsd" alt=""><figcaption></figcaption></figure>
2. **For pre-generated completions:**

   * **Average score**: See the overall performance score given by the evaluator.
   * **Detailed results**: For each prompt, you can examine:
     * The quality of the pre-generated completion
     * Individual score

   <figure><img src="/files/Hy3QK101E2x9vzLY3wgM" alt=""><figcaption></figcaption></figure>

### Evaluators

Automated evaluation supports various industry-standard evaluators to provide you with comprehensive insights into your model's performance. Each evaluator comes with a set of specific metrics tailored to different aspects of LLM evaluation.

#### LangChain

* [**Answer Correctness**](https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/string/scoring_eval_chain/): Measures the accuracy of the LLM's response compared to the ground truth.
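Outside of Datasaur, the same LangChain scoring evaluator can be run directly on a single prompt/completion/reference triple. The snippet below is a rough sketch assuming a recent `langchain` release with the `langchain-openai` package installed and an OpenAI API key configured; the judge model name and example strings are placeholders, not part of Datasaur's setup.

```python
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

# Chat model that acts as the judge; swap in whichever model you have access to.
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

evaluator = load_evaluator("labeled_score_string", llm=judge)
result = evaluator.evaluate_strings(
    input="What is the capital of France?",          # prompt
    prediction="Paris is the capital of France.",    # model completion
    reference="The capital of France is Paris.",     # expected completion
)
print(result["score"], result["reasoning"])
```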

#### Ragas

* [**Answer Correctness**](https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html): Measures the accuracy of the LLM's response compared to the ground truth.
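As a rough sketch of how Ragas scores answer correctness on its own, the snippet below uses the Ragas 0.1-style `evaluate` API; it assumes the `datasets` package is installed and a judge LLM (e.g. an OpenAI key) is configured, and the example rows are illustrative.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

# One-row toy dataset: "answer" is the model completion, "ground_truth" the reference.
data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "ground_truth": ["The capital of France is Paris."],
})

scores = evaluate(data, metrics=[answer_correctness])
print(scores)  # e.g. {'answer_correctness': 0.98}
```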

#### DeepEval

* [**Answer Relevancy**](https://docs.confident-ai.com/docs/metrics-answer-relevancy): Evaluates how relevant the LLM's responses are to the given questions.
* [**Bias**](https://docs.confident-ai.com/docs/metrics-bias): Assesses the presence of bias in the LLM's outputs based on predefined criteria.
* [**Toxicity**](https://docs.confident-ai.com/docs/metrics-toxicity): Detects and quantifies toxic language or harmful content in the LLM's responses.
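For reference, DeepEval exposes these metrics as standalone classes that score an `LLMTestCase`. The sketch below assumes the `deepeval` package and a configured OpenAI key; the example strings are placeholders.

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",           # prompt
    actual_output="Paris is the capital of France.",  # model completion
)

metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)   # runs the LLM-based relevancy check
print(metric.score, metric.reason)
```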


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.datasaur.ai/llm-projects/evaluation/automated-evaluation.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
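For example, a minimal Python call to this endpoint could look like the sketch below; the question text is only an illustration.

```python
import urllib.parse
import urllib.request

question = "Which metrics does the Ragas provider support?"
url = (
    "https://docs.datasaur.ai/llm-projects/evaluation/automated-evaluation.md"
    "?ask=" + urllib.parse.quote(question)
)

# Returns a direct answer plus relevant excerpts and sources from the documentation.
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))
```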
