# Automated Evaluation

## Overview

The automated evaluation feature addresses the challenges of evaluating completions by hand. Manual evaluation is time-consuming, labor-intensive, and prone to human error, which leads to inconsistent results. Automating the process saves time, improves accuracy, and keeps evaluations consistent.

### Prerequisites

To use automated evaluation, you need to complete some prerequisites based on what you want to evaluate:

**To evaluate an existing model in Datasaur:**

1. Ensure the model is deployed from Sandbox.
2. Prepare a ground truth dataset in a CSV file with **two** columns: `prompt` and `expected completion`.

{% file src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-2af95daa206892268877a4ca67a9f973ad406624%2FAutomated%20Evaluation%20-%20Application.csv?alt=media&token=20364846-9c27-4d03-acef-d6d8f12f34d7" %}
Sample file when using LLM application
{% endfile %}
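
The two-column file can be produced with any spreadsheet tool or a short script. A minimal Python sketch (the example rows are hypothetical placeholders; note the second column header contains a space):

```python
import csv

# Hypothetical ground truth rows illustrating the expected layout.
rows = [
    {"prompt": "What is the capital of France?", "expected completion": "Paris"},
    {"prompt": "Who wrote Moby-Dick?", "expected completion": "Herman Melville"},
]

with open("ground_truth.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "expected completion"])
    writer.writeheader()
    writer.writerows(rows)
```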

**To evaluate pre-generated completions (CSV file):**

1. Prepare a ground truth dataset in a CSV file with **three** columns: `prompt`, `completion`, and `expected completion`.

{% file src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-297870d1151978fa87ed6964eb5bdd2e4bfeca65%2FAutomated%20Evaluation%20-%20Pregenerated.csv?alt=media&token=c3596e5c-c8f8-4fe7-a826-4ebfe6e78f12" %}
Sample file when using Pregenerated dataset
{% endfile %}
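
Before uploading either file type, it can help to sanity-check it locally. A hypothetical validation helper (the function name and checks are illustrative, not part of Datasaur):

```python
import csv

# Required headers for each upload mode, as described above.
REQUIRED = {
    "model": ["prompt", "expected completion"],
    "pregenerated": ["prompt", "completion", "expected completion"],
}

def validate_ground_truth(path: str, mode: str) -> None:
    """Raise ValueError if the CSV lacks required columns or has empty cells."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = [c for c in REQUIRED[mode] if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        for line_no, row in enumerate(reader, start=2):
            for col in REQUIRED[mode]:
                if not (row[col] or "").strip():
                    raise ValueError(f"empty {col!r} on line {line_no}")
```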

### Getting started

1. Navigate to the **Evaluation** page under the LLM Labs menu.
2. Click the **Create evaluation project** button, choose the **Automated evaluation** project type, then click **Continue**.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-aebfa48377bb382a86cde19afd25715f0180d801%2FEvaluation%20-%20Create%20evaluation%20project%20dialog.png?alt=media" alt=""><figcaption></figcaption></figure>
3. Configure your evaluation project. You can evaluate two types with automated evaluation:
   1. **Model from Sandbox**

      * Select the model that you want to evaluate. If you can’t find your model in the list, go to the [Sandbox](https://docs.datasaur.ai/llm-projects/sandbox) where your model was created, and [deploy](https://docs.datasaur.ai/llm-projects/sandbox#deploying-the-llm) it or save it to the library. Only deployed or saved models can be evaluated.
      * Upload the ground truth dataset in CSV format containing **two** columns: `prompt` and `expected completion`.

      <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-995df2ed176b7c4839291945a99f1f2681e88b10%2FAutomated%20evaluation%20-%20PCW%20-%20Step%201.png?alt=media" alt=""><figcaption></figcaption></figure>
   2. **Pre-generated completions**

      * Upload the pre-generated completions together with the ground truth dataset in CSV format containing **three** columns: `prompt`, `completion`, and `expected completion`.

      <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-4ae25b71766130ee73197f506c0e2ba69ff3b169%2FAutomated%20evaluation%20-%20PCW%20-%20Step%201%20-%20pregenerated.png?alt=media" alt=""><figcaption></figcaption></figure>
4. **Manage evaluation**: Select the metric, provider, and evaluator model to use for the evaluation.

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-395aa9074c5a5e7da924c38475b7d72a6bb85117%2FAutomated%20evaluation%20-%20PCW%20-%20Step%202.png?alt=media" alt=""><figcaption></figcaption></figure>

{% hint style="info" %}
Currently, we support Langchain, Ragas, and Deepeval as our evaluation providers. You can find the list of supported metrics in the [Evaluators](https://docs.datasaur.ai/llm-projects/evaluation/automated-evaluation#evaluators) section.
{% endhint %}

### Analyze the evaluation results

After the evaluation process completes, you can analyze the results:

1. **For models:**

   * **Generation cost and processing time**: View the total cost and time taken for generating completions.
   * **Average score**: See the overall performance score given by the evaluator.
   * **Detailed results**: For each prompt, you can examine:
     * The quality of the generated completion
     * Processing time
     * Individual score

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-5a3304b46fc969f8500e39a92660aebd1aa27347%2FAutomated%20evaluation%20-%20Project%20-%20Single%20application.png?alt=media" alt=""><figcaption></figcaption></figure>
2. **For pre-generated completions:**

   * **Average score**: See the overall performance score given by the evaluator.
   * **Detailed results**: For each prompt, you can examine:
     * The quality of the pre-generated completion
     * Individual score

   <figure><img src="https://448889121-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MbjY0HseEqu7LtYAt4d%2Fuploads%2Fgit-blob-93adebee957bd9fa8762b5c3296b3495dc8dda16%2FAutomated%20evaluation%20-%20Project%20-%20Pregenerated%20completions.png?alt=media" alt=""><figcaption></figcaption></figure>
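
If you copy the individual scores out of the detailed results, the overall figure is presumably the mean of the per-prompt scores. A small sketch (the score values are hypothetical):

```python
import statistics

# Hypothetical per-prompt scores as they might appear in the detailed results.
scores = [0.92, 0.85, 0.78, 0.95]

# Aggregate the individual scores into a single average.
average_score = statistics.mean(scores)
print(round(average_score, 2))
```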

### Evaluators

Automated evaluation supports various industry-standard evaluators to provide you with comprehensive insights into your model's performance. Each evaluator comes with a set of specific metrics tailored to different aspects of LLM evaluation.

#### Langchain

* [**Answer Correctness**](https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/string/scoring_eval_chain/): Measures the accuracy of the LLM's response compared to the ground truth.

#### Ragas

* [**Answer Correctness**](https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html): Measures the accuracy of the LLM's response compared to the ground truth.

#### Deepeval

* [**Answer relevance**](https://docs.confident-ai.com/docs/metrics-answer-relevancy): Evaluates how relevant the LLM's responses are to the given questions.
* [**Bias**](https://docs.confident-ai.com/docs/metrics-bias): Assesses the presence of bias in the LLM's outputs based on predefined criteria.
* [**Toxicity**](https://docs.confident-ai.com/docs/metrics-toxicity): Detects and quantifies toxic language or harmful content in the LLM's responses.
