Automated Evaluation

Overview

The LLM Labs Automated Evaluation feature addresses the challenges of evaluating completions manually: the process is time-consuming, labor-intensive, and prone to human error, which leads to inconsistent results. Automating the evaluation process saves time, improves accuracy, and keeps evaluations consistent.

Prerequisites

To use Automated Evaluation, you need to complete some prerequisites based on what you want to evaluate:

To evaluate an existing LLM application in Datasaur:

  1. Ensure the LLM application is deployed.

  2. Prepare a ground truth dataset in a CSV file with two columns: prompt and expected completion.

Sample file when using LLM application
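
The exact column headers should follow the sample file above; purely as an illustration, a two-column ground truth CSV could look like this:

```csv
prompt,expected completion
What is the capital of France?,Paris
What is 2 + 2?,4
```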

To evaluate pre-generated completions (CSV file):

  1. Prepare a ground truth dataset in a CSV file with three columns: prompt, completion, and expected completion.

Sample file when using Pregenerated dataset
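
Again following the column headers from the sample file above, an illustrative three-column CSV combining prompts, pre-generated completions, and expected completions could look like this:

```csv
prompt,completion,expected completion
What is the capital of France?,The capital of France is Paris.,Paris
What is 2 + 2?,2 + 2 equals 4.,4
```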

Getting started

To begin using the LLM Automated Evaluation:

  1. Navigate to the Evaluation page under the LLM Labs menu.

  2. Click the Create evaluation project button and choose the Automated evaluation project type.

Create evaluation project dialog
  3. Configure your evaluation. Automated evaluation supports two types of targets:

    1. LLM application

      • Select the application you want to evaluate. If you can’t find your application in the list, go to the sandbox where it was created and deploy it; only deployed LLM applications can be evaluated.

      • Upload the ground truth dataset as a CSV file containing two columns: prompt and expected completion.

    2. Pre-generated completions

      • Upload the pre-generated completions together with the ground truth dataset as a CSV file with three columns: prompt, completion, and expected completion.

Set up automated evaluation project
  4. Manage evaluation: Select the Metric, Provider, and Evaluator model you want to use for the evaluation.

Manage evaluation page

Currently, Langchain and Ragas are among the supported evaluation providers; you can find the full list of providers and supported metrics in the Evaluators section below.

Analyzing the evaluation results

After the evaluation process is completed, you can analyze the results:

  1. For LLM Labs applications:

    • Generation cost and processing time: View the total cost and time taken for generating completions.

    Cost and processing time
    • Average score: See the overall performance score given by the evaluator.

    Average evaluator scores
    • Detailed results: For each prompt, you can examine:

      • The quality of the generated completion

      • Processing time

      • Individual score

    Detailed results
  2. For pre-generated completions:

    • Average score: See the overall performance score given by the evaluator.

    Pre-generated average evaluator scores
    • Detailed results: For each prompt, you can examine:

      • The quality of the pre-generated completion

      • Individual score

    Pre-generated detailed results

Evaluators

LLM Labs Automated Evaluation supports various industry-standard evaluators to provide you with comprehensive insights into your LLM's performance. Each evaluator comes with a set of specific metrics tailored to different aspects of LLM evaluation.

Langchain

  • Answer Correctness: Measures the accuracy of the LLM's response compared to the ground truth.
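
For background on what a reference-based correctness check involves, the sketch below runs a similar evaluation with the langchain package directly. This is illustrative only, not Datasaur's internal implementation; it assumes the langchain and langchain-openai packages are installed, an OPENAI_API_KEY is set, and the judge model name is an arbitrary choice.

```python
# Minimal sketch: reference-based correctness check with LangChain's evaluation API.
# Illustrative only; assumes `pip install langchain langchain-openai` and OPENAI_API_KEY.
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

evaluator = load_evaluator(
    "labeled_criteria",          # criteria evaluator that also takes a reference answer
    criteria="correctness",
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),  # judge model (assumption)
)

result = evaluator.evaluate_strings(
    input="What is the capital of France?",        # prompt
    prediction="The capital of France is Paris.",  # generated completion
    reference="Paris",                             # expected completion
)
print(result["score"], result["reasoning"])
```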

Ragas

  • Answer Correctness: Measures the accuracy of the LLM's response compared to the ground truth.
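
To illustrate how this metric is typically scored, here is a hedged sketch using the ragas library directly rather than through Datasaur. It assumes the ragas and datasets packages are installed, an OPENAI_API_KEY is available for the judge model, and note that the expected column name (ground_truth vs. ground_truths) varies between ragas versions.

```python
# Minimal sketch: answer correctness with the ragas library.
# Illustrative only; assumes `pip install ragas datasets` and OPENAI_API_KEY.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],       # prompt
    "answer": ["The capital of France is Paris."],         # generated completion
    "ground_truth": ["Paris"],                              # expected completion
})

result = evaluate(data, metrics=[answer_correctness])
print(result)  # e.g. a single answer_correctness score between 0 and 1
```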

Deepeval

  • Answer Relevance: Evaluates how relevant the LLM's responses are to the given questions.

  • Bias: Assesses the presence of bias in the LLM's outputs based on predefined criteria.

  • Toxicity: Detects and quantifies toxic language or harmful content in the LLM's responses.
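
As a rough picture of what these metrics compute, the sketch below runs them with the deepeval library directly; it is not Datasaur's internal implementation and assumes deepeval is installed and an OpenAI API key is configured for the default judge model.

```python
# Minimal sketch: Deepeval's answer relevancy, bias, and toxicity metrics
# on a single prompt/completion pair. Illustrative only; assumes
# `pip install deepeval` and OPENAI_API_KEY for the default judge model.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric, ToxicityMetric

test_case = LLMTestCase(
    input="What is the capital of France?",           # prompt
    actual_output="The capital of France is Paris.",  # completion to evaluate
)

for metric in (AnswerRelevancyMetric(), BiasMetric(), ToxicityMetric()):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.reason)
```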
