Multi-application evaluation

Overview

This feature lets you compare and evaluate the performance of multiple models using metrics from established evaluators such as Ragas, LangChain, and DeepEval. By streamlining the assessment process, you can gain insight into the strengths and weaknesses of different applications and make data-driven decisions to optimize your workflows.

Get started

To evaluate multiple models:

  1. Navigate to the Evaluation page under the LLM Labs menu.

  2. Click the Create evaluation project button, choose the Automated evaluation project type, then click Continue.

  3. Configure your evaluation by selecting the models to evaluate and choosing a dataset from the library. If you don’t have one, you can upload a dataset in CSV format containing two columns: prompt and expected completion (see the example after these steps).

If you can’t find your model in the list, go to the Sandbox where the model was created and deploy it or save it to the library. Only deployed or saved models can be evaluated.

  4. Select the metric, provider, and evaluator model you want to use for the evaluation (a conceptual sketch of how such metrics work follows these steps). Learn more about the evaluators and metrics.

  5. Click Create evaluation project and wait for the evaluation process to finish.
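
For reference, here is a minimal sketch of a dataset in the expected CSV format. The two column headers, prompt and expected completion, come from step 3 above; the rows themselves are invented examples.

```csv
prompt,expected completion
"Summarize the plot of Hamlet in one sentence.","Prince Hamlet avenges his father's murder by his uncle Claudius, at the cost of nearly everyone at court."
"Translate 'good morning' into French.","Bonjour"
```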
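
Conceptually, evaluators such as Ragas, LangChain, and DeepEval grade each generated completion against the expected completion, often by prompting a separate judge model. The sketch below illustrates that LLM-as-judge pattern only; it is not the API of any of these libraries, and judge_model is an assumed callable that sends a prompt to the evaluator model and returns its text reply.

```python
# Hypothetical sketch of an LLM-as-judge metric. judge_model is an
# assumed callable (prompt -> text reply); this is not the Ragas,
# LangChain, or DeepEval API.

def correctness_score(prompt: str, completion: str, expected: str,
                      judge_model) -> float:
    """Ask a judge model to grade a completion from 0 to 10; return 0-1."""
    grading_prompt = (
        "Rate how well the response answers the prompt compared to the "
        "reference, on a scale of 0 (wrong) to 10 (equivalent). "
        "Reply with the number only.\n\n"
        f"Prompt: {prompt}\nResponse: {completion}\nReference: {expected}"
    )
    reply = judge_model(grading_prompt)
    try:
        # Clamp to the stated scale, then normalize to 0-1.
        return max(0.0, min(10.0, float(reply.strip()))) / 10.0
    except ValueError:
        return 0.0  # an unparseable judge reply counts as a failed grade
```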

Analyze the evaluation results

Once the evaluation process finishes, you can analyze the results in the following views.

Summary of the evaluation

Here you can view the total cost, the time taken to generate completions, and the overall performance score given by the evaluator.
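
As a rough illustration of how these figures relate, the sketch below aggregates hypothetical per-completion records into a total cost, total generation time, and mean score. The field names are assumptions made for illustration, not the product's data model.

```python
from dataclasses import dataclass

@dataclass
class RowResult:
    # Hypothetical per-completion record; field names are assumed.
    cost_usd: float    # cost of generating this completion
    latency_s: float   # time taken to generate it
    score: float       # evaluator score, normalized to 0-1

def summarize(rows: list[RowResult]) -> dict:
    """Roll per-row results up into the kind of summary shown on this page."""
    return {
        "total_cost_usd": sum(r.cost_usd for r in rows),
        "total_time_s": sum(r.latency_s for r in rows),
        "overall_score": sum(r.score for r in rows) / len(rows) if rows else 0.0,
    }

print(summarize([RowResult(0.002, 1.4, 0.9), RowResult(0.003, 2.1, 0.7)]))
```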

Result and score from each model

Here you can view the quality, score, and processing time of the generated completions from each model.

Evaluation details

To view the evaluation details of a completion, click the More icon (three dots) at the far right of the row, then select View details.
