
Experiments

An experiment on Confident AI is a contained way to benchmark LLM applications. You create an experiment and define evaluation metrics for it, which are used to evaluate and test your LLM application's performance at scale. Running an experiment produces a test run, which contains the evaluation results of the test cases your LLM application was evaluated on.

did you know?

You can evaluate test cases directly on Confident AI by simply sending them over via deepeval, with fields such as actual_output and retrieval_context populated by your LLM application. All compute and LLMs required for evaluation are provided by Confident AI.
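For instance, here is a minimal sketch of a test case with its actual_output and retrieval_context fields populated (the string values below are placeholders; in practice they would be generated by your LLM application at runtime):

from deepeval.test_case import LLMTestCase

# Placeholder values; in practice these would come from your LLM application
test_case = LLMTestCase(
    input="What are your return policies?",
    actual_output="You can return any item within 30 days of purchase.",
    retrieval_context=[
        "All purchases can be returned within 30 days for a full refund."
    ],
)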

Creating An Experiment

You can easily create an experiment on Confident AI's "Evaluation & Testing" page by providing your experiment with a unique name and a set of metrics to start with. In this RAG use case example, we have named our experiment "RAG Experiment" and have chosen the 'Answer Relevancy' and 'Contextual Relevancy' metrics as a starting point.


Once you have created an experiment, you can edit the metric configurations (such as thresholds), add additional metrics, or even change the experiment name on that experiment's individual page.


Running An Experiment

To run evaluations on your newly created experiment on Confident AI, simply:

  1. Create LLMTestCases/ConversationalTestCases (in code) with required fields such as actual_output generated by the LLM application you're trying to evaluate.
  2. Send the created test cases to Confident AI via deepeval using the confident_evaluate function, supplying the experiment_name in the process.
note

You must be logged in to Confident AI through deepeval for this to work.

from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

confident_evaluate(
    experiment_name="My First Experiment",
    test_cases=[LLMTestCase(...)]
)

There are two mandatory arguments and one optional argument when calling the confident_evaluate() function:

  • experiment_name: a string that specifies the name of the experiment you wish to evaluate your test cases against on Confident AI.
  • test_cases: a list of LLMTestCases/ConversationalTestCases OR an EvaluationDataset (see the sketch after this list). Confident AI will evaluate your LLM application on these test cases using the metrics you defined for this particular experiment.
  • disable_browser_opening: a boolean which, when set to True, disables the automatic opening of the browser that takes you to the experiment page of experiment_name.
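As an example, the following sketch passes an EvaluationDataset instead of a plain list and disables the automatic browser opening (the test case values are placeholders, and it assumes an EvaluationDataset can be constructed directly from a list of test cases):

from deepeval import confident_evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# Group test cases into a dataset instead of passing a plain list
dataset = EvaluationDataset(
    test_cases=[
        LLMTestCase(
            input="What are your return policies?",
            actual_output="You can return any item within 30 days of purchase.",
            retrieval_context=[
                "All purchases can be returned within 30 days for a full refund."
            ],
        )
    ]
)

confident_evaluate(
    experiment_name="My First Experiment",
    test_cases=dataset,
    # Prevent the experiment page from opening automatically in the browser
    disable_browser_opening=True,
)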

Once an experiment has finished running on Confident AI's infrastructure, a test run will be produced. A test run, as explained in the next section, is the evaluation result of your LLM application against the metrics defined for the experiment, and is also available to view on Confident AI.