Experiments
An experiment on Confident AI is a contained way to benchmark LLM applications. You can create an experiment on Confident AI and define evaluation metrics for it to evaluate your LLM application's performance at scale. Running an experiment produces a test run, which contains the evaluation results of the test cases your LLM application was evaluated on.
You can evaluate test cases produced by your LLM application directly on Confident AI by simply sending over test cases via deepeval, with fields such as actual_output and retrieval_context populated by your LLM application. All compute and LLMs required for evaluation are provided by Confident AI.
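For example, a single test case for a RAG application might look like the following (a minimal sketch; the input, actual_output, and retrieval_context values are placeholders you would populate from your own application):

from deepeval.test_case import LLMTestCase

# Placeholder values -- replace with data produced by your LLM application
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a full refund within 30 days of purchase.",
    retrieval_context=["All purchases can be refunded within 30 days."]
)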
Creating An Experiment
You can easily create an experiment on the Evaluation page by providing your experiment with a unique name and a set of metrics to start with. In this RAG use case example, we have named our experiment "RAG Experiment" and have chosen the 'Answer Relevancy' and 'Contextual Relevancy' metrics as a starting point.
Once you have created an experiment, you can edit its metric configurations (such as thresholds), add additional metrics, or even rename the experiment on its individual experiment page.
Running Your First Experiment
We DON'T recommend doing this until you're 100% happy with the evaluations you run locally, as explained in the previous section. Running evaluations on Confident AI brings no additional benefits apart from the fact that you can trigger it on the platform without going through code. This benefits non-technical team members who need no-code workflows, but also means you lose the ability to fully customize your metrics.
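For reference, a local evaluation with a customizable metric might look something like this (a minimal sketch; evaluate and AnswerRelevancyMetric come from deepeval, while the test case values are placeholders):

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Placeholder test case -- replace with outputs from your LLM application
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a full refund within 30 days of purchase."
)

# Metrics run locally are fully configurable (e.g. thresholds)
metric = AnswerRelevancyMetric(threshold=0.7)

evaluate(test_cases=[test_case], metrics=[metric])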
Triggered from deepeval
The first way to run an experiment is to send over a list of LLMTestCases/ConversationalTestCases. This means you'll have to generate actual_outputs in code, or, for some users, retrieve historical response logs of model generations so they can be sent to Confident AI for evaluation.
from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

test_cases = []

# Replace historical_data with your own source of logged responses
for data in historical_data:
    test_case = LLMTestCase(
        input=data["input"],
        actual_output=data["actual_output"],
        expected_output=data["expected_output"]
    )
    test_cases.append(test_case)

confident_evaluate(
    experiment_name="Your Experiment Name",
    test_cases=test_cases
)
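If you're evaluating a multi-turn use case, you can send over ConversationalTestCases instead. A minimal sketch, assuming a deepeval version where ConversationalTestCase accepts its turns as a list of LLMTestCases (the conversation content is a placeholder):

from deepeval import confident_evaluate
from deepeval.test_case import ConversationalTestCase, LLMTestCase

# Placeholder conversation -- each turn is an LLMTestCase in this sketch
convo_test_case = ConversationalTestCase(
    turns=[
        LLMTestCase(
            input="Do you ship internationally?",
            actual_output="Yes, we ship to over 50 countries."
        )
    ]
)

confident_evaluate(
    experiment_name="Your Experiment Name",
    test_cases=[convo_test_case]
)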
From Confident AI
Once you've created an experiment, all you need to do is set up an LLM connection endpoint. After that, simply press the Evaluate button on the Evaluations page, select your experiment, and press Evaluate!
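An LLM connection endpoint is an HTTP endpoint that Confident AI can call to generate outputs from your LLM application. Below is a minimal sketch of what such an endpoint could look like; the /generate path and the request/response field names are assumptions for illustration only, so follow the exact schema specified in Confident AI's documentation when setting up yours:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    input: str  # hypothetical field name for the prompt sent by Confident AI

class GenerateResponse(BaseModel):
    actual_output: str  # hypothetical field name for your application's response

def your_llm_app(prompt: str) -> str:
    # Placeholder -- call your actual LLM application here
    return "Generated response for: " + prompt

@app.post("/generate")
async def generate(request: GenerateRequest) -> GenerateResponse:
    return GenerateResponse(actual_output=your_llm_app(request.input))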