Experiments
An experiment on Confident AI is a contained way to benchmark LLM applications. You can create an experiment on Confident AI and define evaluation metrics for it to evaluate and test your LLM application's performance at scale. Running an experiment produces a test run, which contains the evaluation results of the test cases your LLM application was evaluated on.
You can evaluate test cases produced by your LLM application directly on Confident AI by simply sending over test cases via deepeval, with fields such as actual_output and retrieval_context populated by your LLM application. All compute and LLMs required for evaluation are provided by Confident AI.
Creating An Experiment
You can easily create an experiment on Confident AI's "Evaluation & Testing" page by providing your experiment with a unique name and a set of metrics to start with. In this RAG use case example, we have named our experiment "RAG Experiment" and have chosen the 'Answer Relevancy' and 'Contextual Relevancy' metrics as a starting point.
You can then edit the metric configurations (such as thresholds), add additional metrics, or even change the experiment name on the individual experiment page once you have created an experiment.
Running An Experiment
To run evaluations on your newly created experiment on Confident AI, simply:
- Create LLMTestCases/ConversationalTestCases (in code) with required fields such as actual_output generated by the LLM application you're trying to evaluate.
- Send the created test cases to Confident AI via deepeval using the confident_evaluate function, supplying the experiment_name in the process.
You must be logged in to Confident AI through deepeval for this to work.
from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

confident_evaluate(
    experiment_name="My First Experiment",
    test_cases=[LLMTestCase(...)]
)
There are two mandatory and one optional arguments when calling the confident_evaluate() function:
- experiment_name: a string that specifies the name of the experiment on Confident AI you wish to evaluate your test cases against.
- test_cases: a list of LLMTestCases/ConversationalTestCases OR an EvaluationDataset. Confident AI will evaluate your LLM application on these test cases using the metrics you defined for this particular experiment.
- disable_browser_opening: a boolean which, when set to True, disables the auto-opening of the browser that brings you to the experiment page of experiment_name.
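For example, a fully populated call might look like the sketch below; the experiment name and test case values are purely illustrative:

from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

# Illustrative values; in practice these come from running your LLM application
user_input = "What is Confident AI?"
actual_output = "Confident AI is a platform for evaluating LLM applications."
retrieval_context = ["Confident AI provides the compute and LLMs needed for evaluation."]

confident_evaluate(
    experiment_name="RAG Experiment",
    test_cases=[
        LLMTestCase(
            input=user_input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    ],
    disable_browser_opening=True,  # stay in the terminal instead of auto-opening the browser
)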
Once an experiment has finished running on Confident AI's infrastructure, a test run will be produced. A test run, as explained in the next section, is the evaluation result of your LLM application based on the metrics defined for the experiment, and is available to view on Confident AI.
Setting Up No-Code Experiment Runs
This is particularly helpful if you wish to enable a no-code evaluation workflow for non-technical users.
You can also set up an LLM endpoint that accepts a POST request over HTTPS to enable users to run evaluations directly on the platform without having to code, and start an evaluation through a click of a button instead. At a high level, you would have to provide Confident AI with the mappings to test case parameters such as the actual_output, retrieval_context, etc., and at evaluation time Confident AI will use the dataset and metrics settings you've specified for your experiment to unit test your LLM application.
Create an LLM Endpoint
In order for Confident AI to reach your LLM application, you'll need to expose your LLM application as a RESTful API endpoint that is accessible over the internet. These are the hard rules you MUST follow when setting up your endpoint:
- Accepts a POST request over HTTPS.
- Returns a JSON response that MUST contain the actual_output value somewhere in the returned JSON object. Whether or not to supply a retrieval_context or tools_called value in your returned JSON is optional, and depends on whether the metrics you have enabled for your experiment require these parameters.
When Confident AI calls your LLM endpoint, it makes a POST request with a request body of this structure:
{
"input": "..."
}
This input will be used to unit test your LLM application, and any JSON response returned will be parsed and used to deduce the values of the remaining test case parameters (e.g. actual_output).
So, it is imperative that your LLM endpoint:
- Parses the incoming data to extract this input value to carry out generation.
- Returns the actual_output and any other LLMTestCase parameters in the JSON response with their correct respective types.
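For illustration, here is a minimal sketch of such an endpoint using FastAPI. The framework choice, route name, and my_rag_app helper are assumptions for this example, not requirements:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EvalRequest(BaseModel):
    input: str

def my_rag_app(query: str) -> tuple[str, list[str]]:
    # Hypothetical RAG pipeline; replace with your own retrieval and generation logic
    retrieval_context = ["Some retrieved chunk relevant to the query."]
    actual_output = f"Answer generated for: {query}"
    return actual_output, retrieval_context

@app.post("/evaluate")
def evaluate(request: EvalRequest):
    actual_output, retrieval_context = my_rag_app(request.input)
    # The key names below are arbitrary; they just need to match the JSON key
    # paths you configure under Project Settings > LLM Connection (next step)
    return {
        "response": {
            "actual output": actual_output,
            "retrieval context": retrieval_context,
        }
    }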
For those who want a recap of what type each test case parameter is, visit the test cases section.
Connect Your LLM Endpoint
Now that you have your endpoint up and running, all that's left is to tell Confident AI how to reach it.
You can set up your LLM connection in Project Settings > LLM Connection. There is also a button for you to ping your LLM endpoint to sanity check that you've set everything up correctly.
You'll have to provide:
- The HTTPS endpoint you've set up.
- The JSON key path to the mandatory actual_output parameter, and to the optional retrieval_context and tools_called parameters. The JSON key path is a list of strings.
In order for evaluation to work, you MUST set the JSON key path for the actual_output parameter. Remember, the actual_output of your LLM application is always required for evaluation, while the retrieval_context and tools_called parameters are optional depending on the metrics you've enabled.
The JSON key path tells Confident AI where to look in your JSON response for the respective test case parameter values.
For instance, if you set the key path of the actual_output parameter to ["response", "actual output"], the correct JSON response to return from your LLM endpoint is as follows:
{
"response": {
"actual output": "must be a string!"
}
}
That's not to say you can't include other fields in your JSON response, but the key paths determine which values Confident AI will use to populate LLMTestCases at evaluation time.
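Conceptually, a key path is just a walk down the nested JSON object, one key at a time. The resolve_key_path helper below is a hypothetical illustration; Confident AI performs this lookup on its own infrastructure:

from functools import reduce

def resolve_key_path(json_response: dict, key_path: list[str]):
    # Follow each key in order to reach the nested value
    return reduce(lambda node, key: node[key], key_path, json_response)

json_response = {"response": {"actual output": "must be a string!"}}
print(resolve_key_path(json_response, ["response", "actual output"]))
# -> must be a string!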
If you're wondering why expected_output, context, and expected_tools are not required when setting up your JSON key paths, it is because these variables are expected to be static, just like the input, and should therefore come from your dataset instead.
Start an Evaluation
You can now head back to an individual Experiment page and press the "Evaluate" button to start an evaluation on Confident AI.
This step might be harder to debug than the others, so you should aim to log any errors on your end and reach out to support@confident-ai.com if you run into issues during setup.