Metrics On The Cloud
In the previous two sections, we saw how you can run evaluations locally using deepeval, which is the recommended approach. However, if you are using another programming language such as TypeScript, or wish to trigger evaluations on the platform directly at the click of a button, you'll need to define and run metrics on Confident AI instead.
There are far more advantages to running evaluations locally via deepeval, the main one being the ease of metric customization. Feel free to skip this section if you're already able to run evaluations locally.
Metrics that run on the cloud still use deepeval, but are simply executed on our Python servers instead. There are two ways to run metrics on the cloud:
- Via an HTTPS request to send over a list of test cases with the generated outputs from your LLM app, or
- On the platform directly, which will be triggered through the click of a button
Ultimately, regardless of your chosen approach, you must first define a collection of metrics to specify which metrics to run on Confident AI.
You should opt for the former unless you have non-technical team members who want to run evaluations on demand. Triggering evaluations via HTTPS is preferred because you don't need to set up a connection for Confident AI to get responses from your LLM endpoint, which greatly reduces complexity.
Creating Your Metric Collection
Creating a collection of metrics on Confident AI allows you to specify which group of metrics you wish to evaluate your LLM application on. The list of available metrics corresponds exactly to deepeval's metrics, which can be found here.
To create a metric collection, go to Metrics > Collections (can be found on the left navigation drawer), click on the "Create Collection" button, and enter a collection name. Your collection name must not already be taken in your project.
Choosing Metrics
There's no hard rule as to how many metric collections you should create or which metrics they should contain, but as a rule of thumb your metric collection should:
- Contain enough metrics to evaluate your LLM app on a focused set of criteria. This typically means 1-5 metrics. If you find yourself approaching 5 metrics, think twice, because you might be better off creating a custom metric that better encapsulates your criteria instead (see the sketch after this list).
- Serve the same purpose. For example, if you find yourself putting the toxicity and answer relevancy metrics in the same collection, think twice, because toxicity is more for safety and compliance testing, while answer relevancy tests the functionality of your LLM app.
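For example, instead of stacking five loosely related metrics, you could collapse your criteria into a single custom metric. Below is a minimal, illustrative sketch using deepeval's GEval (which is also what custom metrics on Confident AI are based on, as covered in the Glossary section); the metric name and criteria here are hypothetical:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# One custom metric that encapsulates several related criteria,
# rather than a collection of 5+ generic metrics. The name and
# criteria below are illustrative only.
helpfulness = GEval(
    name="Helpfulness",
    criteria=(
        "Determine whether the actual output directly answers the input, "
        "stays on topic, and avoids unnecessary filler."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris",
)
helpfulness.measure(test_case)
print(helpfulness.score, helpfulness.reason)
```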
All available metrics on Confident AI use deepeval's metric implementations, but unfortunately, not every metric available in deepeval is available on Confident AI. You can see the list of metrics available on Confident AI on the Metrics > Glossary page.
Configuring Metric Settings
You'll have the option to configure each individual metric's threshold, explainability, and strictness within a collection. Each metric outputs a score ranging from 0 to 1, and there are three settings you can tune:
- Threshold: Determines the minimum evaluation score required for your metric to pass. If a metric fails, the test case also fails.
- Include reason: When turned on, a metric will generate a reason alongside the evaluation score for each metric run.
- Strict mode: When turned on, a metric will only pass if the evaluation score is perfect (i.e. 1.0).
When you add a metric to a collection, these settings are automatically configured for you, so you don't need to worry about them until later on.
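These settings mirror the arguments that deepeval's metrics accept when you run them locally. As a rough illustration (using AnswerRelevancyMetric purely as an example), the same three knobs look like this in code:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# The three collection settings map onto deepeval's metric arguments:
#   Threshold      -> threshold
#   Include reason -> include_reason
#   Strict mode    -> strict_mode
metric = AnswerRelevancyMetric(
    threshold=0.7,        # minimum score (0-1) required to pass
    include_reason=True,  # generate a reason alongside the score
    strict_mode=False,    # if True, only a perfect score of 1.0 passes
)

metric.measure(
    LLMTestCase(input="What is the capital of France?", actual_output="Paris")
)
print(metric.score, metric.is_successful())
```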
Triggering Evaluations Via HTTPS
To trigger an evaluation on the cloud via HTTPS, simply use curl to send test cases to Confident AI for evaluation.
This endpoint is fully functional, but no data validation is done when you use curl instead of Python. Please email support@confident-ai.com or contact your POC if anything isn't working.
```bash
curl -X POST "https://app.confident-ai.com/evaluate" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer CONFIDENT_AI_API_KEY" \
     -d '{
           "metricCollection": "your-metric-collection-name",
           "testCases": [
             {
               "input": "What is the capital of France?",
               "actual_output": "Paris",
               "expected_output": "Paris",
               "context": ["Geography question"],
               "retrieval_context": ["Country capitals"],
               "tools_called": [],
               "expected_tools": []
             }
           ]
         }'
```
Not all parameters in "testCases" are required. You should figure out which parameters your metric collection needs by reading what's required for each metric on the metrics Glossary page.
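For example, if the metrics in your collection only require an input and actual_output, a request with just those fields should be enough. Here's a minimal sketch that mirrors the curl command above using Python's requests library; the exact set of required fields depends entirely on your collection:

```python
import requests

# Minimal sketch: send only the fields your metric collection actually
# requires (assumed here to be "input" and "actual_output").
response = requests.post(
    "https://app.confident-ai.com/evaluate",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer CONFIDENT_AI_API_KEY",
    },
    json={
        "metricCollection": "your-metric-collection-name",
        "testCases": [
            {"input": "What is the capital of France?", "actual_output": "Paris"}
        ],
    },
)
print(response.status_code, response.text)
```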
If you're using Python and still wish to run evaluations on Confident AI rather than locally, here's how you can do it with deepeval:
```python
from deepeval import confident_evaluate

confident_evaluate(metric_collection="your-collection-name", test_cases=[...])
```
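As a slightly fuller sketch, you would build LLMTestCases containing your LLM app's generated outputs and pass them in along with your collection name (the values below are placeholders):

```python
from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

# Placeholder test case: swap in the real inputs and the outputs
# generated by your LLM app.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris",
    expected_output="Paris",
    retrieval_context=["Country capitals"],
)

# Runs the metrics in your Confident AI collection against these test
# cases on Confident AI's servers instead of locally.
confident_evaluate(
    metric_collection="your-metric-collection-name",
    test_cases=[test_case],
)
```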
Triggering Evaluations On Confident AI
All you need to do is set up an LLM connection endpoint. Once this is done, simply press the Evaluate button on the Evaluations page, select your metric collection, and press evaluate!
You should only be doing this if you have non-technical team members who wish to trigger evaluations on the platform without going through code.
Metrics Glossary
Not all metrics available in deepeval are available on Confident AI (although most are), and you can see the list of available metrics on the Metrics > Glossary page. The Glossary page shows the required parameters in an LLMTestCase for each metric to run, as well as an option for you to create custom metrics based on GEval.
Unfortunately, the DAGMetric is not yet supported for self-served creation on Confident AI. If you want to run DAGMetrics on Confident AI, contact support@confident-ai.com.
In the next section, we'll show how you can monitor LLM outputs in production so that you can run evaluations on them.