Metrics On The Cloud

In the previous two sections, we saw how to run evaluations locally using deepeval, which is the recommended approach. However, if you are using another programming language such as TypeScript, or wish to trigger evaluations directly on the platform at the click of a button, you'll need to define and run metrics on Confident AI instead.

note

There are more advantages to running evaluations locally via deepeval, the main one being the ease of metric customization. Feel free to skip this section if you're already able to run evaluations locally.

Metrics run on the cloud still use deepeval, but are simply executed on our Python servers instead. There are two ways to run metrics on the cloud:

  1. Via an HTTPS request that sends over a list of test cases with the generated outputs from your LLM app, or
  2. On the platform directly, triggered through the click of a button

Ultimately, regardless of your chosen approach, you must first define a collection of metrics to specify which metrics to run on Confident AI.

tip

You should probably opt for the former unless you have non-technical team members looking to run evaluations whenever they feel like it. Triggering evaluations via HTTPS is preferred because you don't need to set up a connection for Confident AI to fetch responses from your LLM endpoint, which greatly reduces complexity.

Creating Your Metric Collection

Creating a collection of metrics on Confident AI allows you to specify which group of metrics you wish to evaluate your LLM application on. Every available metric uses deepeval's implementation under the hood; the full list of metrics can be found here.

To create a metric collection, go to Metrics > Collections (found in the left navigation drawer), click the "Create Collection" button, and enter a collection name. The name must not already be taken in your project.

Choosing Metrics

There's no hard rule as to how many metric collections you should create or what the metrics inside should be, but as a rule of thumb your metric collection should:

  • Contain enough metrics to evaluate your LLM app on a focused set of criteria. This typically ranges from 1-5 metrics. If you find yourself approaching 5 metrics, think twice, because you might be better off creating a custom metric that better encapsulates your criteria instead.
  • Serve the same purpose. For example, if you find yourself including both the toxicity and answer relevancy metrics in the same collection, think twice, because toxicity is meant for safety and compliance testing, while answer relevancy tests the functionality of your LLM app.
info

All available metrics on Confident AI use deepeval's metric implementations, but unfortunately not every metric available in deepeval is available on Confident AI.

You can see the list of metrics available on Confident AI at the Metrics > Glossary page.

Configuring Metric Settings

You'll have the option to configure each individual metric's threshold, explainability, and strictness within a collection. There are three settings you can tune:

  • Threshold: Determines the minimum evaluation score required for your metric to pass. If a metric fails, the test case also fails.
  • Include reason: When turned on, a metric will generate a reason alongside the evaluation score for each metric run.
  • Strict mode: When turned on, a metric will pass only if the evaluation score is perfect (i.e. 1.0).

When you add a metric to a collection, these settings are configured automatically with default values, so you don't need to worry about them until later on.

tip

Each metric outputs a score ranging from 0 - 1, and the threshold determines whether your metric passes or fails.
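
For intuition, these settings map to the parameters you'd pass to a deepeval metric when running evaluations locally. Here's a minimal sketch using AnswerRelevancyMetric as an arbitrary example:

from deepeval.metrics import AnswerRelevancyMetric

# Threshold: minimum score (0 - 1) required for the metric to pass
# Include reason: generate a reason alongside the evaluation score
# Strict mode: pass only on a perfect score of 1.0
metric = AnswerRelevancyMetric(
    threshold=0.7,
    include_reason=True,
    strict_mode=False,
)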

Triggering Evaluations Via HTTPS

To trigger an evaluation on the cloud via HTTPS, simply use curl to send test cases to Confident AI for evaluation.

caution

This endpoint is fully functional, but no data validation is performed when you use curl instead of Python. Please email support@confident-ai.com or contact your POC if anything isn't working.

curl -X POST "https://app.confident-ai.com/evaluate" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer CONFIDENT_AI_API_KEY" \
     -d '{
       "metricCollection": "your-metric-collection-name",
       "testCases": [
         {
           "input": "What is the capital of France?",
           "actual_output": "Paris",
           "expected_output": "Paris",
           "context": ["Geography question"],
           "retrieval_context": ["Country capitals"],
           "tools_called": [],
           "expected_tools": []
         }
       ]
     }'

Not all parameters in "testCases" are required. To figure out which parameters your metric collection needs, check what's required for each metric on the metrics Glossary page.
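
If you'd rather send this request from a script than from the command line, here's a minimal sketch of the same call using Python's requests library; the endpoint, headers, and payload mirror the curl example above, so adjust the test case fields to whatever your metric collection requires:

import requests

# Same endpoint and payload shape as the curl example above
response = requests.post(
    "https://app.confident-ai.com/evaluate",
    headers={"Authorization": "Bearer CONFIDENT_AI_API_KEY"},
    json={
        "metricCollection": "your-metric-collection-name",
        "testCases": [
            {
                "input": "What is the capital of France?",
                "actual_output": "Paris",
                "expected_output": "Paris",
            }
        ],
    },
)
print(response.status_code, response.text)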

info

If you're using deepeval and for some reason still decide to run evaluations on Confident AI, here's how you can do it:

from deepeval import confident_evaluate

confident_evaluate(metric_collection="your-collection-name", test_cases=[...])
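
For a fuller sketch of what this might look like with an actual test case (the field values below are the same illustrative ones used in the curl example):

from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

# Each LLMTestCase mirrors the fields sent in the HTTPS payload
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris",
    expected_output="Paris",
    retrieval_context=["Country capitals"],
)

confident_evaluate(
    metric_collection="your-metric-collection-name",
    test_cases=[test_case],
)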

Triggering Evaluations On Confident AI

All you need to do is set up an LLM connection endpoint. Once this is done, simply press the Evaluate button on the Evaluations page, select your metric collection, and press evaluate!

note

You should only be doing this if you have non-technical team members who wish to trigger evaluations on the platform without going through code.

Metrics Glossary

Not all metrics available in deepeval are available on Confident AI (although most are), and you can see the list of available metrics on the Metrics > Glossary page. The Glossary page shows the required parameters in an LLMTestCase for each metric to run, as well as an option for you to create custom metrics based on GEval.
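
For reference, a custom metric based on GEval looks roughly like the sketch below in deepeval terms (the name and criteria here are arbitrary examples); creating one on the platform involves supplying the same ingredients:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# A criteria-based custom metric; GEval derives evaluation steps from the criteria
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)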

note

Unfortunately the DAGMetric is not yet supported for self-served creation on Confident AI. If you want to run DAGMetrics on Confident AI, contact support@confident-ai.com.

In the next section, we'll show how you can monitor LLM outputs in production so that you can run evaluations on them.