
Test Runs

A test run on Confident AI is a collection of evaluated test cases and their corresponding metric scores. There are two ways to produce a test run on Confident AI:

  • running an experiment directly on Confident AI's infrastructure by sending test cases to Confident AI
  • evaluating test cases locally with deepeval and sending the results to Confident AI after evaluation

Running A Test Run on Confident AI

You can evaluate your LLM application on Confident AI's infrastructure by first creating an experiment on Confident AI and defining metrics of your choice, then sending over test cases (with actual_outputs populated by your LLM application) for evaluation.

from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

confident_evaluate(
    experiment_name="My First Experiment",
    test_cases=[LLMTestCase(...)]
)
info

You should refer to the experiments section if you're unsure what an experiment is on Confident AI.

There are two mandatory arguments and one optional argument when calling the confident_evaluate() function:

  • experiment_name: a string that specifies the name of the experiment you wish to evaluate your test cases against on Confident AI.
  • test_cases: a list of LLMTestCases/ConversationalTestCases OR an EvaluationDataset. Confident AI will evaluate your LLM application using the metrics you defined for this particular experiment for all test cases.
  • disable_browser_opening: a boolean which, when set to True, disables the automatic opening of the browser that brings you to the experiment page of experiment_name (see the sketch after this list).
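
Putting these arguments together, here is a minimal sketch of a confident_evaluate() call that uses all three arguments (the experiment name and test case contents are illustrative placeholders):

from deepeval import confident_evaluate
from deepeval.test_case import LLMTestCase

# An illustrative test case; actual_output should come from your own LLM application
test_case = LLMTestCase(
    input="What's your refund policy?",
    actual_output="You can request a refund within 30 days of purchase."
)

confident_evaluate(
    experiment_name="My First Experiment",
    test_cases=[test_case],
    # Handy in headless environments (e.g. CI) where a browser can't be opened
    disable_browser_opening=True
)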
tip

You can also provide a name when creating an LLMTestCase. This is particularly useful if you're importing test cases from an external data source and would like a way to label, identify, and search for them on Confident AI.
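
For example, here is a minimal sketch, assuming your version of deepeval's LLMTestCase accepts a name argument (the field values are hypothetical):

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    # A hypothetical identifier from your external data source
    name="support-ticket-1234",
    input="How do I reset my password?",
    actual_output="Click 'Forgot password' on the login page."
)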

Running A Test Run via deepeval

note

We're going to show a full example that includes the datasets feature from Confident AI. You can similarly apply the confident_evaluate() function to the examples below to run evaluations on Confident AI instead.

Using the evaluate() Function

You can use deepeval's evaluate() function to run a test run. All test run results will be automatically sent to Confident AI if you're logged in.

First, pull the dataset:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(
    alias="Your Dataset Alias",
    # Don't convert goldens to test cases yet
    auto_convert_goldens_to_test_cases=False
)

Since we're going to be generating actual_outputs at evaluation time, we're NOT going to convert the pulled goldens to test cases yet.

info

If you're unfamiliar with what we're doing with an EvaluationDataset, please first visit the datasets section.

With the dataset pulled, we can now generate the actual_outputs for each golden and populate the evaluation dataset with test cases that are ready for evaluation (we're using a hypothetical llm_app to generate actual_outputs in this example):

# A hypothetical LLM application example
import llm_app
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import Golden
...

def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
    test_cases = []
    for golden in goldens:
        test_case = LLMTestCase(
            input=golden.input,
            # Generate actual output using the 'input'.
            # Replace this with your actual LLM application
            actual_output=llm_app.generate(golden.input),
            expected_output=golden.expected_output,
            context=golden.context,
            retrieval_context=golden.retrieval_context
        )
        test_cases.append(test_case)
    return test_cases

# Data preprocessing before setting the dataset test cases
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)

Now that your dataset is ready, define your metrics. Here, we'll be using the AnswerRelevancyMetric to keep things simple:

from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric()

Lastly, using the metric you've defined and the dataset you've pulled and populated with actual_outputs, execute your test run:

from deepeval import evaluate
...

evaluate(dataset, metrics=[metric])

Using the deepeval test run Command

Alternatively, you can execute a test run via deepeval's Pytest integration, which, similar to the evaluate() function, will automatically send all test run results to Confident AI. Here is a full example that is analogous to the one above:

test_llm_app.py
# A hypothetical LLM application example
import llm_app
import pytest
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset, Golden
from deepeval import assert_test

dataset = EvaluationDataset()
dataset.pull(
    alias="Your Dataset Alias",
    # Don't convert goldens to test cases yet
    auto_convert_goldens_to_test_cases=False
)

def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
    test_cases = []
    for golden in goldens:
        test_case = LLMTestCase(
            input=golden.input,
            # Generate actual output using the 'input'.
            # Replace this with your actual LLM application
            actual_output=llm_app.generate(golden.input),
            expected_output=golden.expected_output,
            context=golden.context,
            retrieval_context=golden.retrieval_context
        )
        test_cases.append(test_case)
    return test_cases

# Data preprocessing before setting the dataset test cases
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_llm_app(test_case: LLMTestCase):
    metric = AnswerRelevancyMetric()
    assert_test(test_case, [metric])

And execute your test file via the CLI:

deepeval test run test_llm_app.py
tip

Some users prefer deepeval test run over evaluate() because it integrates well with CI/CD pipelines such as GitHub Actions. Another reason deepeval test run is preferred is that you can easily spin up multiple processes to evaluate test cases in parallel:

deepeval test run test_llm_app.py -n 10

You can learn more about all the functionalities deepeval test run offers here.

In CI/CD Pipelines

You can also execute a test run in CI/CD pipelines by running deepeval test run to regression test your LLM application and have the results sent to Confident AI. Here is an example of how you would do it in GitHub Actions:

.github/workflows/regression.yml
name: LLM Regression Test

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Login to Confident AI
        env:
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval login --confident-api-key "$CONFIDENT_API_KEY"

      - name: Run DeepEval Test Run
        run: poetry run deepeval test run test_llm_app.py
note

You don't necessarily have to use poetry for installation or follow each step exactly as presented. We're merely showing an example of what a sample YAML file for executing a deepeval test run might look like.

Test Runs On Confident AI

All test runs, whether created on Confident AI or locally via deepeval (i.e. evaluate() or deepeval test run), are automatically sent to Confident AI for anyone in your organization to view, share, download, and comment on.

info

Confident AI also keeps track of additional metadata, such as the time taken for a test run to finish executing and the evaluation cost associated with it.

Viewing Test Cases


tip

You can also filter for test case pass/fail status, metric scores, and even custom test case names that you provide when sending over test cases.

Download Test Cases


Comment On Test Runs


Iterating on Hyperparameters

note

This feature is currently only available if you're running evaluations locally via deepeval.

In the first section, we saw how to execute a test run by dynamically generating actual_outputs for pulled datasets at evaluation time. To experiment with different hyperparameter values (for example, to iterate towards the optimal LLM for your LLM application), you should associate hyperparameter values with test runs so you can compare and pick the best hyperparameters on Confident AI.

Continuing from the previous example, if llm_app uses gpt-4o as its LLM, you can associate the gpt-4o value as follows:

...

evaluate(
    test_cases=[test_case],
    metrics=[metric],
    hyperparameters={"model": "gpt-4o", "prompt template": "..."}
)
tip

You can run a nested for loop to generate actual_outputs and get test run evaluation results for all hyperparameter combinations:

for model in models:
    for prompt in prompts:
        evaluate(..., hyperparameters={...})
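
Here is a slightly more concrete sketch of that loop, continuing from the dataset pulled earlier. The models and prompt_templates lists are hypothetical hyperparameter values, and we assume the hypothetical llm_app.generate() accepts model and prompt_template arguments:

# A hypothetical LLM application, as in the earlier examples
import llm_app
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical hyperparameter values to sweep over
models = ["gpt-4o", "gpt-4o-mini"]
prompt_templates = ["You are a helpful assistant. {question}", "Answer concisely: {question}"]

for model in models:
    for prompt_template in prompt_templates:
        # Regenerate actual_outputs for this hyperparameter combination
        test_cases = [
            LLMTestCase(
                input=golden.input,
                actual_output=llm_app.generate(
                    golden.input, model=model, prompt_template=prompt_template
                ),
                expected_output=golden.expected_output,
                context=golden.context,
                retrieval_context=golden.retrieval_context,
            )
            for golden in dataset.goldens
        ]
        evaluate(
            test_cases=test_cases,
            metrics=[AnswerRelevancyMetric()],
            hyperparameters={"model": model, "prompt template": prompt_template},
        )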

Or, if you're using deepeval test run:

test_llm_app.py
...

# You should aim to make these values dynamic
@deepeval.log_hyperparameters(model="gpt-4o", prompt_template="...")
def hyperparameters():
    # Return a custom flat dict to log any additional hyperparameters.
    return {}

This helps Confident AI identify which hyperparameter combination was used to generate each test run's results, which you can then easily filter for and visualize as experiments on the platform.
