Test Cases
Quick Summary
A test case is a blueprint provided by deepeval to unit test LLM outputs. There are two types of test cases in deepeval: LLMTestCase and ConversationalTestCase.
Throughout this documentation, you should assume the term 'test case' refers to an LLMTestCase instead of a ConversationalTestCase.
While a ConversationalTestCase is a list of messages represented by LLMTestCases, an LLMTestCase is the most prominent type of test case in deepeval and is based on seven parameters:
- input
- actual_output
- [Optional] expected_output
- [Optional] context
- [Optional] retrieval_context
- [Optional] tools_called
- [Optional] expected_tools
Here's an example implementation of a test case:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    expected_output="You're eligible for a 30 day refund at no extra cost.",
    actual_output="We offer a 30-day full refund at no extra cost.",
    context=["All customers are eligible for a 30 day full refund at no extra cost."],
    retrieval_context=["Only shoes can be refunded."],
    tools_called=["WebSearch"],
    expected_tools=["WebSearch", "QueryDatabase"]
)
Since deepeval is an LLM evaluation framework, the input and actual_output are always mandatory. However, this does not mean they are necessarily used for evaluation.
Additionally, depending on the specific metric you're evaluating your test cases on, you may or may not require a retrieval_context, expected_output, context, tools_called, and/or expected_tools as additional parameters. For example, you won't need expected_output, context, tools_called, and expected_tools if you're just measuring answer relevancy, but if you're evaluating hallucination you'll have to provide context in order for deepeval to know what the ground truth is.
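The idea that each metric needs its own subset of parameters can be sketched as a small pre-flight check. This is only an illustration: SketchTestCase is a hypothetical stand-in that mirrors the seven parameters (it is not deepeval's class), and the REQUIRED_PARAMS mapping is an assumption based on the two examples above, not a list taken from deepeval.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchTestCase:
    """Hypothetical stand-in mirroring the seven LLMTestCase parameters."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    context: Optional[List[str]] = None
    retrieval_context: Optional[List[str]] = None
    tools_called: Optional[List[str]] = None
    expected_tools: Optional[List[str]] = None


# Assumed mapping for illustration only; check each metric's docs for the real one.
REQUIRED_PARAMS = {
    "answer_relevancy": ["input", "actual_output"],
    "hallucination": ["input", "actual_output", "context"],
}


def missing_params(test_case: SketchTestCase, metric_name: str) -> List[str]:
    """Return the parameters the named metric needs but the test case lacks."""
    return [
        name for name in REQUIRED_PARAMS[metric_name]
        if getattr(test_case, name) is None
    ]


case = SketchTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)
relevancy_missing = missing_params(case, "answer_relevancy")
hallucination_missing = missing_params(case, "hallucination")
print(relevancy_missing, hallucination_missing)
```

Here the test case is sufficient for the answer relevancy entry but is missing context for the hallucination entry.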
LLM Test Case
An LLMTestCase in deepeval can be used to unit test the outputs of an LLM application (which can just be an LLM itself), including use cases such as RAG and LLM agents. With the exception of conversational metrics, which evaluate conversations instead of individual LLM responses, you can use any LLM evaluation metric deepeval offers to evaluate an LLMTestCase.
You cannot use conversational metrics to evaluate an LLMTestCase. Conveniently, most metrics in deepeval are non-conversational.
Keep reading to learn which parameters in an LLMTestCase are required to evaluate different aspects of an LLM application, ranging from pure LLMs and RAG pipelines to LLM agents.
Input
The input mimics a user interacting with your LLM application. The input is the direct input to your prompt template, and so SHOULD NOT CONTAIN your prompt template.
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Why did the chicken cross the road?",
    # Replace this with your actual LLM application
    actual_output="Quite frankly, I don't want to know..."
)
You should NOT include prompt templates as part of a test case because hyperparameters such as prompt templates are independent variables that you try to optimize for based on the metric scores you get from evaluation.
If you're logged into Confident AI, you can associate hyperparameters such as prompt templates with each test run to easily figure out which prompt template gives the best actual_outputs for a given input:
deepeval login
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_llm():
    test_case = LLMTestCase(input="...", actual_output="...")
    answer_relevancy_metric = AnswerRelevancyMetric()
    assert_test(test_case, [answer_relevancy_metric])

# You should aim to make these values dynamic
@deepeval.log_hyperparameters(model="gpt-4o", prompt_template="...")
def hyperparameters():
    # You can also return an empty dict {} if there are no additional parameters to log
    return {
        "temperature": 1,
        "chunk size": 500
    }
deepeval test run test_file.py
Actual Output
The actual_output is simply what your LLM application returns for a given input. This is what your users are going to interact with. Typically, you would import your LLM application (or parts of it) into your test file, and invoke it at runtime to get the actual output.
# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"
test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input)
)
You may also choose to evaluate with precomputed actual_outputs, instead of generating actual_outputs at evaluation time.
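Evaluating with precomputed outputs might look like the sketch below, which builds test cases from records saved by an earlier run. The JSON layout and the SketchTestCase stand-in are assumptions made so the example runs on its own; in a real test file you would construct deepeval's LLMTestCase with the same input and actual_output keyword arguments.

```python
import json
from dataclasses import dataclass


@dataclass
class SketchTestCase:
    """Hypothetical stand-in for LLMTestCase with only the mandatory fields."""
    input: str
    actual_output: str


# Assumed storage format: one record per input/output pair saved earlier.
saved_records = json.loads("""
[
  {"input": "Why did the chicken cross the road?",
   "actual_output": "Quite frankly, I don't want to know..."},
  {"input": "What if these shoes don't fit?",
   "actual_output": "We offer a 30-day full refund at no extra cost."}
]
""")

# No LLM call happens here: the outputs were generated ahead of time.
test_cases = [
    SketchTestCase(input=r["input"], actual_output=r["actual_output"])
    for r in saved_records
]
print(len(test_cases))
```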
Expected Output
The expected_output is what you would want the ideal output to be. Note that this parameter is optional depending on the metric you want to evaluate.
The expected output doesn't have to exactly match the actual output in order for your test case to pass, since deepeval uses a variety of methods to evaluate non-deterministic LLM outputs. We'll go into more detail in the metrics section.
# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"
test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    expected_output="To get to the other side!"
)
Context
The context is an optional parameter that represents additional data received by your LLM application as supplementary sources of golden truth. You can view it as the ideal segment of your knowledge base relevant to a specific input. Context allows your LLM to generate customized outputs that are outside the scope of the data it was trained on.
In RAG applications, contextual information is typically stored in your selected vector database, which is represented by retrieval_context in an LLMTestCase and is not to be confused with context. Conversely, for a fine-tuning use case, this data is usually found in training datasets used to fine-tune your model. Providing the appropriate contextual information when constructing your evaluation dataset is one of the most challenging parts of evaluating LLMs, since data in your knowledge base can constantly be changing.
Unlike input, actual_output, and expected_output, context accepts a list of strings.
# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"
test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    expected_output="To get to the other side!",
    context=["The chicken wanted to cross the road."]
)
People often confuse expected_output with context due to their similar level of factual accuracy. However, while both are (or should be) factually correct, expected_output also takes aspects like tone and linguistic patterns into account, whereas context is strictly factual.
Retrieval Context
The retrieval_context is an optional parameter that represents your RAG pipeline's retrieval results at runtime. By providing retrieval_context, you can determine how well your retriever is performing using context as a benchmark.
# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"
test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    expected_output="To get to the other side!",
    context=["The chicken wanted to cross the road."],
    retrieval_context=["The chicken liked the other side of the road better"]
)
Remember, context is the ideal retrieval result for a given input and typically comes from your evaluation dataset, whereas retrieval_context is your LLM application's actual retrieval results. So, while they might look similar at times, they are not the same.
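To make the difference between the ideal and the actual retrieval results concrete, here is a deliberately naive sketch that treats context as the benchmark and counts how many of its entries appear verbatim (ignoring case and surrounding whitespace) in retrieval_context. The function name is hypothetical, and exact string matching is a simplification chosen for this example; it is not how deepeval's retrieval metrics score a test case.

```python
from typing import List


def naive_retrieval_recall(context: List[str], retrieval_context: List[str]) -> float:
    """Fraction of ideal context entries found verbatim among the retrieved ones."""
    if not context:
        return 0.0
    retrieved = {chunk.strip().lower() for chunk in retrieval_context}
    hits = sum(1 for chunk in context if chunk.strip().lower() in retrieved)
    return hits / len(context)


ideal = ["The chicken wanted to cross the road."]

# The retriever returned something related but not the ideal chunk.
miss_score = naive_retrieval_recall(
    ideal, ["The chicken liked the other side of the road better"]
)

# The retriever returned the ideal chunk alongside an irrelevant one.
hit_score = naive_retrieval_recall(
    ideal, ["Only shoes can be refunded.", "The chicken wanted to cross the road."]
)
print(miss_score, hit_score)
```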
Tools Called
The tools_called parameter is an optional parameter that represents the tools your LLM agent actually invoked during execution. By providing tools_called, you can evaluate how effectively your LLM agent utilized the tools available to it.
# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"
test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    # Replace this with the tools that were actually used
    tools_called=["WebSearch", "DatabaseQuery"]
)
tools_called and expected_tools are LLM test case parameters that are utilized only in agentic evaluation metrics. These parameters allow you to assess the tool usage correctness of your LLM application and ensure that it meets the expected tool usage standards.
Expected Tools
The expected_tools parameter is an optional parameter that represents the tools that ideally should have been used to generate the output. By providing expected_tools, you can assess whether your LLM application used the tools you anticipated for optimal performance.
# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"
test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    # Replace this with the tools that were actually used
    tools_called=["WebSearch", "DatabaseQuery"],
    expected_tools=["DatabaseQuery"]
)
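One simple way to compare the two lists is to measure what share of the expected tools was actually called, and to note which called tools were not expected. The sketch below is an assumption about how such a comparison could work; the function names are hypothetical and deepeval's agentic metrics may score tool usage differently.

```python
from typing import List


def expected_tool_coverage(tools_called: List[str], expected_tools: List[str]) -> float:
    """Share of expected tools that appear in the tools actually called."""
    if not expected_tools:
        return 1.0  # nothing was expected, so nothing is missing
    called = set(tools_called)
    return sum(1 for tool in expected_tools if tool in called) / len(expected_tools)


def unexpected_tools(tools_called: List[str], expected_tools: List[str]) -> List[str]:
    """Tools that were called even though they were not expected."""
    return sorted(set(tools_called) - set(expected_tools))


# Values from the example directly above.
coverage_full = expected_tool_coverage(["WebSearch", "DatabaseQuery"], ["DatabaseQuery"])
extras = unexpected_tools(["WebSearch", "DatabaseQuery"], ["DatabaseQuery"])

# Values from the first example on this page.
coverage_half = expected_tool_coverage(["WebSearch"], ["WebSearch", "QueryDatabase"])
print(coverage_full, extras, coverage_half)
```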
Conversational Test Case
A ConversationalTestCase in deepeval is simply a list of Messages, and each Message is composed of an LLMTestCase. While an LLMTestCase represents an individual LLM interaction, a ConversationalTestCase encapsulates a series of LLMTestCases that make up an LLM-based conversation. This is particularly useful if you want to evaluate, for example, a conversation between a user and an LLM-based chatbot.
While you cannot use a conversational metric on an LLMTestCase, a ConversationalTestCase can be evaluated using both non-conversational and conversational metrics.
from deepeval.test_case import LLMTestCase, ConversationalTestCase, Message

llm_test_case = LLMTestCase(
    # Replace this with your user input
    input="Why did the chicken cross the road?",
    # Replace this with your actual LLM application
    actual_output="Quite frankly, I don't want to know..."
)
test_case = ConversationalTestCase(
    messages=[Message(llm_test_case=llm_test_case)]
)
You can apply both non-conversational and conversational metrics to a ConversationalTestCase. Non-conversational metrics (which are otherwise used for individual LLMTestCases), when applied to a ConversationalTestCase, will evaluate each Message in the ConversationalTestCase individually, but only those Messages whose should_evaluate is True.
Similar to how the term 'test case' refers to an LLMTestCase if not explicitly specified, the term 'metrics' also refers to non-conversational metrics throughout deepeval.
Message
A Message in deepeval is what makes up a ConversationalTestCase. It has the following parameters:
- llm_test_case: an LLMTestCase. Conversational metrics will evaluate a conversation based on the provided LLMTestCase for each Message.
- should_evaluate: a boolean which, when set to True, enables the evaluation of said llm_test_case via non-conversational deepeval metrics, as if it were an individual LLMTestCase. When not specified, should_evaluate defaults to False for all Messages except for the last one in a ConversationalTestCase. Its only purpose is to determine whether the Message's llm_test_case should be evaluated individually when a non-conversational metric is applied; it has no bearing on whether the Message is evaluated or considered for evaluation by conversational metrics.
from deepeval.test_case import LLMTestCase, Message
message = Message(llm_test_case=LLMTestCase(...), should_evaluate=True)
Most metrics in deepeval are non-conversational metrics. should_evaluate defaults to True for the last message in a ConversationalTestCase because users often prefer evaluating the next best LLM response given the previous conversation context, instead of all Messages in a ConversationalTestCase.
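The defaulting rule for should_evaluate can be sketched as follows. SketchMessage and resolve_should_evaluate are hypothetical names used only for this illustration, and None is assumed here to stand for "not specified"; deepeval's own Message class may represent that differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchMessage:
    """Hypothetical stand-in for Message; text replaces the wrapped test case."""
    text: str
    should_evaluate: Optional[bool] = None  # None is assumed to mean "not specified"


def resolve_should_evaluate(messages: List[SketchMessage]) -> List[bool]:
    """Apply the rule above: unspecified flags are False except on the last message."""
    last_index = len(messages) - 1
    return [
        m.should_evaluate if m.should_evaluate is not None else (i == last_index)
        for i, m in enumerate(messages)
    ]


conversation = [
    SketchMessage("Hi there", should_evaluate=True),  # explicitly opted in
    SketchMessage("How can I help?"),                 # unspecified, not last
    SketchMessage("What is your refund policy?"),     # unspecified, last
]
flags = resolve_should_evaluate(conversation)
print(flags)
```

Under these assumptions, a non-conversational metric would evaluate the first and last messages individually and skip the middle one.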
Assert A Test Case
Before we go through the final sections, we highly recommend that you log in to Confident AI (the platform powering deepeval) via the CLI. This way, you can keep track of all evaluation results generated each time you execute deepeval test run.
deepeval login
Similar to Pytest, deepeval allows you to assert any test case you create by calling the assert_test function and running deepeval test run via the CLI.
A test case passes only if all metrics pass. Depending on the metric, a combination of input, actual_output, expected_output, context, and retrieval_context is used to ascertain whether its criteria have been met.
# A hypothetical LLM application example
import chatbot
import deepeval
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_assert_example():
    input = "Why did the chicken cross the road?"
    test_case = LLMTestCase(
        input=input,
        actual_output=chatbot.run(input),
        context=["The chicken wanted to cross the road."],
    )
    metric = HallucinationMetric(threshold=0.7)
    assert_test(test_case, metrics=[metric])

# Optional. Log hyperparameters to pick the best hyperparameter for your LLM application
# using Confident AI. (run `deepeval login` in the CLI to login)
@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
    # Return a dict to log additional hyperparameters.
    # You can also return an empty dict {} if there are no additional parameters to log
    return {
        "temperature": 1,
        "chunk size": 500
    }
There are two mandatory parameters and one optional parameter when calling the assert_test() function:
- test_case: an LLMTestCase
- metrics: a list of metrics of type BaseMetric
- [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics. Defaulted to True.
The run_async parameter overrides the async_mode property of all metrics being evaluated. The async_mode property, as you'll learn later in the metrics section, determines whether each metric can execute asynchronously.
To execute the test cases, run deepeval test run via the CLI, which uses deepeval's Pytest integration under the hood to execute these tests. You can also include an optional -n flag followed by a number (which determines the number of processes that will be used) to run tests in parallel.
deepeval test run test_assert_example.py -n 4
Evaluate Test Cases in Bulk
Lastly, deepeval offers an evaluate function to evaluate multiple test cases at once, which is similar to assert_test but without the need for Pytest or the CLI.
# A hypothetical LLM application example
import chatbot
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input = "Why did the chicken cross the road?"
test_case = LLMTestCase(
    input=input,
    actual_output=chatbot.run(input),
    context=["The chicken wanted to cross the road."],
)
metric = HallucinationMetric(threshold=0.7)
evaluate([test_case], [metric])
There are two mandatory and eight optional arguments when calling the evaluate() function:
- test_cases: a list of LLMTestCases/ConversationalTestCases, or an EvaluationDataset
- metrics: a list of metrics of type BaseMetric
- [Optional] hyperparameters: a dict of type dict[str, Union[str, int, float]]. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI.
- [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of test cases AND metrics. Defaulted to True.
- [Optional] throttle_value: an integer that determines how long (in seconds) to throttle the evaluation of each test case. You can increase this value if your evaluation model is running into rate limit errors. Defaulted to 0.
- [Optional] ignore_errors: a boolean which, when set to True, ignores all exceptions raised during metrics execution for each test case. Defaulted to False.
- [Optional] verbose_mode: an optional boolean which, when it IS NOT None, overrides each metric's verbose_mode value. Defaulted to None.
- [Optional] write_cache: a boolean which, when set to True, writes test run results to DISK. Defaulted to True.
- [Optional] use_cache: a boolean which, when set to True, uses cached test run results instead. Defaulted to False.
- [Optional] show_indicator: a boolean which, when set to True, shows the evaluation progress indicator for each individual metric. Defaulted to True.
- [Optional] print_results: a boolean which, when set to True, prints the result of each evaluation. Defaulted to True.
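To give a feel for what throttle_value and ignore_errors describe, here is a rough sequential sketch of an evaluation loop. It is not deepeval's implementation: run_all and fake_metric are hypothetical, the loop ignores run_async entirely, and recording None for a failed test case is just one possible way to represent an ignored error.

```python
import time
from typing import Callable, List, Optional


def run_all(
    test_inputs: List[str],
    metric_fn: Callable[[str], float],
    throttle_value: int = 0,
    ignore_errors: bool = False,
) -> List[Optional[float]]:
    """Score each input in turn, pausing between items and optionally skipping errors."""
    scores: List[Optional[float]] = []
    for index, item in enumerate(test_inputs):
        try:
            scores.append(metric_fn(item))
        except Exception:
            if not ignore_errors:
                raise  # default behaviour: the first error stops the run
            scores.append(None)  # error ignored, no score recorded
        # Pause between test cases to stay under a rate limit.
        if throttle_value > 0 and index < len(test_inputs) - 1:
            time.sleep(throttle_value)
    return scores


def fake_metric(text: str) -> float:
    """Hypothetical metric that fails on empty output."""
    if not text:
        raise ValueError("empty output")
    return float(len(text))


scores = run_all(["ab", "", "abc"], fake_metric, ignore_errors=True)
print(scores)
```

With ignore_errors left at False, the same call would raise on the empty string instead of returning a partial list.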
Similar to assert_test, evaluate allows you to log and view test results and the hyperparameters associated with each on Confident AI.
deepeval login
from deepeval import evaluate
...
evaluate(
    test_cases=[test_case],
    metrics=[metric],
    hyperparameters={"model": "gpt4o", "prompt template": "..."}
)
For more examples of evaluate, visit the datasets section.
Labeling Test Cases for Confident AI
If you're using Confident AI, the optional name parameter allows you to provide a string identifier to label LLMTestCases and ConversationalTestCases so that you can easily search and filter for them on Confident AI. This is particularly useful if you're importing test cases from an external data source.
from deepeval.test_case import LLMTestCase, ConversationalTestCase
test_case = LLMTestCase(name="my-external-unique-id", ...)
convo_test_case = ConversationalTestCase(name="my-external-unique-id", ...)