Answer Relevancy
The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input. deepeval's answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.
Here is a detailed guide on RAG evaluation, which we highly recommend as it explains everything about deepeval's RAG metrics.
Required Arguments
To use the AnswerRelevancyMetric, you'll have to provide the following arguments when creating an LLMTestCase:
- input
- actual_output
Example
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."
metric = AnswerRelevancyMetric(
threshold=0.7,
model="gpt-4",
include_reason=True
)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
actual_output=actual_output
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
# or evaluate test cases in bulk
evaluate([test_case], [metric])
There are six optional parameters when creating an AnswerRelevancyMetric:
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
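For reference, here is a minimal sketch of an AnswerRelevancyMetric constructed with all six optional parameters spelled out at the defaults documented above:
metric = AnswerRelevancyMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
    strict_mode=False,
    async_mode=True,
    verbose_mode=False
)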
How Is It Calculated?
The AnswerRelevancyMetric score is calculated according to the following equation:
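$$
\text{Answer Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}
$$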
The AnswerRelevancyMetric first uses an LLM to extract all statements made in the actual_output, before using the same LLM to classify whether each statement is relevant to the input.
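As a rough illustration (not deepeval's internal implementation), the score is simply the fraction of extracted statements that the evaluating LLM judges relevant to the input:
# Hypothetical per-statement relevancy verdicts returned by the evaluating LLM
verdicts = ["yes", "no", "yes", "yes"]
score = sum(v == "yes" for v in verdicts) / len(verdicts)
print(score)  # 0.75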
You can set the verbose_mode of ANY deepeval metric to True to debug the measure() method:
...
metric = AnswerRelevancyMetric(verbose_mode=True)
metric.measure(test_case)