Summarization
The summarization metric uses LLM-as-a-judge to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text. In a summarization task within `deepeval`, the original text refers to the `input` while the summary is the `actual_output`.
The `SummarizationMetric` is the only default metric in `deepeval` that is not cacheable.
Required Arguments
To use the `SummarizationMetric`, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`

The `input` and `actual_output` are required to create an `LLMTestCase` (and hence required by all metrics) even though they might not be used for metric calculation. Read the How Is It Calculated section below to learn more.
Example
Let's take this `input` and `actual_output` as an example:
```python
# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary, replace this with the actual output from your LLM application
actual_output = """
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""
```
You can use the `SummarizationMetric` as follows:
```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
...

test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=[
        "Is the coverage score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?"
    ]
)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])
```
There are NINE optional parameters when instantiating a `SummarizationMetric` class:

- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `assessment_questions`: a list of closed-ended questions that can be answered with either a 'yes' or a 'no'. These are questions you ideally want your summary to be able to answer, and they are especially helpful if you already know what a good summary for your use case looks like. If `assessment_questions` is not provided, we will generate a set of `assessment_questions` for you at evaluation time. The `assessment_questions` are used to calculate the `coverage_score`.
- [Optional] `n`: the number of assessment questions to generate when `assessment_questions` is not provided. Defaulted to 5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a strict evaluation criterion. In strict mode, the metric score becomes binary: a score of 1 indicates a perfect result, and any outcome less than perfect is scored as 0. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to `False`.
- [Optional] `truths_extraction_limit`: an int which when set, determines the maximum number of factual truths to extract from the `input`. The truths extracted will be used to determine the `alignment_score`, and will be ordered by importance, as decided by your evaluation `model`. Defaulted to `None`.
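For instance, here is a minimal sketch combining a few of these optional parameters; the values chosen are arbitrary and only for illustration:

```python
from deepeval.metrics import SummarizationMetric

metric = SummarizationMetric(
    threshold=0.6,        # raise the passing bar
    n=3,                  # generate 3 assessment questions since none are provided
    include_reason=True,  # return a reason alongside the score
    verbose_mode=True,    # print intermediate calculation steps
)
```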
Sometimes, you may want to only consider the most important factual truths in the `input`. If this is the case, you can choose to set the `truths_extraction_limit` parameter to limit the maximum number of truths to consider during evaluation.
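For example, the sketch below caps extraction at the five most important truths; the specific value is arbitrary:

```python
from deepeval.metrics import SummarizationMetric

# Only the 5 most important truths (as ranked by your evaluation model)
# will be extracted from the input and used for the alignment_score
metric = SummarizationMetric(
    threshold=0.5,
    truths_extraction_limit=5,
)
```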
As a standalone
You can also run the `SummarizationMetric` on a single test case as a standalone, one-off execution.
```python
...
metric.measure(test_case)
print(metric.score, metric.reason)
```
This is great for debugging or if you wish to build your own evaluation pipeline, but you will NOT get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) the `evaluate()` function or `deepeval test run` offers.
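If you do go the `deepeval test run` route, the metric is typically wrapped with `assert_test` inside a pytest-style test file. Below is a minimal sketch; the file name and placeholder strings are illustrative:

```python
# test_summarization.py (run with: deepeval test run test_summarization.py)
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric

def test_summarization():
    test_case = LLMTestCase(
        input="(original text to be summarized)",
        actual_output="(summary produced by your LLM application)",
    )
    metric = SummarizationMetric(threshold=0.5)
    # Fails the test if the summarization score falls below the threshold
    assert_test(test_case, [metric])
```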
How Is It Calculated?
The `SummarizationMetric` score is calculated according to the following equation:

$$\text{Summarization Score} = \min(\text{Alignment Score}, \text{Coverage Score})$$

To break it down, the:

- `alignment_score` determines whether the summary contains hallucinated or contradictory information to the original text.
- `coverage_score` determines whether the summary contains the necessary information from the original text.
While the `alignment_score` is similar to that of the `HallucinationMetric`, the `coverage_score` is first calculated by generating `n` closed-ended questions that can only be answered with either a 'yes' or a 'no', before calculating the ratio of questions for which the original text and the summary yield the same answer. Here is a great article on how `deepeval`'s summarization metric was built.
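To make that ratio concrete, here is a rough, illustrative sketch of a coverage-style calculation. It is not `deepeval`'s internal implementation: the question-answering step is replaced by precomputed 'yes'/'no' answers, and the helper name is made up for this example.

```python
from typing import List

def illustrative_coverage_score(
    original_answers: List[str],  # 'yes'/'no' answers derived from the original text
    summary_answers: List[str],   # 'yes'/'no' answers derived from the summary
) -> float:
    """Fraction of assessment questions where the summary gives the
    same answer as the original text (illustration only)."""
    agreements = sum(
        1 for orig, summ in zip(original_answers, summary_answers) if orig == summ
    )
    return agreements / len(original_answers)

# e.g. 3 generated questions, the summary agrees with the original on 2 of them
print(illustrative_coverage_score(["yes", "yes", "no"], ["yes", "no", "no"]))  # 0.666...
```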
You can access the `alignment_score` and `coverage_score` from a `SummarizationMetric` as follows:
```python
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(...)
metric = SummarizationMetric(...)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.score_breakdown)
```
Since the summarization score is the minimum of the `alignment_score` and `coverage_score`, a 0 value for either one of these scores will result in a final summarization score of 0.