Summarization
The summarization metric uses LLM-as-a-judge to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text. In a summarization task within `deepeval`, the original text refers to the `input` while the summary is the `actual_output`.
The `SummarizationMetric` is the only default metric in `deepeval` that is not cacheable.
Required Arguments
To use the `SummarizationMetric`, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`

The `input` and `actual_output` are required to create an `LLMTestCase` (and hence required by all metrics) even though they might not be used for metric calculation. Read the How Is It Calculated section below to learn more.
Example
Let's take this `input` and `actual_output` as an example:
```python
# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary, replace this with the actual output from your LLM application
actual_output = """
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""
```
You can use the `SummarizationMetric` as follows:
```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric
...

test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=[
        "Is the coverage score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?"
    ]
)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])
```
There are NINE optional parameters when instantiating a `SummarizationMetric` class:

- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `assessment_questions`: a list of closed-ended questions that can be answered with either a 'yes' or a 'no'. These are questions you ideally want your summary to be able to answer, and they are especially helpful if you already know what a good summary for your use case looks like. If `assessment_questions` is not provided, we will generate a set of `assessment_questions` for you at evaluation time. The `assessment_questions` are used to calculate the `coverage_score`.
- [Optional] `n`: the number of assessment questions to generate when `assessment_questions` is not provided. Defaulted to 5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a strict evaluation criterion. In strict mode, the metric score becomes binary: a score of 1 indicates a perfect result, and any outcome less than perfect is scored as 0. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to `False`.
- [Optional] `truths_extraction_limit`: an int which when set, determines the maximum number of factual truths to extract from the `input`. The truths extracted will be used to determine the `alignment_score`, and will be ordered by importance, as decided by your evaluation `model`. Defaulted to `None`.
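For instance, here is a minimal sketch combining a few of these optional parameters; the values chosen are arbitrary and only for illustration:

```python
from deepeval.metrics import SummarizationMetric

metric = SummarizationMetric(
    threshold=0.6,        # raise the passing bar
    n=3,                  # generate 3 assessment questions since none are provided
    include_reason=True,  # return a reason alongside the score
    verbose_mode=True,    # print intermediate calculation steps
)
```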
Sometimes, you may want to only consider the most important factual truths in the `input`. If this is the case, you can choose to set the `truths_extraction_limit` parameter to limit the maximum number of truths to consider during evaluation.
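For example, the sketch below caps extraction at the five most important truths; the specific value is arbitrary:

```python
from deepeval.metrics import SummarizationMetric

# Only the 5 most important truths (as ranked by your evaluation model)
# will be extracted from the input and used for the alignment_score
metric = SummarizationMetric(
    threshold=0.5,
    truths_extraction_limit=5,
)
```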
As a standalone
You can also run the `SummarizationMetric` on a single test case as a standalone, one-off execution.
```python
...
metric.measure(test_case)
print(metric.score, metric.reason)
```
This is great for debugging or if you wish to build your own evaluation pipeline, but you will NOT get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) the `evaluate()` function or `deepeval test run` offers.
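If you do go the `deepeval test run` route, the metric is typically wrapped with `assert_test` inside a pytest-style test file. Below is a minimal sketch; the file name and placeholder strings are illustrative:

```python
# test_summarization.py (run with: deepeval test run test_summarization.py)
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric

def test_summarization():
    test_case = LLMTestCase(
        input="(original text to be summarized)",
        actual_output="(summary produced by your LLM application)",
    )
    metric = SummarizationMetric(threshold=0.5)
    # Fails the test if the summarization score falls below the threshold
    assert_test(test_case, [metric])
```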
How Is It Calculated?
The `SummarizationMetric` score is calculated according to the following equation:

$$\text{Summarization Score} = \min(\text{Alignment Score}, \text{Coverage Score})$$

To break it down, the:

- `alignment_score` determines whether the summary contains hallucinated or contradictory information to the original text.
- `coverage_score` determines whether the summary contains the necessary information from the original text.
While the `alignment_score` is similar to that of the `HallucinationMetric`, the `coverage_score` is first calculated by generating `n` closed-ended questions that can only be answered with either a 'yes' or a 'no', before calculating the ratio of questions for which the original text and the summary yield the same answer. Here is a great article on how `deepeval`'s summarization metric was built.
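To make that ratio concrete, here is a rough, illustrative sketch of a coverage-style calculation. It is not `deepeval`'s internal implementation: the question-answering step is replaced by precomputed 'yes'/'no' answers, and the helper name is made up for this example.

```python
from typing import List

def illustrative_coverage_score(
    original_answers: List[str],  # 'yes'/'no' answers derived from the original text
    summary_answers: List[str],   # 'yes'/'no' answers derived from the summary
) -> float:
    """Fraction of assessment questions where the summary gives the
    same answer as the original text (illustration only)."""
    agreements = sum(
        1 for orig, summ in zip(original_answers, summary_answers) if orig == summ
    )
    return agreements / len(original_answers)

# e.g. 3 generated questions, the summary agrees with the original on 2 of them
print(illustrative_coverage_score(["yes", "yes", "no"], ["yes", "no", "no"]))  # 0.666...
```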
You can access the `alignment_score` and `coverage_score` from a `SummarizationMetric` as follows:
```python
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(...)
metric = SummarizationMetric(...)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.score_breakdown)
```
Since the summarization score is the minimum of the `alignment_score` and `coverage_score`, a 0 value for either one of these scores will result in a final summarization score of 0.