Summarization
The summarization metric uses LLMs to determine whether your LLM application is generating factually correct summaries while including the necessary details from the original text. In a summarization task within `deepeval`, the original text refers to the `input`, while the summary is the `actual_output`.
The `SummarizationMetric` is the only default metric in `deepeval` that is not cacheable.
Required Arguments
To use the `SummarizationMetric`, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`
Example
Let's take this `input` and `actual_output` as an example:
```python
# This is the original text to be summarized
input = """
The 'coverage score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher coverage score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary, replace this with the actual output from your LLM application
actual_output = """
The coverage score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""
```
You can use the `SummarizationMetric` as follows:
```python
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=[
        "Is the coverage score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?"
    ]
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])
```
There are nine optional parameters when instantiating a `SummarizationMetric` class:
- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `assessment_questions`: a list of closed-ended questions that can be answered with either a 'yes' or a 'no'. These are questions you ideally want your summary to be able to answer, and they are especially helpful if you already know what a good summary for your use case looks like. If `assessment_questions` is not provided, we will generate a set of `assessment_questions` for you at evaluation time. The `assessment_questions` are used to calculate the `coverage_score`.
- [Optional] `n`: the number of assessment questions to generate when `assessment_questions` is not provided. Defaulted to 5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a strict evaluation criterion. In strict mode, the metric score becomes binary: a score of 1 indicates a perfect result, and any outcome less than perfect is scored as 0. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to `False`.
- [Optional] `truths_extraction_limit`: an int which when set, determines the maximum number of factual truths to extract from the `input`. The truths extracted will be used to determine the `alignment_score`, and will be ordered by importance, as decided by your evaluation `model`. Defaulted to `None`.
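As a quick illustration, the snippet below instantiates the metric with a few of these optional parameters; the specific values are arbitrary choices for demonstration, not recommendations.

```python
from deepeval.metrics import SummarizationMetric

# Arbitrary example values for the optional parameters documented above
metric = SummarizationMetric(
    threshold=0.6,        # passing threshold
    n=10,                 # number of assessment questions to auto-generate
    include_reason=True,  # also produce a written reason for the score
    strict_mode=False,    # keep the score continuous rather than binary
    verbose_mode=True,    # print intermediate calculation steps
)
```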
Sometimes, you may want to only consider the most important factual truths in the `input`. If this is the case, you can choose to set the `truths_extraction_limit` parameter to limit the maximum number of truths to consider during evaluation.
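For example (the limit of 5 below is an arbitrary value chosen for illustration):

```python
from deepeval.metrics import SummarizationMetric

# Only the 5 most important truths extracted from the input are used
# when computing the alignment_score (5 is an arbitrary example value)
metric = SummarizationMetric(truths_extraction_limit=5)
```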
How Is It Calculated?
The `SummarizationMetric` score is calculated according to the following equation:

Summarization Score = min(Alignment Score, Coverage Score)

To break it down, the:

- `alignment_score` determines whether the summary contains hallucinated information or information that contradicts the original text.
- `coverage_score` determines whether the summary contains the necessary information from the original text.
While the `alignment_score` is similar to that of the `HallucinationMetric`, the `coverage_score` is calculated by first generating `n` closed-ended questions that can only be answered with either a 'yes' or a 'no', before calculating the ratio of questions for which the original text and the summary yield the same answer. Here is a great article on how `deepeval`'s summarization metric was built.
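To make the coverage idea concrete, here is a rough sketch (not `deepeval`'s actual implementation), assuming the definition from the example text above: the coverage score is the fraction of assessment questions for which both the original text and the summary answer 'yes'.

```python
from typing import List

def coverage_score_sketch(source_answers: List[str], summary_answers: List[str]) -> float:
    """Rough sketch only: the fraction of assessment questions for which both
    the original text and the summary yield a 'yes' answer."""
    assert len(source_answers) == len(summary_answers)
    both_yes = sum(
        src == "yes" and summ == "yes"
        for src, summ in zip(source_answers, summary_answers)
    )
    return both_yes / len(source_answers)

# 3 assessment questions; both texts answer 'yes' to 2 of them -> ~0.67
print(coverage_score_sketch(["yes", "yes", "yes"], ["yes", "no", "yes"]))
```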
You can access the `alignment_score` and `coverage_score` from a `SummarizationMetric` as follows:
```python
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(...)
metric = SummarizationMetric(...)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.score_breakdown)
```
Since the summarization score is the minimum of the `alignment_score` and `coverage_score`, a 0 value for either one of these scores will result in a final summarization score of 0.
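For instance, a summary that is perfectly aligned but covers none of the required information still fails outright (the numbers below are illustrative only):

```python
# Illustrative values only
alignment_score = 0.9
coverage_score = 0.0

summarization_score = min(alignment_score, coverage_score)
print(summarization_score)  # 0.0
```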