Defining Evaluation Criteria for Summarization
Before selecting your metrics, you'll first need to define your evaluation criteria. In other words, identify which aspects of the summaries generated by your LLM matter to you: what makes a summary good and what makes it bad. These priorities will shape the criteria you use to evaluate your LLM.
Well-defined evaluation criteria make it much easier to choose the right metrics for assessing your LLM summarizer.
For example, if clarity is a priority when summarizing lengthy and complex legal documents, you should choose metrics like conciseness, which assess how easily the summaries can be understood.
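One lightweight way to keep this mapping explicit is a plain data structure. In the sketch below, both the criteria and the metric names are illustrative placeholders, not tied to any particular evaluation library:
# Illustrative mapping from evaluation criteria to candidate metrics.
# Both the criteria and the metric names are placeholders; substitute
# whatever your use case and metric library actually call them.
criteria_to_metrics = {
    "clarity": ["conciseness", "readability"],
    "completeness": ["coverage", "faithfulness"],
}

for criterion, metrics in criteria_to_metrics.items():
    print(f"{criterion}: candidate metrics -> {', '.join(metrics)}")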
Generating Dummy Summaries
If you don't already have evaluation criteria, generating summaries from a few randomly selected documents can help you identify which aspects matter most to you. For example, consider this service agreement contract (approximately three pages), which has been shortened for the sake of this example:
document_content = """
CONTRACT FOR SERVICES
This Service Agreement ("Agreement") is entered into on January 28, 2025, by and between Acme Solutions, Inc. ("Provider"), a corporation registered in Delaware, and BetaCorp LLC ("Client"), a limited liability company registered in California.
1. SERVICES: Provider agrees to perform software development and consulting services for Client as outlined in Exhibit A. Services will commence on February 1, 2025, and are expected to conclude by August 1, 2025, unless extended in writing.
2. COMPENSATION: Client shall pay Provider a fixed fee of $50,000, payable in five equal installments of $10,000 due on the first of each month starting February 1, 2025.
...
...
...
Signed,
Acme Solutions, Inc.
BetaCorp LLC
"""
Let's run the following code to generate the summary:
# llm can be the wrapper sketched above, or your own LLM summarizer
summary = llm.summarize(document_content)
print(summary)
This yields the following results:
This agreement establishes a business relationship between Acme Solutions, Inc.,
a Delaware corporation, and BetaCorp LLC, a company based in California.
It specifies that Acme Solutions will provide software development and consulting
services to BetaCorp for a defined period, beginning February 1, 2025, and
potentially ending on August 1, 2025. The document includes details about the
responsibilities of each party, confidentiality obligations, and the termination
process, which requires 30 days' written notice. Additionally, it states that
California law will govern the agreement. However, no details are included regarding
the payment structure.
Immediately, you can see two issues with the generated summary.
First, it's too lengthy. The legal document summarizer we're building is designed to help lawyers work efficiently, so keeping summaries concise is essential. Second, the summary omits the compensation details, which is a significant problem. A complete summary (one in which no information from the document is lost) is crucial for ensuring that lawyers don't miss vital information in their fast-paced line of work.
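Even before choosing formal metrics, cheap heuristic checks can flag both problems. Here's a rough sketch; the 60-word threshold and the keyword list are arbitrary assumptions, not calibrated values:
# Rough heuristic checks for the two issues above. The 60-word threshold
# and the keyword list are arbitrary assumptions, not calibrated values.
def quick_checks(summary):
    word_count = len(summary.split())
    mentions_compensation = any(
        keyword in summary.lower()
        for keyword in ("compensation", "fee", "installment")
    )
    return {
        "too_long": word_count > 60,
        "missing_compensation": not mentions_compensation,
    }

print(quick_checks(summary))
# e.g. {'too_long': True, 'missing_compensation': True}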
Generating even more summaries can help reveal additional issues with your LLM summarizer, such as fluency problems (especially in non-English languages) or hallucinations that don't show up in every generation.
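One way to do this is to sample a handful of documents and summarize each in a loop; sample_documents below is a placeholder list that you'd populate with your own contracts:
import random

# Placeholder corpus; populate with your own raw document strings.
sample_documents = [document_content]

# Summarize a small random sample and review the outputs side by side,
# looking for recurring issues such as fluency problems or hallucinations.
for doc in random.sample(sample_documents, k=min(5, len(sample_documents))):
    print(llm.summarize(doc))
    print("-" * 40)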
Defining Your Evaluation Criteria
From generating a single summary, we've already identified two key points that matter for our LLM summarizer:
- The summary must be concise.
- The summary must be complete.
These points define our evaluation criteria. In practice, you'll want to test your summarizer with as many documents as possible. The more examples you run, the more patterns and gaps you'll uncover, helping you refine and build a comprehensive set of evaluation criteria.
Your evaluation criteria are not set in stone. As your LLM application moves into production, ongoing user feedback will be essential for refining them, and that feedback ultimately matters more than your initial priorities.
Next, let’s explore how to go from evaluation criteria to choosing metrics.