Running an Evaluation
With our metrics chosen, we can begin running evaluations. To do so, we'll need to construct a dataset from the documents we want to summarize and generate a summary for each one using our LLM summarizer. This lets us apply our metrics directly to the dataset when running evaluations.
You'll want to log in to Confident AI before running an evaluation so that your evaluation results are saved and can be analyzed in a report format.
deepeval login
Constructing a Dataset
If you're building a document summarizer, you may already have a folder of documents or PDFs waiting to be summarized. If that's the case, you'll first want to parse these PDFs into strings that can be passed into your LLM summarizer. Here's how you can do that using PyPDF2.
import os
import PyPDF2

def extract_text(pdf_path):
    """Extract text from a PDF file."""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Extract each page once and skip pages that yield no text
        page_texts = (page.extract_text() for page in reader.pages)
        text = "\n".join(page_text for page_text in page_texts if page_text)
    return text

# Replace with your folder containing PDFs
pdf_folder = "path/to/pdf/folder"
documents = []  # List to store extracted document strings

# Iterate over PDF files in the folder
for pdf_file in os.listdir(pdf_folder):
    if pdf_file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder, pdf_file)
        document_text = extract_text(pdf_path)
        documents.append(document_text)  # Store extracted text
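A PDF with no text layer (a scanned document, for example) can come back as an empty string, which would then be summarized and scored like any other document. The snippet below is a minimal sketch of one way to screen for that before building test cases. The helper name `drop_empty_documents` and the sample strings are hypothetical and are not part of PyPDF2 or DeepEval.

```python
def drop_empty_documents(documents):
    """Return only the documents that contain non-whitespace text."""
    return [doc for doc in documents if doc and doc.strip()]


# Hypothetical extracted strings; the second and third carry no usable text
sample_documents = ["CONTRACT FOR SERVICES...", "", "   \n"]
usable_documents = drop_empty_documents(sample_documents)
print(len(usable_documents))
```

Whether to drop such documents or route them through OCR instead depends on your pipeline; this sketch only filters them out.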
Next, you'll need to pass these documents into your legal document summarizer, llm.summarize(), and generate a summary for each of them.
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from some_llm_library import llm # Replace with the actual LLM summarizer
# Convert document strings to test cases with LLM summaries
test_cases = [LLMTestCase(input=doc, actual_output=llm.summarize(doc)) for doc in documents]
# Create the evaluation dataset
dataset = EvaluationDataset(test_cases=test_cases)
An EvaluationDataset consists of a series of test cases. Each test case contains an input, which represents the document we feed into the summarizer, and an actual_output, which is the summary generated by the LLM. The DeepEval documentation covers test cases in more detail.
Keep in mind that, for the sake of this tutorial, our EvaluationDataset consists of 5 test cases (5 documents), and our first test_case corresponds to the service agreement we inspected when we first defined our evaluation criteria in the previous sections.
print(dataset.test_cases[0].input)
#CONTRACT FOR SERVICES...
print(dataset.test_cases[0].actual_output)
#This agreement establishes...
print(len(dataset.test_cases))
# 5
With that, let's begin running our first evaluation.
Running an Evaluation
To run an evaluation, first log in to Confident AI.
deepeval login
Then, pass the concision and completeness metrics we defined in the previous section, along with the dataset we just created, into the evaluate function.
from deepeval import evaluate
evaluate(dataset, metrics=[concision_metric, completeness_metric])
The evaluate function offers flexible customization for how you want to run evaluations, such as controlling concurrency for asynchronous operations or managing error handling in different ways. The DeepEval documentation describes these options in more detail.
Analyzing your Test Report
Once your evaluation is complete, you'll be redirected to the test cases page on Confident AI, which will display the evaluation report for the 5 document summaries we generated.
Each test case includes a status (pass or fail), an input (the document), and an actual output (the summary). A test case is considered failing if any one of its metric scores fails to meet that metric's threshold.
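The pass/fail rule described above can be sketched in plain Python. This is only an illustration of the logic, not DeepEval's implementation: the `MetricResult` class and `case_passes` helper are hypothetical, the scores and thresholds are made-up values, and the sketch assumes a metric passes when its score is at or above its threshold.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for one metric's outcome on a test case."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: a score at or above the threshold counts as a pass
        return self.score >= self.threshold


def case_passes(results):
    """A test case passes only if every one of its metrics passes."""
    return all(result.passed for result in results)


# Illustrative values only, not taken from the report
example_results = [
    MetricResult(name="Concision", score=0.8, threshold=0.5),
    MetricResult(name="Completeness", score=0.4, threshold=0.5),
]
failing_metrics = [r.name for r in example_results if not r.passed]
print(case_passes(example_results), failing_metrics)
```

Because one failing metric is enough to fail the whole test case, a summary with a strong score on one metric can still be flagged in the report.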
Identifying the Failing Test Case
As shown below, only one of the five summaries we generated failed to pass. Further inspection reveals that the failing test case was indeed the very first summary we generated when we defined our evaluation criteria. More specifically, while the summary achieved a concision score of 0.77, its completeness score fell below the 0.5 threshold.
Here is the failing test case:
failing_llm_test_case = LLMTestCase(
input="""
AGREEMENT FOR SOFTWARE DEVELOPMENT SERVICES
This Agreement ("Agreement") is made and entered into as of February 1, 2025, by and between Acme Solutions, Inc., a corporation duly organized and existing under the laws of the State of Delaware, with its principal place of business at 123 Tech Lane, San Francisco, CA 94107 ("Provider"), and BetaCorp LLC, a limited liability company duly organized and existing under the laws of the State of New York, with its principal place of business at 456 Business Street, New York, NY 10001 ("Client").
WHEREAS, Provider is engaged in the business of software development and consulting, specializing in the design, implementation, and maintenance of custom technology solutions;
WHEREAS, Client desires to engage Provider to perform software development and consulting services as outlined in this Agreement;
NOW, THEREFORE, in consideration of the mutual covenants and promises set forth herein, the parties agree as follows:
1. SERVICES
Provider agrees to develop, test, and deploy software solutions tailored to Client’s business needs as detailed in Exhibit A. Provider shall follow industry best practices and deliver work in accordance with the project timeline outlined therein. Any modifications or additional services requested by Client shall require a written change order, subject to additional fees and revised timelines.
Provider shall use commercially reasonable efforts to ensure that all deliverables are free of defects and meet the functional requirements agreed upon by both parties. Provider shall conduct periodic progress updates and report to Client at regular intervals throughout the engagement.
2. COMPENSATION
Client agrees to compensate Provider for the services rendered under this Agreement in the total amount of Fifty Thousand Dollars ($50,000). Payment shall be made in five (5) equal monthly installments of Ten Thousand Dollars ($10,000) each, payable on the first day of each month, beginning February 1, 2025, and concluding June 1, 2025.
In the event that additional work is required beyond the scope defined in Exhibit A, such work shall be billed at a rate of One Hundred Fifty Dollars ($150) per hour unless otherwise agreed in writing.
Failure to make timely payments may result in suspension of services until outstanding amounts are paid. If payments remain unpaid for more than thirty (30) days, a late fee of 1.5% per month shall be applied to any overdue amount.
3. INTELLECTUAL PROPERTY
All software, documentation, source code, and other materials produced under this Agreement shall be the exclusive property of Client. Provider agrees that all deliverables created shall be considered work-for-hire as defined under U.S. copyright law, and all rights shall be assigned to Client upon full payment.
Provider retains no rights to reuse, resell, or otherwise distribute any software developed under this Agreement unless explicitly authorized by Client in writing. Provider agrees not to use any proprietary Client materials for any purpose outside the scope of this Agreement.
4. CONFIDENTIALITY
Both parties acknowledge that during the term of this Agreement, they may have access to confidential and proprietary information. Each party agrees to maintain the confidentiality of such information and not disclose it to third parties without prior written consent.
Confidential information shall include but is not limited to trade secrets, business strategies, technical specifications, and any other non-public information. The obligations under this section shall survive the termination of this Agreement for a period of five (5) years.
5. WARRANTIES AND REPRESENTATIONS
Provider warrants that all services performed shall be carried out in a professional manner, consistent with industry standards. Provider further warrants that all software developed under this Agreement shall be free from material defects and shall perform substantially as specified for a period of ninety (90) days following delivery.
Client acknowledges that software development is inherently complex, and Provider does not warrant that the software will be completely error-free. However, Provider agrees to remedy any material defects reported within the warranty period at no additional cost.
6. TERM AND TERMINATION
This Agreement shall commence on February 1, 2025, and continue through August 1, 2025, unless extended or terminated earlier under the provisions herein. Either party may terminate this Agreement upon thirty (30) days' written notice to the other party in the event of a material breach, provided that the breaching party fails to cure such breach within fifteen (15) days of written notice.
Upon termination, Client shall pay Provider for all services rendered and work completed up to the effective date of termination. If termination occurs prior to the delivery of the final product, Client shall compensate Provider based on the percentage of work completed.
7. LIABILITY LIMITATIONS
In no event shall either party be liable for any indirect, incidental, special, or consequential damages arising from this Agreement, including but not limited to lost profits, data loss, or business interruption, even if advised of the possibility of such damages.
Provider's total aggregate liability under this Agreement shall not exceed the total fees paid by Client to Provider under this Agreement.
8. INDEMNIFICATION
Each party shall indemnify, defend, and hold harmless the other party and its respective officers, directors, employees, and agents from and against any claims, liabilities, damages, and expenses arising from any negligent act, omission, or breach of this Agreement.
Client agrees to indemnify Provider against any claims related to the use of the software in production, except where such claims arise from defects introduced by Provider.
9. FORCE MAJEURE
Neither party shall be liable for any failure or delay in performance under this Agreement due to causes beyond its reasonable control, including but not limited to acts of God, natural disasters, war, terrorism, government regulations, labor disputes, or power failures.
10. DISPUTE RESOLUTION
In the event of a dispute arising from this Agreement, the parties agree to first attempt resolution through good-faith negotiations. If a resolution cannot be reached within thirty (30) days, the dispute shall be resolved through binding arbitration conducted in the State of California in accordance with the rules of the American Arbitration Association.
11. GENERAL PROVISIONS
a) Governing Law: This Agreement shall be governed by and construed in accordance with the laws of the State of California.
b) Entire Agreement: This Agreement constitutes the entire understanding between the parties and supersedes all prior agreements, written or oral.
c) Assignment: Neither party may assign or transfer its rights or obligations under this Agreement without the prior written consent of the other party.
d) Notices: Any notices required under this Agreement shall be in writing and delivered to the respective party’s principal place of business.
e) Severability: If any provision of this Agreement is found to be unenforceable, the remainder of the Agreement shall remain in full force and effect.
IN WITNESS WHEREOF, the parties have executed this Agreement as of the Effective Date.
Signed,
Acme Solutions, Inc.
By: __________________________
Name: John Doe
Title: CEO
BetaCorp LLC
By: __________________________
Name: Jane Smith
Title: Managing Partner
""",
actual_output="This agreement outlines a software development and consulting arrangement between Acme Solutions, Inc. and BetaCorp LLC. Acme Solutions will provide software services as defined in an exhibit, with work commencing on February 1, 2025, and lasting for a set period unless extended. Compensation is structured as multiple installments, with penalties for late payments mentioned but not detailed. The contract also covers confidentiality, stating that proprietary information must be protected. Intellectual property rights grant ownership to the client, though usage restrictions are not fully explained. Dispute resolution involves arbitration, but jurisdiction specifics are not included here. Liability limitations and termination clauses are covered, but details on service expectations, warranty periods, and potential penalties are not fully elaborated. The contract also references force majeure conditions. Overall, the agreement defines responsibilities, payment terms, and legal considerations, though some specifics on financial penalties, intellectual property nuances, and governing laws are not fully captured in this summary.",
)
You can investigate why the summary failed to score above the passing threshold for the completeness metric by clicking on the view test case details button. This will show you the reason, which was generated automatically during evaluation.
The output captures the primary parties, service arrangement,
compensation, confidentiality, intellectual property, dispute
resolution, and force majeure, but omits exact payment terms
and governing law specifics, and does not detail service
expectations or warranty periods.
According to the reason, this test case fails because the summary omitted the exact payment terms, a key detail in the original document. This failure is unacceptable, since we previously established that our summarizer cannot afford to make such mistakes.
In the next section, we'll iterate on our summarizer's hyperparameters to improve the completeness of the summaries it generates, so that it meets the completeness metric's passing threshold in our next evaluation.