Catching Regressions from Prompt Template Changes
In the previous section, we attempted to improve our legal document summarizer's prompt template. We found that while the test case that originally failed now passes both the completeness and concision metrics, another test case has failed the concision metric.
In this section, we'll compare the results of our two evaluation test runs and identify further improvements to our summarizer's hyperparameters.
Confident AI allows you to compare test runs side by side, helping identify improvements and regressions so you can make the necessary adjustments to your hyperparameters.
Comparing Evaluation Test Runs
To compare evaluation test runs, go to the compare test runs page of the latest test run (the one using our improved prompt template) and select the test run ID of the first evaluation.
A test case is highlighted in red if any metric scores regress and in green if there are improvements. Clicking on a metric row lets you compare the metric score and reasons side by side.
Inspecting Concision Regression
As hypothesized, our first test case is now passing because the summary is complete. However, the third test case's concision regression is quite interesting. Comparing the metric reasons, the previous test run stated: "The actual output succinctly summarizes all essential elements from the original data." The new run says: "The output is overly repetitive, stating the same key information multiple times." The summary is not only lacking concision but is also repetitive.
View metric reason changes
new reason="The output is overly repetitive, stating the same key information multiple times, such as service details, penalty, confidentiality, liability, dispute resolution, and agreement terms. However, it includes essential information about parties, services, financials, confidentiality, liability, dispute resolution, and governance, promoting clarity and completeness."
old reason="The actual output succinctly summarizes all essential elements from the original data, including contract parties, services provided, response times, contract duration, payment terms, confidentiality, liability, dispute resolution, and governing law, without including any redundant information."
Let's try changing the model powering our legal document summarizer to see if the summaries become more robust.
model = "gpt-4o"
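For reference, here is a minimal sketch of how this hyperparameter change could be wired into the summarizer. The `summarize` helper and `prompt_template` variable are illustrative assumptions (they stand in for the implementation from the earlier sections) and use an OpenAI-style chat completion call:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical helper: assumes the prompt template from the previous sections
# is stored in `prompt_template` and that the document text is passed as the
# user message. The `model` argument is the hyperparameter we just changed.
def summarize(document: str, prompt_template: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt_template},
            {"role": "user", "content": document},
        ],
    )
    return response.choices[0].message.content
```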
Running Evaluations One Final Time
Changing the model resulted in all the test cases passing! However, it's important to acknowledge that our EvaluationDataset
consists of only five documents, an extremely small sample. As the number of documents grows, so does the potential for errors to surface. If your model consistently passes across a larger dataset, though, that is much stronger evidence of robustness.
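If you're following along in code, re-running the evaluation after the model change might look roughly like the sketch below. The placeholder document and summary values, and the metric criteria strings, are assumptions standing in for whatever you defined in the earlier sections:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Placeholder test case: in practice, reuse the EvaluationDataset of five
# legal documents built earlier, with actual_output regenerated by the
# summarizer now running on the updated model.
dataset = EvaluationDataset(
    test_cases=[
        LLMTestCase(
            input="<original legal document text>",
            actual_output="<summary produced by the gpt-4o-powered summarizer>",
        )
    ]
)

# Illustrative GEval metrics; swap in the exact criteria you used before.
completeness = GEval(
    name="Completeness",
    criteria="Does the summary cover all essential clauses and terms of the original document?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
concision = GEval(
    name="Concision",
    criteria="Is the summary free of repetition and unnecessary detail?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Creates a new test run that can then be compared against earlier runs on Confident AI.
evaluate(test_cases=dataset.test_cases, metrics=[completeness, concision])
```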
In the next and final section, we'll learn how to easily create and maintain a dataset and retrieve it for evaluations with just two lines of code. This will allow you to systematically evaluate your LLM summarizer at scale. Let's begin!