Running an Evaluation
With our metrics in place, we're finally ready to run our first evaluation. Using the five input-response pairs we reviewed in the previous section, we'll construct a dataset, evaluate it using the three metrics we defined, and analyze the results on Confident AI.
Creating Your Dataset
The first step is to create a dataset that contains the five previous input-response pairs as LLMTestCase objects.
from deepeval.test_case import LLMTestCase

# test cases truncated for better readability...
test_case_0 = LLMTestCase(
    input="Hey, I've been experiencing headaches lately.",
    actual_output="I can help you with that. Could you pleas...",
    retrieval_context=["Headaches can have various causes, in..."]
)
test_case_1 = LLMTestCase(
    input="I'm been coughing a lot lately.",
    actual_output="Could you please provide your name, the da..."
)
test_case_2 = LLMTestCase(
    input="I'm feeling pretty sick, can I book an appointment...",
    actual_output="Could you please provide more specific det..."
)
test_case_3 = LLMTestCase(
    input="My name is John Doe. My email is johndoe@gmail.com...",
    actual_output="Please confirm the following details: Name..."
)
test_case_4 = LLMTestCase(
    input="My nose has been running constantly. Could it be s...",
    actual_output="A runny nose is always caused by harmless ...",
    retrieval_context=["A runny nose can be caused by common ..."]
)
Although we've precomputed the actual_output and retrieval_context for each of the test cases in the example above, you'll ideally generate these parameters from your LLM application at evaluation time.
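For example, if your chatbot exposes a single function that returns both the generated reply and the chunks it retrieved, you can build each LLMTestCase on the fly at evaluation time. Here's a minimal sketch, where query_medical_chatbot() is a hypothetical stand-in for your own application code:

from deepeval.test_case import LLMTestCase

def query_medical_chatbot(user_input: str) -> tuple[str, list[str]]:
    # Hypothetical helper: call your actual LLM application here and return
    # the generated reply along with the retrieved chunks it was grounded on.
    reply = "placeholder response"  # replace with your chatbot's answer
    retrieved_chunks = ["placeholder chunk"]  # replace with your retriever's output
    return reply, retrieved_chunks

test_cases = []
for user_input in [
    "Hey, I've been experiencing headaches lately.",
    "I'm been coughing a lot lately.",
]:
    actual_output, retrieval_context = query_medical_chatbot(user_input)
    test_cases.append(
        LLMTestCase(
            input=user_input,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )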
The parameters we need to populate each LLMTestCase with depend on the metrics we've chosen and what parameters they require; the union of those required parameters is what each LLMTestCase must define. You can find the LLMTestCase parameters required by each metric on its respective documentation page:
- AnswerRelevancyMetric
- FaithfulnessMetric
- GEval (for professionalism)
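For reference, AnswerRelevancyMetric and FaithfulnessMetric have fixed required parameters (FaithfulnessMetric is the one that additionally needs retrieval_context), while GEval only requires the LLMTestCase fields you list in its evaluation_params. In our case the union works out to input, actual_output, and, where retrieval happens, retrieval_context, which is why test_case_0 above populates all three. The sketch below is purely illustrative and not necessarily the exact professionalism metric defined in the earlier section:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative only: the LLMTestCase fields listed in evaluation_params are
# exactly the parameters this metric will require at evaluation time.
professionalism_metric = GEval(
    name="Professionalism",
    # Placeholder criteria, not the exact wording used in the earlier section.
    criteria="Determine whether the actual output maintains a professional and empathetic tone.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)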
However, you'll notice that although the FaithfulnessMetric requires retrieval_context, some LLMTestCase objects simply don't have one. This is because some LLM interactions don't involve retrieval at all, and in those cases you should leave retrieval_context out entirely rather than provide an empty list. Under normal circumstances this would cause the metric to error during evaluation, but as you'll learn in later sections, you can set the skip_on_missing_params argument to True so that metrics are skipped for any LLMTestCase that is missing their required parameters:
...
evaluate(..., skip_on_missing_params=True)
Finally, we'll define an EvaluationDataset to group together the collection of LLMTestCase objects we've defined.
from deepeval.dataset import EvaluationDataset
...
dataset = EvaluationDataset(
    test_cases=[test_case_0, test_case_1, test_case_2, test_case_3, test_case_4]
)
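If you'd rather build the dataset up incrementally, for example while looping over stored inputs and calling your LLM application for each one, EvaluationDataset also exposes an add_test_case() method. A quick sketch of the equivalent construction:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
for test_case in [test_case_0, test_case_1, test_case_2, test_case_3, test_case_4]:
    # Append each LLMTestCase to the dataset one at a time.
    dataset.add_test_case(test_case)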
Running An Evaluation
With our dataset and metrics ready, we can run our first evaluation. Import the evaluate() function from deepeval and provide your dataset and metrics in the appropriate fields. Optionally, you can also log your LLM application's hyperparameters, which will be extremely important for optimizing them in future iterations.
from deepeval import evaluate

...
evaluate(
    dataset,
    metrics=[answer_relevancy_metric, faithfulness_metric, professionalism_metric],
    skip_on_missing_params=True,
    hyperparameters={"model": MODEL, "prompt template": SYSTEM_PROMPT, "temperature": TEMPERATURE}
)
You should definitely learn more about the ways in which you can customize the evaluate() function here.
If you don't know where the metrics are coming from, you should go back and revisit this section on defining metrics in deepeval. Similarly, if you don't know where MODEL, SYSTEM_PROMPT, and TEMPERATURE are coming from, revisit this section.
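If you're following along without that section handy, these are just ordinary Python values captured from your chatbot's configuration. The values below are placeholders for illustration only, not the ones used earlier:

# Placeholder hyperparameters for illustration only -- substitute the values
# your chatbot actually runs with, as defined in the earlier section.
MODEL = "gpt-4o"
TEMPERATURE = 0.7
SYSTEM_PROMPT = "You are a professional, empathetic medical assistant..."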
Analyzing Your Testing Report
Once your evaluation is complete, you'll be redirected to a page on Confident AI containing the testing report generated for the five test cases we defined and evaluated in the previous steps. Each test case displays its status (pass or fail), its input, and the corresponding actual output. Clicking on a test case reveals the remaining parameters and metric scores.
A test case is only considered passing if it exceeds the threshold for all the metrics it was evaluated on.
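The threshold is something you set on each metric when you define it; deepeval metrics accept a threshold argument for exactly this purpose. As a quick illustration (the 0.7 below is an assumed value, not necessarily the threshold used when we defined our metrics earlier):

from deepeval.metrics import AnswerRelevancyMetric

# A test case only passes this metric if its score meets the threshold,
# and it must do so for every metric it is evaluated on to pass overall.
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)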
You can clearly see that while two of our test cases passed, three failed. In the next steps, we'll go through each result, test case by test case, to understand what caused the failures; this will help us make the necessary improvements to our LLM application in subsequent sections.
First Test Case
Here, you can see that our first test case excels across all metrics, achieving perfect scores in Answer Relevancy and Faithfulness, and scoring 0.93 in Professionalism. Since nothing is failing, we don't need to inspect this test case further.
Second Test Case
On the other hand, the second test case passes Faithfulness and Professionalism but scores 0 on Answer Relevancy, which aligns with the expectations we set for this particular test case when we defined our evaluation criteria.
To further analyze the metric scores, click on the inspect icon located at the top right-hand corner of each test case.
Confident AI makes it easy to analyze metric scores by offering clear reasons why each metric score passed or failed, along with detailed verbose logs for visibility into how each score was calculated.
The reason also aligns with our expectations, since the test case is failing because the output contains irrelevant information about appointment details, which does not address the concern about coughing.
Third Test Case
Like the first test case, the third test case passes all the metrics with near-perfect scores, so we won't need to inspect it further.
Fourth Test Case
The fourth test case narrowly failed Professionalism with a score of 0.61, and only narrowly passed Answer Relevancy with a score of 0.71.
Test cases with borderline metric scores require extra attention, as they tread a fine line between acceptable and unacceptable performance. Careful consideration is needed to determine whether adjustments to your LLM application are necessary, since any change could have knock-on effects on other parts of your application that you might not anticipate.
Analyzing borderline test cases in a CSV or JSON file can be exhausting and tedious. Fortunately, Confident AI simplifies this process by allowing you to filter by metric scores and score ranges, making it easy to identify the test cases you care about.
Let's analyze why Professionalism is failing. The provided reason states that while the language is clear and uses appropriate medical terms, it lacks empathy and support in its tone, which is inconsistent with professional medical norms.
Fifth Test Case
Finally, let's examine the 5th and final test case, which fails all 3 metrics.
Inspecting Answer Relevancy reveals that the response includes multiple irrelevant statements and fails to address the seriousness of a constantly runny nose, which was the primary focus of the input.
Analyzing Professionalism and Faithfulness shows that Professionalism fails because the response is overly simplistic and dismissive of potential concerns, while Faithfulness fails because the output incorrectly claims that a runny nose is only caused by non-serious conditions, failing to acknowledge more serious possibilities like CSF leaks or chronic sinusitis.
Summary
Our evaluation results revealed failures in test cases 2, 4, and 5, highlighting areas that need targeted enhancements to improve the LLM’s accuracy, tone, and contextual understanding.
- Second test case: Failed Answer Relevancy because the response included irrelevant details, such as appointment information, rather than addressing the primary concern about coughing. To improve, the LLM’s context-awareness should better prioritize user intent and exclude extraneous information.
- Fourth test case: Failed Professionalism because its tone lacked empathy and support, and only narrowly passed Answer Relevancy with a borderline score. Improvements should focus on refining the tone to ensure empathetic, professional responses and on enhancing content prioritization to align more effectively with queries.
- Fifth test case: Failed Answer Relevancy, Professionalism, and Faithfulness. The response included irrelevant statements, was dismissive of serious concerns, and contained factual inaccuracies, such as failing to acknowledge serious conditions like CSF leaks. Fixes require strengthening factual accuracy, improving context parsing, and refining tone calibration.
By addressing these failures and adjusting the relevant hyperparameters, our medical chatbot can deliver more accurate, empathetic, and contextually aligned responses, ensuring better performance and user satisfaction.
Confident AI offers a suite of features to make optimizing hyperparameters easy and effective. We'll explore how to use these tools in the upcoming section.