Defining Evaluation Criteria

The metrics you choose to evaluate your LLM depend on the evaluation criteria you define. These criteria will be used to assess the specific aspects of your LLM that you wish to improve, ensuring they align with your goals and user needs.

Defining clear evaluation criteria is essential, as it transforms your subjective values (what matters to you and your users, and what you wish to improve) into actionable steps (selecting the right metrics).

tip

When defining your evaluation criteria, you'll need to consider what matters most to you, what your users care about, and where your model's performance could be enhanced. If this isn't entirely clear at first, it might be helpful to inspect a few input-response pairs from your LLM application.
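
For example, a minimal sketch of collecting pairs for manual review might look like this (generate_response is a hypothetical stand-in for your LLM application, not a prescribed interface):

# A minimal sketch for eyeballing input-response pairs.
# `generate_response` is a hypothetical stand-in for your LLM application.
def generate_response(user_query: str) -> str:
    # Replace this with a real call to your chatbot.
    return "..."

sample_queries = [
    "Hey, I've been experiencing headaches lately.",
    "I've been coughing a lot lately.",
    "I'm feeling pretty sick, can I book an appointment tomorrow?",
]

for user_query in sample_queries:
    llm_output = generate_response(user_query)
    print(f"Input:    {user_query}")
    print(f"Response: {llm_output}\n")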

Eyeballing LLM Responses

Although eyeballing LLM responses is not a scalable way to evaluate LLMs, it is nevertheless a great way to identify what makes a good response good, and what makes a bad response bad.

Let's begin by examining these 5 input-response pairs between some hypothetical users and the medical diagnosis chatbot that we set up in the previous section. We'll assess whether each interaction meets our expectations or highlights potential issues, which will help us refine our evaluation criteria.

Input-Response Pair 1 ✅

The following interaction is an ideal example of how our medical chatbot should respond, because it asks clarifying questions before attempting a diagnosis. As a medical company, making accurate diagnoses is extremely important, so it's necessary to ask these kinds of questions before jumping to conclusions.

user_query = "Hey, I've been experiencing headaches lately."

llm_output = """I can help you with that. Could you please provide more specific details about
your headaches? For example, the location, intensity, duration, frequency, and
any other symptoms you might be experiencing alongside the headaches."""

Input-Response Pair 2 ❌

On the other hand, the following interaction is problematic. While the user is primarily seeking a diagnosis, the chatbot unnecessarily asks for the patient’s name and appointment details before the user has requested any appointment at all.

user_query = "I'm been coughing a lot lately."

llm_output = "Could you please provide your name, the date for the appointment, and your email address?"

Input-Response Pair 3 ❌

For a similar reason in the opposite direction, the following interaction is also undesirable: while the user is clearly trying to book an appointment, the chatbot responds tangentially by asking about the user’s symptoms rather than addressing the appointment request.

user_query = "I'm feeling pretty sick, can I book an appointment tomorrow?"

llm_output = """I can help you with that. Could you please provide more
specific details about your symptoms? This will help us in making a more accurate diagnosis."""

While some may argue that the order of gathering appointment details and making a diagnosis isn’t critical—since both tasks will likely be completed—it’s essential for our medical chatbot to provide responses that are relevant and directly address the user’s query. This approach ensures the chatbot appears intelligent and authentic, building trust and encouraging patients to rely on it for their diagnostic needs.

Input-Response Pair 4 ✅

The following interaction is acceptable because the chatbot accurately records the user’s information, stores it in the appropriate fields, and presents it back to the user for confirmation. (The symptoms and diagnosis shown below were presumably gathered earlier in the conversation.)

user_query = """My name is John Doe. My email is johndoe@gmail.com, and I would
like to make an appointment on November 23, 2022."""

llm_output = """Please confirm the following details: Name: John Doe, Email: johndoe@gmail.com,
Date: November 23, 2022, Symptoms: Frequent coughing, Diagnosis: Frequent coughing
may indicate chronic bronchitis, especially if accompanied by phlegm, wheezing, or
shortness of breath.. Type 'confirm' to finalize or 'edit' to modify."""
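
For illustration only (the field names here are hypothetical, not a format the chatbot actually uses), the "appropriate fields" the chatbot populates before asking for confirmation might look something like this:

from dataclasses import dataclass

# Hypothetical sketch of the fields the chatbot is expected to populate
# before presenting them back to the user for confirmation.
@dataclass
class AppointmentDetails:
    name: str
    email: str
    date: str
    symptoms: str
    diagnosis: str

details = AppointmentDetails(
    name="John Doe",
    email="johndoe@gmail.com",
    date="November 23, 2022",
    symptoms="Frequent coughing",
    diagnosis="Frequent coughing may indicate chronic bronchitis.",
)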

Input-Response Pair 5 ❌

The following input-response pair is unacceptable due to its unprofessional tone, characterized by dismissiveness and a lack of empathy, and its failure to adequately address the user’s concern:

user_query = "My nose has been running constantly. Could it be something serious?"

llm_output = """It’s probably nothing to worry about. Most likely just allergies or
a cold. Don’t overthink it."""

By inspecting just five example interactions with the medical chatbot (two acceptable and three problematic), we've identified the following values that matter to us and, we believe, to our users:

  • Direct engagement: Users expect the chatbot to address them directly, guiding the flow of the conversation in a way that feels natural and responsive.
  • Accuracy in diagnoses: Users depend on accurate responses that neither overdiagnose nor underdiagnose based on the symptoms provided.
  • Trustworthiness and professionalism: Every interaction should reinforce the chatbot’s reliability, fostering confidence in its capability as a medical assistant.

Defining Your Evaluation Criteria

Based on these insights, we identified 3 specific areas where our LLM needs improvement:

  1. Directly addressing the user: The chatbot should engage users personally and conversationally to build trust and maintain control of the interaction.
  2. Providing accurate diagnoses: Diagnoses must be reliable and based on the provided symptoms, minimizing risks and errors.
  3. Providing professional responses: Responses should be clear and respectful, adhering to the standards of professionalism expected in a medical setting.

important

These areas of improvement represent your evaluation criteria—standards for assessing each input-response pair.
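
As a rough sketch (the structure and names here are illustrative, not a prescribed format), you could capture these criteria as plain data so they can later be mapped to metrics:

# An illustrative sketch: evaluation criteria as plain data,
# ready to be mapped to whichever metrics you select later.
evaluation_criteria = [
    {
        "name": "Directly Addressing the User",
        "description": "The response engages the user personally and addresses their actual query.",
    },
    {
        "name": "Diagnosis Accuracy",
        "description": "Diagnoses are grounded in the symptoms provided, without over- or under-diagnosing.",
    },
    {
        "name": "Professionalism",
        "description": "The tone is respectful and empathetic, as expected in a medical setting.",
    },
]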

Why These Criteria Matter

Each criterion reflects user expectations and our values:

  • Directly addressing the user ensures two essential outcomes: users feel in control of the interaction, and the chatbot is perceived as an intelligent, trustworthy assistant.
  • Providing accurate diagnoses safeguards against false positives (overdiagnosis) and false negatives (underdiagnosis), both of which could erode trust and lead to severe consequences.
  • Providing professional responses ensures the chatbot's responses are respectful, empathetic, and aligned with the tone expected in a medical setting. This fosters user confidence and encourages reliance on the chatbot as a credible medical assistant.

By focusing on these criteria, we ensure our evaluation process aligns with both our values and user needs. With this foundation in place, we can now proceed to the next step: selecting metrics that effectively measure these key areas of performance.
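
As a preview, here is one possible way a criterion could be turned into a metric, assuming you use deepeval's GEval, an LLM-as-a-judge metric (this requires a judge model, configured via OPENAI_API_KEY by default, and is just one mapping among many):

# A sketch: turning the "professional responses" criterion into a GEval metric.
# Assumes deepeval is installed (`pip install deepeval`) and a judge model is configured.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

professionalism = GEval(
    name="Professionalism",
    criteria="Determine whether the actual output is respectful, empathetic, and professional, as expected in a medical setting.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="My nose has been running constantly. Could it be something serious?",
    actual_output="It's probably nothing to worry about. Most likely just allergies or a cold. Don't overthink it.",
)

professionalism.measure(test_case)
print(professionalism.score, professionalism.reason)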