Evaluating LLMs in Production

Quick Summary

In the previous section, we set up our medical chatbot for production monitoring and learned how to leverage Confident AI to view and filter responses and traces. Now, it's time to evaluate them. When it comes to evaluating LLMs in production, there are two key aspects to focus on: online evaluations and human-in-the-loop feedback.

Before we begin, make sure you are logged in to Confident AI:

deepeval login

Online Evaluations

Think of online evaluations as your first line of defense in identifying bad or failing responses. While they may not be as valuable as human feedback, they can still be incredibly effective in flagging potentially failing test cases. These flagged cases can then be reviewed by human evaluators, as it’s simply not feasible to manually review every single response logged in production.

note

It's important to note that metrics in production are referenceless metrics, because monitored responses do not have ground-truth references (an expected output or context).
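For intuition, here's a minimal sketch of what such a test case looks like using DeepEval's LLMTestCase: only the input, the generated output, and the retrieved context are available, with no expected_output or ground-truth context to compare against (the values below are made up for illustration).

from deepeval.test_case import LLMTestCase

# A referenceless, production-style test case: no expected_output or
# ground-truth context, only what was actually observed at runtime.
production_test_case = LLMTestCase(
    input="I've had a persistent cough for two weeks. What could it be?",
    actual_output="It may be a respiratory infection such as bronchitis...",
    retrieval_context=[
        "Coughs lasting more than three weeks may indicate bronchitis.",
        "Air pollution can aggravate existing respiratory conditions.",
    ],
)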

Human-in-the-Loop Feedback

Human feedback goes beyond domain experts or dedicated reviewers—it also includes direct input from your users. This kind of feedback is essential for refining your model's performance. We’ll discuss how to collect and leverage user feedback in greater detail in the following sections.

Setting up Online Evaluations

OpenAI API key

Setting up online evaluations on Confident AI is simple: navigate to the settings page and input your OPENAI_API_KEY. This allows Confident AI to generate evaluation scores using OpenAI models.

tip

While Confident AI uses OpenAI models by default, the platform fully supports custom models for online evaluations. For more information, feel free to contact us at support@confident-ai.com.

Turn on your Metrics

Next, navigate to the Online Evaluations page and scroll down to view the list of available referenceless metrics. Here, you can toggle metrics on or off, adjust thresholds for each metric, and optionally enable strict mode.

info

You can also define custom metrics in production with clear evaluation criteria or a set of evaluation steps, provided they are referenceless (i.e., they depend only on the input, actual_output, and/or retrieval_context).
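For reference, the platform-side configuration mirrors what a custom GEval metric looks like in DeepEval code. A rough sketch, where the metric name and criteria are illustrative rather than part of this tutorial:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative referenceless custom metric: it depends only on the
# input and actual_output, so it can run on monitored production responses.
diagnosis_clarity = GEval(
    name="Diagnosis Clarity",
    criteria="Determine whether the actual output gives a clear, actionable answer to the medical query in the input.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.5,
)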

Once the metrics are enabled, all incoming responses will be evaluated automatically. In the next section, we’ll explore how to filter these responses based on metric scores and incorporate human-in-the-loop feedback.

Human-in-the-Loop Evaluation

Metric-based Filtering

Notice that in the previous step, we toggled the following metrics: Answer Relevancy, Faithfulness, Bias, and Contextual Relevancy. Let's say we're trying to evaluate how our retriever (RAG engine tool) is performing in production. We'll need to look at all the responses that didn't pass the 0.5 threshold for Contextual Relevancy.

info

The observatory allows you to easily filter for failing responses based on metric scores, as well as default properties, hyperparameters, and any custom data. Visit Confident AI to try it out now.

We'll examine this specific response, where our medical chatbot retrieved some information about coughing and pollution and diagnosed the user's condition as a respiratory infection, possibly bronchitis. To understand why contextual relevancy failed despite this response passing every other metric, we'll click inspect to take a closer look.

Inspecting Metric Scores

Navigate to the Metrics tab in the side panel to view the metric scores in detail. As with any metric in DeepEval, online evaluation metrics are supported by reasoning and detailed logs, enabling anyone reviewing these responses to easily understand why a specific metric is failing and trace the steps through the score calculations.

Scrolling down to Contextual Relevancy, we see that while the retrieved context is somewhat relevant (scoring 0.45), much of it discusses unrelated topics, primarily environmental problems. This lack of relevant information likely caused the LLM to generate an unclear diagnosis.

info

Whether this contextual relevancy failure indicates a need to reduce chunk_size, increase the number of nodes in the retrieval context, or expand your knowledge base, it's crucial to track this response. We'll achieve this by leaving feedback.
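If the retriever does turn out to be the culprit, these knobs usually live in the RAG engine's configuration. Here's a rough sketch assuming a LlamaIndex-style retriever similar to our RAG engine tool; the knowledge base path and exact values are placeholders:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Placeholder knowledge base path; swap in your own documents.
documents = SimpleDirectoryReader("./medical_knowledge_base").load_data()

# Smaller chunks tend to keep each retrieved node on-topic...
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=32)
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])

# ...while a higher similarity_top_k retrieves more nodes per query.
query_engine = index.as_query_engine(similarity_top_k=5)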

Leaving Human Feedback

For each response, you'll find an option to leave feedback above the various sub-tabs. For this particular response, let's assign a rating of 2 stars, citing the lack of comprehensive context leading to an unclear diagnosis. However, the answer remains relevant, unbiased, and faithful.

tip

You may optionally provide an expected response, which can be helpful if you plan to add this response to a dataset for further evaluation and testing. We'll explore how to add responses to a dataset in later sections.

You can also leave feedback on entire conversations instead of individual responses. To do this, click on the conversation ID you’re interested in and click Leave Feedback, where you'll encounter a familiar interface. As with individual responses, you can also add these monitored conversations to datasets.

Inspecting Human Feedback

All feedback, whether individual or conversational, can be accessed on the Human Feedback page. Here, you can filter feedback based on various criteria such as provider, rating, expected response, and more. To add responses to a dataset, simply check the relevant feedback, go to actions, and click add response to dataset.

info

These feedback-based filters are also available in the Observatory, meaning you can filter for responses based on various human feedback variables.

Feedback Page

We've created a dataset called Failing Responses, specifically designed for collecting feedback on failing responses. This is the same evaluation dataset referenced in the Preparing Your Evaluation Dataset section. By aggregating these failing responses, you can further refine and improve your LLM's performance in future testing environments.
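When you're ready to re-run these failing responses, the dataset can be pulled from Confident AI by its alias. A minimal sketch, assuming the alias matches the dataset name above:

from deepeval.dataset import EvaluationDataset

# Pull the curated production failures from Confident AI by alias,
# so they can be re-run against your metrics in future test suites.
dataset = EvaluationDataset()
dataset.pull(alias="Failing Responses")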

tip

It may be helpful to categorize different types of failing feedback into separate datasets (e.g., one for retriever failures, another for LLM failures, etc.).

Failing Responses Dataset

User Provided Feedback

In addition to leaving feedback from the developer's side, you can also collect feedback directly from your users with just one line of code. Here's how to set it up:

import deepeval

response_id = deepeval.monitor(...)

deepeval.send_feedback(
    response_id=response_id,
    rating=5,
    explanation="...",
    expected_response="..."
)
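For context, here's roughly what that pairing might look like with the monitoring call filled in. The argument names follow the monitoring setup from the previous section, and the values are placeholders you should adapt to your own application:

import deepeval

# Monitor a single chatbot response (placeholder arguments; reuse the
# monitoring call you set up in the previous section).
response_id = deepeval.monitor(
    event_name="Medical Chatbot",
    model="gpt-4",
    input="I've had a dry cough for two weeks.",
    response="It could be a respiratory infection, possibly bronchitis...",
)

# Attach the user's feedback to that specific response.
deepeval.send_feedback(
    response_id=response_id,
    rating=4,
    explanation="Helpful, but the diagnosis felt vague.",
)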

For example, in our medical chatbot, you might want to ask users if they were satisfied with the service. While this feedback may not be as detailed as that provided by your team, it is far more abundant and can offer insights into the areas of your LLM that drive the most value. Here’s how the code might look:

import deepeval
import time

class MedicalAppointmentSystem():
    ...
    def interactive_session(self):
        print("Welcome to the Medical Diagnosis and Booking System!")
        print("Please enter your symptoms or ask about appointment details.")

        while True:
            user_input = input("Your query: ")

            if user_input.lower() == 'exit':
                # Gather user feedback before exiting
                try:
                    satisfied = input("Were you satisfied with the response? (yes/no): ").strip().lower()
                    if satisfied not in ['yes', 'no']:
                        print("Invalid input. Assuming 'no'.")
                        satisfied = 'no'
                    # Map the binary answer onto the 1-5 rating scale
                    rating = 5 if satisfied == 'yes' else 1
                    explanation = input("Optional: Please provide an explanation (or press Enter to skip): ").strip()

                    deepeval.send_feedback(
                        response_id="place_holder_response_id",
                        rating=rating,
                        explanation=explanation if explanation else "No explanation provided.",
                    )
                    print("Thank you for your feedback!")

                except Exception as e:
                    print(f"An error occurred while sending feedback: {e}")
                break
            ...
info

Balancing user satisfaction with the level of detail in feedback is essential. For instance, while ratings are on a scale from 1 to 5, we simplify the question for the user into a binary option: whether or not they were satisfied.