Evaluating LLMs in Production
Quick Summary
In the previous section, we set up our medical chatbot for production monitoring and learned how to leverage Confident AI to view and filter responses and traces. Now, it's time to evaluate them. When it comes to evaluating LLMs in production, there are two key aspects to focus on: online evaluations and human-in-the-loop feedback.
Before we begin, first make sure you are logged in to Confident AI:
deepeval login
Online Evaluations
Think of online evaluations as your first line of defense in identifying bad or failing responses. While they may not be as valuable as human feedback, they can still be incredibly effective in flagging potentially failing test cases. These flagged cases can then be reviewed by human evaluators, as it’s simply not feasible to manually review every single response logged in production.
It's important to note that metrics in production are referenceless metrics, since monitored responses have no ground-truth references (an expected output or expected context).
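To make this concrete, here's a minimal local sketch using DeepEval's Answer Relevancy metric, one of the referenceless metrics available for online evaluations. The example strings are made up, and running it locally assumes an OPENAI_API_KEY is set; the point is that the test case only needs an input, actual_output, and retrieval_context, with no expected_output:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# A referenceless test case: no expected output or ground-truth context required
test_case = LLMTestCase(
    input="I've had a dry cough for two weeks. Should I be worried?",
    actual_output="A cough lasting more than two weeks can point to a respiratory infection, so it may be worth booking an appointment.",
    retrieval_context=["Coughs persisting beyond two weeks may indicate bronchitis or another respiratory infection."],
)

metric = AnswerRelevancyMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.reason)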
Human-in-the-Loop Feedback
Human feedback goes beyond domain experts or dedicated reviewers—it also includes direct input from your users. This kind of feedback is essential for refining your model's performance. We’ll discuss how to collect and leverage user feedback in greater detail in the following sections.
Setting up Online Evaluations
OpenAI API key
It's extremely simple to set up online evaluations on Confident AI. Simply navigate to the settings page and input your OPENAI_API_KEY. This allows Confident AI to generate evaluation scores using OpenAI models.
While Confident AI uses OpenAI models by default, the platform fully supports custom models for online evaluations. For more information, feel free to contact us at support@confident-ai.com.
Turn on your Metrics
Next, navigate to the Online Evaluations page and scroll down to view the list of available referenceless metrics. Here, you can toggle metrics on or off, adjust thresholds for each metric, and optionally enable strict mode.
You can also define custom metrics in production with clear evaluation criteria or a set of evaluation steps, provided they are referenceless (i.e., they depend only on input, actual_output, and/or retrieval_context).
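If you're wondering what such a custom metric looks like in code, here's a rough sketch using DeepEval's GEval; the metric name and criteria below are hypothetical, and on Confident AI the equivalent is configured directly on the Online Evaluations page rather than in code:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Hypothetical custom metric that is referenceless: it only depends on the input and actual output
diagnosis_clarity = GEval(
    name="Diagnosis Clarity",
    criteria="Determine whether the actual output gives the patient a clear, actionable next step for the symptoms described in the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)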
Once the metrics are enabled, all incoming responses will be evaluated automatically. In the next section, we’ll explore how to filter these responses based on metric scores and incorporate human-in-the-loop feedback.
Human-in-the-Loop Evaluation
Metric-based Filtering
Notice that in the previous step, we toggled the following metrics: Answer Relevancy, Faithfulness, Bias, and Contextual Relevancy. Let's say we're trying to evaluate how our retriever (RAG engine tool) is performing in production. We'll need to look at all the responses that didn't pass the 0.5 threshold for Contextual Relevancy.
The observatory allows you to easily filter for failing responses based on metric scores, as well as default properties, hyperparameters, and any custom data. Visit Confident AI to try it out now.
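Note that hyperparameters and custom data only show up as filters if they were attached when the response was monitored. Here's a hedged sketch based on deepeval.monitor from the previous section; the hyperparameters and additional_data keyword arguments and all values shown are assumptions for illustration:

import deepeval

response_id = deepeval.monitor(
    event_name="Medical Chatbot",
    model="gpt-4",
    input="I've been coughing a lot lately. What could it be?",
    response="It may be a respiratory infection, possibly bronchitis.",
    # Assumed keyword arguments: anything attached here can later be used as a filter
    hyperparameters={"chunk_size": 512, "temperature": 0},
    additional_data={"user_tier": "free"},
)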
We'll examine this specific response, where our medical chatbot retrieved some information about coughing and pollution and diagnosed the user's condition as a respiratory infection, possibly bronchitis. To understand why contextual relevancy failed despite this response passing every other metric, we'll click inspect to take a closer look.
Inspecting Metric Scores
Navigate to the Metrics tab in the side panel to view the metric scores in detail. As with any metric in DeepEval, online evaluation metrics are supported by reasoning and detailed logs, enabling anyone reviewing these responses to easily understand why a specific metric is failing and trace the steps through the score calculations.
Scrolling down to Contextual Relevancy, we see that while the retrieved context is somewhat relevant (scoring 0.45), much of it discusses unrelated topics, primarily environmental problems, which may have caused the LLM to produce an unclear diagnosis due to the lack of relevant information.
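To dig into this kind of failure offline, for example before and after changing your retriever configuration, you can reproduce the same check locally with DeepEval's Contextual Relevancy metric. The test case below is a hypothetical reconstruction of the flagged response:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric

# Hypothetical reconstruction of the flagged production response
test_case = LLMTestCase(
    input="I've been coughing a lot lately. What could it be?",
    actual_output="It may be a respiratory infection, possibly bronchitis.",
    retrieval_context=[
        "Coughing can be triggered by respiratory infections such as bronchitis.",
        "Air pollution is a growing environmental problem in many large cities.",
    ],
)

metric = ContextualRelevancyMetric(threshold=0.5, include_reason=True)
metric.measure(test_case)
print(metric.score)   # low when most retrieved chunks are off-topic
print(metric.reason)  # explains which chunks were judged irrelevant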
Whether this contextual relevancy failure indicates a need to reduce chunk_size, increase the number of nodes in the retrieval context, or expand your knowledge base, it's crucial to track this response. We'll do so by leaving feedback.
Leaving Human Feedback
For each response, you'll find an option to leave feedback above the various sub-tabs. For this particular response, let's assign a rating of 2 stars, citing the lack of comprehensive context leading to an unclear diagnosis. However, the answer remains relevant, unbiased, and faithful.
You may optionally provide an expected response, which can be helpful if you plan to add this response to a dataset for further evaluation and testing. We'll explore how to add responses to a dataset in later sections.
You can also leave feedback on entire conversations instead of individual responses. To do this, click on the conversation ID you’re interested in and click Leave Feedback, where you'll encounter a familiar interface. As with individual responses, you can also add these monitored conversations to datasets.
Inspecting Human Feedback
All feedback, whether individual or conversational, can be accessed on the Human Feedback page. Here, you can filter feedback based on various criteria such as provider, rating, expected response, and more. To add responses to a dataset, simply check the relevant feedback, go to actions, and click add response to dataset.
These feedback-based filters are also available in the Observatory, meaning you can filter for responses based on various human feedback variables.
We've created a dataset called Failing Responses, specifically designed for collecting feedback on failing responses. This is the same evaluation dataset referenced in the Preparing Your Evaluation Dataset section. By aggregating these failing responses, you can further refine and improve your LLM's performance in future testing environments.
It may be helpful to categorize different types of failing feedback into separate datasets (e.g., one for retriever failures, another for LLM failures, etc.).
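Once a dataset like Failing Responses has accumulated enough cases, you can pull it back into code and re-run it against an improved version of your chatbot. A rough sketch, assuming the pulled responses already include actual outputs (otherwise you'd regenerate them with your updated system first):

from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import ContextualRelevancyMetric

# Pull the dataset curated from human feedback on Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="Failing Responses")

# Re-evaluate the previously failing cases, e.g. after tuning the retriever
evaluate(dataset.test_cases, metrics=[ContextualRelevancyMetric(threshold=0.5)])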
User-Provided Feedback
In addition to leaving feedback from the developer's side, you can also set up your LLM to receive user feedback with just one line of code. Here's how to set it up:
import deepeval

response_id = deepeval.monitor(...)

deepeval.send_feedback(
    response_id=response_id,
    rating=5,
    explanation="...",
    expected_response="..."
)
For example, in our medical chatbot, you might want to ask users if they were satisfied with the service. While this feedback may not be as detailed as that provided by your team, it is far more abundant and can offer insights into the areas of your LLM that drive the most value. Here’s how the code might look:
import deepeval
import time

class MedicalAppointmentSystem():
    ...
    def interactive_session(self):
        print("Welcome to the Medical Diagnosis and Booking System!")
        print("Please enter your symptoms or ask about appointment details.")
        while True:
            user_input = input("Your query: ")
            if user_input.lower() == 'exit':
                # Gather user feedback before exiting
                try:
                    satisfied = input("Were you satisfied with the response? (yes/no): ").strip().lower()
                    if satisfied not in ['yes', 'no']:
                        print("Invalid input. Assuming 'no'.")
                        satisfied = 'no'
                    # Map the binary answer onto the 1-5 scale (lowest rating assumed for 'no')
                    rating = 5 if satisfied == 'yes' else 1
                    explanation = input("Optional: Please provide an explanation (or press Enter to skip): ").strip()
                    deepeval.send_feedback(
                        response_id="place_holder_response_id",  # use the response_id returned by deepeval.monitor()
                        rating=rating,
                        explanation=explanation if explanation else "No explanation provided.",
                    )
                    print("Thank you for your feedback!")
                except Exception as e:
                    print(f"An error occurred while sending feedback: {e}")
                break
            ...
Balancing the user experience against the level of detail in the feedback you collect is essential. For instance, although ratings range from 1 to 5, we present users with a simple binary choice (satisfied or not) and map it onto that scale, trading granularity for a higher response rate.