Selecting Your Metrics
Once you have clearly defined evaluation criteria, selecting metrics becomes significantly easier. In some cases, you may find existing metrics in DeepEval that already match your criteria; in others, you'll need to create custom metrics to address your unique evaluation needs.
DeepEval provides 14+ metrics to help you evaluate your LLM. Familiarizing yourself with these metrics can help you choose the ones that best align with your evaluation criteria.
Selecting Metrics Relevant To Your Criteria
In this section, we'll select the LLM evaluation metrics for our medical chatbot based on the evaluation criteria we established in the previous section. Let's quickly revisit these criteria:
- Directly addressing the user: The chatbot should directly address users' requests
- Providing accurate diagnoses: Diagnoses must be reliable and based on the provided symptoms
- Providing professional responses: Responses should be clear and respectful
Answer Relevancy
Let's start with our first metric, which will evaluate our medical chatbot against our first criterion:
Criterion 1: The medical chatbot should address the user directly.
Currently, our chatbot sometimes fails to directly address user queries, instead taking the lead in the conversation by, for example, asking for appointment details rather than focusing on diagnosing the patient. This results in responses that only tangentially address the user's input. To address this, we need to evaluate how relevant the chatbot's responses are to the user's queries.
For this, you can leverage deepeval's default AnswerRelevancyMetric, which is available out-of-the-box and evaluates how relevant an LLM's output is to the input.
The AnswerRelevancyMetric uses an LLM to extract all statements from the actual_output and then classifies each statement's relevance to the input using the same LLM. You can read more about how each default metric is calculated by visiting its individual metric page.
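To make this concrete, here is a minimal sketch of how the AnswerRelevancyMetric could score a single interaction. The input, actual_output, and threshold below are hypothetical and chosen purely to illustrate the off-topic behavior described above:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical interaction: the user asks about symptoms, but the chatbot
# pivots to appointment booking instead of addressing the question.
test_case = LLMTestCase(
    input="I've had a sore throat and a mild fever for two days. What could it be?",
    actual_output="Could you share your insurance details so I can book you an appointment?"
)

metric = AnswerRelevancyMetric(threshold=0.7)  # illustrative passing threshold
metric.measure(test_case)
print(metric.score)   # relevancy score between 0 and 1
print(metric.reason)  # LLM-generated explanation for the score

A response that sidesteps the medical question like this one would be expected to score low, flagging exactly the failure mode we want to catch.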
Faithfulness
Our next metric addresses the inaccuracies in patient diagnoses. The chatbot's failure to deliver accurate diagnoses in some example interactions suggests that our RAG tool needs improvement.
Criterion 2: The chatbot should provide accurate diagnoses based on the given symptoms.
This is because the RAG engine is responsible for retrieving relevant medical information from our knowledge base to support patient diagnoses. To address this, we need to evaluate specifically whether the information in the actual output factually aligns with the information in the retrieved chunks.
deepeval's FaithfulnessMetric is well-suited for this task. It assesses whether the actual_output factually aligns with the contents of the retrieval_context.
deepeval offers a total of 5 RAG metrics to evaluate your RAG pipeline. To learn more about selecting the right metrics for your use case, check out this in-depth guide on RAG evaluation.
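As a rough sketch of how this looks in practice, the FaithfulnessMetric is measured against a test case that also carries the retrieval_context returned by the RAG engine. The symptoms, diagnosis, and retrieved chunk below are invented purely for illustration:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# Hypothetical test case: retrieval_context holds the chunks our RAG engine returned
test_case = LLMTestCase(
    input="I have a persistent dry cough and shortness of breath.",
    actual_output="Your symptoms are caused by mild asthma; an inhaler will resolve them.",
    retrieval_context=[
        "A persistent dry cough with shortness of breath can indicate asthma, but may "
        "also point to GERD or cardiac conditions and warrants further testing."
    ]
)

metric = FaithfulnessMetric()
metric.measure(test_case)
print(metric.score, metric.reason)

Because the actual_output asserts a definitive cause that the retrieved chunk does not support, we would expect this test case to be penalized.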
Professionalism
Our final metric will address Criterion 3, focusing on evaluating our chatbot's professionalism.
Criterion 3: The chatbot should provide clear, respectful, and professional responses.
Since deepeval doesn't natively support this evaluation criterion, we'll need to define our own custom Professionalism metric using deepeval's custom metric framework, G-Eval, and that's OK. Defining custom metrics is nothing to be afraid of: while deepeval offers plenty of default metrics that are ready to use out-of-the-box, there are often use-case-specific definitions of a metric that require more customization.
The professionalism metric is a great example: what it means to be professional in one work setting can be drastically different from another. In our case, the custom professionalism metric we define will allow us to ensure that the chatbot maintains the professional tone typically expected in a medical setting.
G-Eval is a custom metric framework that enables users to leverage LLMs for evaluating outputs based on their own tailored evaluation criteria.
Now that we've selected our three metrics, let's see how to implement them in code.
Defining Metrics in DeepEval
To define our Answer Relevancy, Faithfulness, and custom G-Eval professionalism metrics, you'll first need to install DeepEval. Run the following command in your CLI:
pip install deepeval
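Note that deepeval's default metrics use an LLM as the judge (OpenAI models by default), so you'll also need an API key available in your environment. The variable below assumes you're sticking with the default OpenAI backend:

export OPENAI_API_KEY="your-api-key"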
Defining Default Metrics
Let's begin by defining the Answer Relevancy and Faithfulness metrics, which is as simple as importing and instantiating their respective classes.
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

answer_relevancy_metric = AnswerRelevancyMetric()
faithfulness_metric = FaithfulnessMetric()
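Both metrics also accept optional arguments if you want to adjust how they're run. For example (the specific values below are purely illustrative), you can raise the passing threshold, pin the evaluation model, and ask for an explanation alongside each score:

answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.8,        # minimum score required for the test case to pass
    model="gpt-4o",       # evaluation model; use whichever judge model your setup supports
    include_reason=True   # attach an LLM-generated explanation to each score
)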
Defining a Custom Metric
Next, we'll define our custom G-Eval metric for professionalism. This involves specifying the name of the metric, the evaluation criteria, and the parameters to evaluate. In this case, we're only assessing the LLM's actual_output.
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval
# Define criteria for evaluating professionalism
criteria = """Determine whether the actual output demonstrates professionalism by being
clear, respectful, and maintaining an empathetic tone consistent with medical interactions."""
# Create a GEval metric for professionalism
professionalism_metric = GEval(
    name="Professionalism",
    criteria=criteria,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)
G-Eval is a two-step algorithm: it first uses chain-of-thought (CoT) reasoning to generate a series of evaluation steps based on the specified criteria, then applies these steps to assess the parameters provided in an LLMTestCase and calculate the final score.
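As a quick, hypothetical sanity check (the interaction below is invented for illustration), you can measure the professionalism metric defined above on a single test case in exactly the same way as the default metrics:

from deepeval.test_case import LLMTestCase

# Hypothetical response we want to check against the professionalism criteria
test_case = LLMTestCase(
    input="I'm really scared, my chest hurts when I breathe.",
    actual_output=(
        "Chest pain when breathing can have several causes. I understand this is "
        "frightening; let's go through your symptoms together so I can help."
    )
)

professionalism_metric.measure(test_case)
print(professionalism_metric.score, professionalism_metric.reason)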
With the evaluation criteria defined and metrics selected, we can finally begin running evaluations in the following section.