Selecting Your Metrics

Once you have clearly defined evaluation criteria, selecting metrics becomes significantly easier. In some cases, you may find existing metrics in DeepEval that already match your criteria; in others, you'll need to create custom metrics to address your unique evaluation needs.

tip

DeepEval provides 14+ metrics to help you evaluate your LLM. Familiarizing yourself with these metrics can help you choose the ones that best align with your evaluation criteria.

Selecting Metrics Relevant To Your Criteria

In this section, we'll select LLM evaluation metrics for our medical chatbot based on the evaluation criteria established in the previous section. Let's quickly revisit these criteria:

  1. Directly addressing the user: The chatbot should directly address users' requests
  2. Providing accurate diagnoses: Diagnoses must be reliable and based on the provided symptoms
  3. Providing professional responses: Responses should be clear and respectful

Answer Relevancy

Let's start with our first metric, which will evaluate our medical chatbot against our first criterion:

Criterion 1: The medical chatbot should address the user directly.

Currently, our chatbot sometimes fails to directly address user queries and instead takes the lead in the conversation, for example by asking for appointment details rather than focusing on diagnosing the patient. This results in responses that only tangentially address the user's input, so we need to evaluate how relevant the chatbot's responses are to the user query.

For this, you can leverage deepeval's default AnswerRelevancyMetric, which is available out of the box and evaluates how relevant an LLM's output is to the input.

info

The AnswerRelevancyMetric uses an LLM to extract all statements from the actual_output and then classifies each statement's relevance to the input using the same LLM. You can read more about how each default metric is calculated by visiting its individual metric page.
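
To make this concrete, here is a minimal sketch of how the metric could be applied to a single chatbot interaction. The input, output, and threshold are illustrative assumptions; the metric relies on an LLM judge under the hood, so an evaluation model (for example, an OpenAI API key) must be configured.

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical interaction where the chatbot drifts toward scheduling
test_case = LLMTestCase(
    input="I've had a sore throat and a mild fever for two days. What could it be?",
    actual_output="Before we go any further, could you share your preferred appointment time?",
)

# threshold=0.7 is an assumed passing score, not a required value
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)

print(metric.score)   # relevancy score between 0 and 1
print(metric.reason)  # LLM-generated explanation for the score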

Faithfulness

Our next metric addresses the inaccuracies in patient diagnoses. The chatbot's failure to deliver accurate diagnoses in some example interactions suggests that our RAG tool needs improvement.

Criterion 2: The chatbot should provide accurate diagnoses based on the given symptoms.

The RAG engine is responsible for retrieving relevant medical information from our knowledge base to support patient diagnoses, so we need to evaluate specifically whether the information in the retrieved chunks actually aligns with the information in the actual output.

deepeval's FaithfulnessMetric is well-suited for this task. It assesses whether the actual_output factually aligns with the contents of the retrieval_context.
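
As a quick illustration, here is a minimal sketch of how the metric could be used on a single RAG interaction; the response and the retrieval_context contents are made up for this example.

from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# Hypothetical RAG interaction; the retrieved chunk is illustrative
test_case = LLMTestCase(
    input="I've had a sore throat and a mild fever for two days. What could it be?",
    actual_output="Your symptoms are consistent with a viral upper respiratory infection.",
    retrieval_context=[
        "Sore throat and low-grade fever are common symptoms of viral upper respiratory infections."
    ],
)

metric = FaithfulnessMetric()  # default threshold applies unless overridden
metric.measure(test_case)
print(metric.score, metric.reason)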

tip

deepeval offers a total of 5 RAG metrics to evaluate your RAG pipeline. To learn more about selecting the right metrics for your use case, check out this in-depth guide on RAG evaluation.

Professionalism

Our final metric will address Criterion 3, focusing on evaluating our chatbot's professionalism.

Criterion 3: The chatbot should provide clear, respectful, and professional responses.

Since deepeval doesn't natively support this evaluation criterion, we'll need to define our own custom Professionalism metric using G-Eval, deepeval's framework for custom metrics, and that's OK. Defining custom metrics is nothing to be afraid of: while deepeval offers plenty of default metrics that are ready to use out of the box, there are often use-case-specific definitions of a metric that require more customization.

The professionalism metric is a great example. What counts as professional in one work setting can be drastically different from another, and in our case, a custom professionalism metric lets us ensure that the chatbot maintains the professional tone typically expected in a medical setting.

note

G-Eval is a custom metric framework that enables users to leverage LLMs for evaluating outputs based on their own tailored evaluation criteria.

Now that we've selected our three metrics, let's see how to implement them in code.

Defining Metrics in DeepEval

To define our Answer Relevancy and Faithfulness metrics, along with our custom G-Eval metric for professionalism, you'll first need to install DeepEval. Run the following command in your CLI:

pip install deepeval

Defining Default Metrics

Let's begin by defining the Answer Relevancy and Faithfulness metrics, which is as simple as importing and instantiating their respective classes.

from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

answer_relevancy_metric = AnswerRelevancyMetric()
faithfulness_metric = FaithfulnessMetric()
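
Both metrics also accept optional constructor arguments, such as a passing threshold, the evaluation model, and whether to generate a written reason. The values below are illustrative assumptions rather than recommended settings.

# Optional arguments shown with illustrative values
answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,        # assumed passing score
    model="gpt-4o",       # assumed evaluation model
    include_reason=True   # also generate a written explanation for the score
)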

Defining a Custom Metric

Next, we'll define our custom G-Eval metric for professionalism. This involves specifying the name of the metric, the evaluation criteria, and the parameters to evaluate. In this case, we're only assessing the LLM's actual_output.

from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

# Define criteria for evaluating professionalism
criteria = """Determine whether the actual output demonstrates professionalism by being
clear, respectful, and maintaining an empathetic tone consistent with medical interactions."""

# Create a GEval metric for professionalism
professionalism_metric = GEval(
    name="Professionalism",
    criteria=criteria,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

info

G-Eval is a two-step algorithm: it first uses chain-of-thought (CoT) prompting to generate a series of evaluation steps based on the specified criteria, then applies these steps to assess the parameters provided in an LLMTestCase and calculate the final score.
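
Just like the default metrics, this custom metric can be applied to a test case by calling its measure method. The sketch below is illustrative: the interaction is made up, and only the actual_output is evaluated since that is the sole parameter we specified.

from deepeval.test_case import LLMTestCase

# Hypothetical chatbot response (illustrative only)
test_case = LLMTestCase(
    input="I've been feeling dizzy and nauseous since this morning.",
    actual_output=(
        "I'm sorry to hear you're feeling unwell. Dizziness and nausea can have "
        "several causes; could you tell me whether you've also had any headaches?"
    ),
)

professionalism_metric.measure(test_case)
print(professionalism_metric.score)   # score between 0 and 1
print(professionalism_metric.reason)  # explanation generated by the evaluation LLM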

With the evaluation criteria defined and metrics selected, we can finally begin running evaluations in the following section.