
Choosing the right Metrics for your QA Agent

To choose the right metrics, we'll need to revisit our evaluation criteria. In the previous section, we observed a few responses from our QA agent and established that all responses should be:

  1. Relevant to the user query
  2. Non-speculative (it shouldn't fabricate information when asked questions that require details not present in the knowledge base)
tip

Having clear evaluation criteria helps you easily identify the specific evaluation metrics that are relevant to your values and use case.

Choosing your Metrics

Our first criterion requires that the QA agent's answers be relevant. This makes the AnswerRelevancy metric a straightforward choice, as it directly measures how relevant an answer is with respect to the user input, and it is readily available in DeepEval.

from deepeval.metrics import AnswerRelevancyMetric
info

AnswerRelevancy and Faithfulness are RAG metrics specifically designed for evaluating RAG systems. If you're not familiar with RAG metrics, this comprehensive guide is a must-read—especially if you're building QA agents.
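To make this concrete, here's a minimal sketch of scoring a single response with this metric. The input, actual_output, and threshold values below are hypothetical placeholders, not values from this tutorial:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Hypothetical QA exchange; replace with your agent's real input and output
test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days of purchase.",
)

metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)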

To ensure that our answers are non-speculative, we'll need to make sure the QA agent only includes information from our knowledge base. Fortunately, Faithfulness measures whether the generated output factually aligns with the information in the retrieved context, which also prevents our LLM from producing speculative information.

from deepeval.metrics import FaithfulnessMetric
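Unlike answer relevancy, faithfulness is scored against the retrieved context, so the test case needs a retrieval_context field. Here's a minimal sketch; the contents are hypothetical placeholders:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# retrieval_context holds the knowledge base chunks the agent retrieved
test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days of purchase.",
    retrieval_context=["All items may be returned within 30 days of purchase."],
)

metric = FaithfulnessMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)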

It's important to note that while our evaluation criteria happen to align with RAG metrics here, this won't always be the case, even for QA agents. For example, as mentioned earlier, although not a priority, answers should sound more human. If that quality were prioritized over answer relevancy or faithfulness, defining a custom GEval metric for Humanness would take precedence.

note

G-Eval is a metric in DeepEval that allows you to create any custom metric from user-provided criteria. Learn more about GEval here.
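For illustration, a Humanness metric might look like the sketch below. The metric name and criteria wording are our own assumptions, not something prescribed by DeepEval:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Hypothetical custom metric; the criteria string is an assumption
humanness_metric = GEval(
    name="Humanness",
    criteria="Determine whether the actual output sounds natural and human-like rather than robotic.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)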

Defining these Metrics

In DeepEval, defining metrics is as easy as importing and initializing them:

from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

answer_relevancy_metric = AnswerRelevancyMetric()
faithfulness_metric = FaithfulnessMetric()
tip

DeepEval offers 20+ metrics out of the box. You can learn more about them here.
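As a quick sanity check before running full evaluations on a dataset, you can score a single test case against both metrics at once using DeepEval's evaluate function. The test case contents below are hypothetical placeholders:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days of purchase.",
    retrieval_context=["All items may be returned within 30 days of purchase."],
)

# Runs both metrics defined above against the single test case
evaluate(test_cases=[test_case], metrics=[answer_relevancy_metric, faithfulness_metric])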

With our metrics defined and our evaluation dataset pushed to Confident AI, we're ready to begin running evaluations.