Choosing the right Metrics for your QA Agent
To choose the right metrics, we'll need to revisit our evaluation criteria. In the previous section, we observed a few responses from our QA agent and established that all responses should be:
- Relevant to the user query
- Non-speculative (it shouldn't fabricate information when asked questions that require details not present in the knowledge base).
Having clear evaluation criteria makes it much easier to identify the specific evaluation metrics that are relevant to your values and use case.
Choosing your Metrics
Our first criterion requires that the QA agent's answers be relevant. This makes the AnswerRelevancy metric a straightforward choice, as it directly measures answer relevance with respect to the user input, and it is readily available in DeepEval.
from deepeval.metrics import AnswerRelevancyMetric
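For illustration, here's a minimal sketch of how the metric could be applied to a single response. The query, answer, and threshold below are hypothetical values, not part of the original example:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical query/answer pair for illustration only
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="You can reset your password from the account settings page.",
)

# threshold=0.7 is an assumed example value; the default settings work too
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```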
AnswerRelevancy and Faithfulness are RAG metrics, designed specifically for evaluating RAG systems. If you're not familiar with RAG metrics, this comprehensive guide is a must-read, especially if you're building QA agents.
To keep answers non-speculative, we'll need to ensure that the QA agent only includes information from our knowledge base. Fortunately, Faithfulness measures whether the generated output factually aligns with the information in the retrieved context, which also prevents our LLM from producing speculative information.
from deepeval.metrics import FaithfulnessMetric
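As a rough sketch, faithfulness is scored against the retrieval context attached to each test case, so the retrieved chunks need to be included. The query, answer, and context below are hypothetical:

```python
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Hypothetical example: retrieval_context holds the chunks retrieved from the knowledge base
test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="Refunds are available within 30 days of purchase.",
    retrieval_context=["Our policy allows refunds within 30 days of purchase."],
)

metric = FaithfulnessMetric(threshold=0.7)  # assumed example threshold
metric.measure(test_case)
print(metric.score, metric.reason)
```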
It's important to note that while our evaluation criteria happen to align with RAG metrics here, this is not always the case, even for QA agents. For example, as mentioned earlier, answers should also sound more human, although that isn't a priority. If humanness were prioritized over answer relevancy or faithfulness, defining a custom GEval metric for Humanness might take precedence instead.
G-Eval is a metric in DeepEval that allows you to create any custom metric using user-provided criteria. Learn more about GEval here.
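To make this concrete, here's a rough sketch of what a custom Humanness metric could look like with GEval. The criteria wording below is an assumption for illustration, not a prescribed definition:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Hypothetical "Humanness" criteria, written for illustration only
humanness_metric = GEval(
    name="Humanness",
    criteria="Determine whether the actual output sounds natural, conversational, and human-like.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
```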
Defining these Metrics
In DeepEval, defining metrics is as easy as importing and initializing them:
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

answer_relevancy_metric = AnswerRelevancyMetric()
faithfulness_metric = FaithfulnessMetric()
DeepEval offers 20+ metrics out of the box. You can learn more about them here.
With our metrics defined and our evaluation dataset pushed to Confident AI, we're ready to begin running evaluations.
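As a preview, a run could look roughly like the sketch below. The dataset alias is hypothetical, and this assumes the pulled test cases already contain your QA agent's generated outputs:

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# "QA Agent Dataset" is a hypothetical alias for the dataset pushed to Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="QA Agent Dataset")

# Assumes each pulled test case already includes the agent's actual_output
evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()],
)
```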