Introduction to RAG QA Agent Evaluation

In this tutorial, we'll show you how to set up a comprehensive RAG QA Agent evaluation pipeline in just a few minutes. While this example focuses on a QA Agent, the concepts and practices presented here apply to anyone building RAG systems.

note

Before we begin, you'll first need to log in to Confident AI, where we'll analyze our evaluation reports and build our datasets. To do so, run:

deepeval login

We'll be covering everything from generating large synthetic datasets to running evaluations on your QA agent. More specifically, you'll be learning:

How to generate a synthetic dataset from your knowledge base
How to define evaluation criteria for your QA agent
How to choose the right metrics for evaluating your QA RAG agent
How to pull your dataset to run evaluations
How to leverage deepeval to run evaluations and generate test reports
How to iterate on your QA agent's hyperparameters to improve generation quality
How to catch regressions in your system caused by hyperparameter changes

Establish the QA Agent

In this tutorial, we'll be evaluating a QA Agent designed to answer questions about MadeUpCompany, a company specializing in data analytics solutions. This QA Agent is a RAG (Retrieval-Augmented Generation) system, meaning it retrieves relevant information from a knowledge base whenever a user submits a query.

The goal of the QA Agent is to provide relevant and factually correct answers that help users better understand MadeUpCompany's products and services, and keep them satisfied.
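To make the retrieve-then-answer flow concrete, here is a minimal sketch rather than the tutorial's actual agent. The knowledge base snippets, the `retrieve` and `build_prompt` helpers, and the word-overlap scoring are all hypothetical stand-ins; a real RAG system would typically rank chunks with embeddings and a vector store, then send the assembled prompt to an LLM.

```python
import re

# Hypothetical knowledge base: invented snippets, for illustration only.
KNOWLEDGE_BASE = [
    "MadeUpCompany offers a data analytics dashboard for business teams.",
    "Customer support is available by email on weekdays.",
    "The analytics dashboard can export reports as CSV files.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k documents sharing the most words with the query.

    Word overlap is an assumed, simplistic relevance score used here only
    to keep the sketch dependency-free.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved chunks and the user query into one prompt."""
    context_block = "\n".join(f"- {chunk}" for chunk in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}"


if __name__ == "__main__":
    question = "How do I export reports from the dashboard?"
    chunks = retrieve(question, KNOWLEDGE_BASE, top_k=2)
    print(build_prompt(question, chunks))
```

The number of chunks passed to the model is controlled by `top_k`, which is why the retriever's top-k is one of the hyperparameters worth evaluating.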

info

In this tutorial, we'll focus on 3 hyperparameters. In practice, you may want to experiment with additional hyperparameters depending on the complexity of your system, but the core approach remains the same.

The 3 hyperparameters are the model, the retriever's top-k value, and the prompt template. We'll use gpt-3.5 to power our QA Agent, with a top-k value of 3 for our retriever, and the following prompt template:

prompt_template = """You are a helpful QA Agent designed to answer user questions
about a company's products and services. Your goal is to provide accurate, relevant,
and well-structured responses based on the information retrieved from the company's
knowledge base.
"""

Unlike with other LLM systems, it's much easier to build an evaluation dataset for QA Agents because a knowledge base is already available. Synthetic data generation techniques make it possible to generate a large, high-quality evaluation dataset in little time. We'll explore how to do this with DeepEval in the next section.