Using the RAG Triad for RAG evaluation
Retrieval-Augmented Generation (RAG) is a powerful way for an LLM to generate responses grounded in context beyond its training data by supplying it with external data as additional context. This supporting context comes in the form of text chunks, which are typically parsed, vectorized, and indexed in a vector database for fast retrieval at inference time, hence the name: retrieval, augmented, generation.
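To make the moving parts concrete, here is a deliberately minimal, purely illustrative sketch of the retrieve-then-generate flow. The toy knowledge base, the bag-of-words "embedding", and the generate_answer placeholder are all assumptions for demonstration only; a real pipeline would use a proper embedding model, a vector database, and an actual LLM call.

from collections import Counter

# Toy knowledge base: in practice these chunks come from parsed documents.
chunks = [
    "RAG pipelines retrieve text chunks and feed them to an LLM as context.",
    "Chunk size and top-K control how much context the retriever returns.",
    "Embedding models map text to vectors for semantic search.",
]

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a simple bag-of-words vector.
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> int:
    # Crude word-overlap score in place of cosine similarity over dense vectors.
    return sum((a & b).values())

def retrieve(query: str, top_k: int = 2) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]

def generate_answer(query: str, retrieval_context: list[str]) -> str:
    # Placeholder for the generator: here we just return the augmented prompt.
    return f"Answer using only this context:\n{chr(10).join(retrieval_context)}\n\nQuestion: {query}"

context = retrieve("What does top-K control in a RAG pipeline?")
print(generate_answer("What does top-K control in a RAG pipeline?", context))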
In a previous guide, we explored how the generator in a RAG pipeline can hallucinate despite being supplied additional context, and how the retriever can fail to retrieve the correct and relevant context needed to generate the optimal answer. This is why evaluating RAG pipelines is important, and it is where the RAG triad comes into play.
What is the RAG Triad?
The RAG triad is composed of three RAG evaluation metrics: answer relevancy, faithfulness, and contextual relevancy. If a RAG pipeline scores high on all three, we can confidently say it is using the optimal hyperparameters. This is because each metric in the RAG triad corresponds to a particular hyperparameter in the RAG pipeline, as the sketch after this list shows. For instance:
Answer relevancy: the answer relevancy metric determines how relevant the answers generated by your RAG generator are. Since modern LLMs are getting quite good at reasoning, it is mainly the prompt template hyperparameter, rather than the LLM itself, that you iterate on when working with the answer relevancy metric. More specifically, a low answer relevancy score signals that you need to improve the examples used in your prompt templates for better in-context learning, or include more fine-grained prompting for better instruction following, so that the generator produces more relevant responses.
Faithfulness: the faithfulness metric determines the extent to which the answers generated by your RAG generator are hallucinated. This concerns the LLM hyperparameter: you'll want to switch to a different LLM, or even fine-tune your own, if your LLM is unable to leverage the retrieval context supplied to it to generate grounded answers.
Info: You might also see the faithfulness metric called groundedness in other places. They are exactly the same thing, just named differently.
Contextual relevancy: the contextual relevancy metric determines whether the text chunks retrieved by your RAG retriever are relevant to producing the ideal answer for a user input. This concerns the chunk size, top-K, and embedding model hyperparameters. A good embedding model ensures you are able to retrieve text chunks that are semantically similar to the embedded user query, while a good combination of chunk size and top-K ensures you only select the most important bits of information in your knowledge base.
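Here is the small triage sketch mentioned above, showing how each metric's score points back at the hyperparameter it diagnoses. The scores and the 0.7 threshold are made-up placeholder values for illustration, not outputs from any particular evaluation run.

# Hypothetical scores from a single RAG evaluation run (placeholder values).
scores = {
    "answer_relevancy": 0.55,
    "faithfulness": 0.92,
    "contextual_relevancy": 0.48,
}
THRESHOLD = 0.7  # an arbitrary passing bar for this illustration

# Each failing metric points at the hyperparameter(s) to revisit.
diagnosis = {
    "answer_relevancy": "revise the prompt template (examples, instructions)",
    "faithfulness": "swap or fine-tune the generator LLM",
    "contextual_relevancy": "tune chunk size, top-K, or the embedding model",
}

for metric, score in scores.items():
    if score < THRESHOLD:
        print(f"{metric} failed ({score:.2f}): {diagnosis[metric]}")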
You might have noticed we didn't mention the contextual precision and contextual recall metrics. This is because contextual precision and recall require a labelled expected answer (i.e. the ideal answer to a user input), which may not be available for everyone, and it is why this guide serves as a fully reference-less RAG evaluation guide.
Using the RAG Triad in DeepEval
Using the RAG triad of metrics in deepeval is as simple as writing a few lines of code. First, create a test case to represent a user query, the retrieved text chunks, and an LLM response:
from deepeval.test_case import LLMTestCase
test_case = LLMTestCase(input="...", actual_output="...", retrieval_context=["..."])
Here, input is the user query, actual_output is the LLM-generated response, and retrieval_context is a list of strings representing the retrieved text chunks. Then, define the RAG triad metrics:
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
...
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
contextual_relevancy = ContextualRelevancyMetric()
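If you want to inspect a single metric on its own before running a full evaluation, deepeval metrics can also be executed standalone via measure(). The threshold and include_reason arguments below are optional keyword arguments, and the values shown are illustrative rather than recommendations.

# Optionally configure a metric, then run it on the test case directly.
answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,        # minimum passing score for this metric
    include_reason=True,  # ask the evaluator to explain its score
)
answer_relevancy.measure(test_case)
print(answer_relevancy.score, answer_relevancy.reason)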
You can find how these metrics are implemented and calculated on their respective documentation pages.
Lastly, evaluate your test case using these metrics:
from deepeval import evaluate
...
evaluate(test_cases=[test_case], metrics=[answer_relevancy, faithfulness, contextual_relevancy])
Congratulations 🎉! You've learnt everything you need to know about the RAG triad.
Scaling RAG Evaluation
As you scale up your RAG evaluation efforts, you can simply supply more test cases to the list of test_cases in the evaluate() function. More importantly, you can also generate synthetic datasets using deepeval to test your RAG application at scale.
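As a rough sketch of what that can look like, deepeval ships a Synthesizer for generating evaluation goldens from your own documents. The document path below is a placeholder, and the exact method names and arguments may differ between versions, so treat this as a starting point and check the synthesizer documentation for the current API.

from deepeval.synthesizer import Synthesizer
from deepeval.dataset import EvaluationDataset

# Generate synthetic goldens (inputs plus expected context) from your documents.
synthesizer = Synthesizer()
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["example.pdf"],  # placeholder path to your knowledge base
)

# Collect them into a dataset you can later turn into test cases for evaluate().
dataset = EvaluationDataset(goldens=goldens)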