Skip to main content

Faithfulness

The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the actual_output factually aligns with the contents of your retrieval_context. deepeval's faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

info

Although similar to the HallucinationMetric, the faithfulness metric in deepeval is more concerned with contradictions between the actual_output and retrieval_context in RAG pipelines, rather than hallucination in the actual LLM itself.

Required Arguments

To use the FaithfulnessMetric, you'll have to provide the following arguments when creating an LLMTestCase:

  • input
  • actual_output
  • retrieval_context

Example

from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

metric = FaithfulnessMetric(
threshold=0.7,
model="gpt-4",
include_reason=True
)
test_case = LLMTestCase(
input="What if these shoes don't fit?",
actual_output=actual_output,
retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

There are five optional parameters when creating a FaithfulnessMetric:

  • [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
  • [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4-turbo'.
  • [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
  • [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
  • [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.

How Is It Calculated?

The FaithfulnessMetric score is calculated according to the following equation:

Faithfulness=Number of Truthful ClaimsTotal Number of Claims\text{Faithfulness} = \frac{\text{Number of Truthful Claims}}{\text{Total Number of Claims}}

The FaithfulnessMetric first uses an LLM to extract all claims made in the actual_output, before using the same LLM to classify whether each claim is truthful based on the facts presented in the retrieval_context.

A claim is considered truthful if it does not contradict any facts presented in the retrieval_context.