# Multimodal Contextual Precision
The multimodal contextual precision metric measures your RAG pipeline's retriever by evaluating whether nodes in your `retrieval_context` that are relevant to the given `input` are ranked higher than irrelevant ones. `deepeval`'s multimodal contextual precision metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.

The Multimodal Contextual Precision is the multimodal adaptation of DeepEval's contextual precision metric. It accepts images in addition to text for the `input`, `retrieval_context`, and `expected_output`.
## Required Arguments

To use the `MultimodalContextualPrecisionMetric`, you'll have to provide the following arguments when creating an `MLLMTestCase`:

- `input`
- `actual_output`
- `expected_output`
- `retrieval_context`
## Example
```python
from deepeval import evaluate
from deepeval.metrics import MultimodalContextualPrecisionMetric
from deepeval.test_case import MLLMTestCase, MLLMImage

metric = MultimodalContextualPrecisionMetric()
test_case = MLLMTestCase(
    # Each field is a list that may mix text strings and MLLMImage objects
    input=["Tell me about some landmarks in France"],
    actual_output=[
        "France is home to iconic landmarks like the Eiffel Tower in Paris.",
        MLLMImage(...),
    ],
    expected_output=[
        "The Eiffel Tower is located in Paris, France.",
        MLLMImage(...),
    ],
    retrieval_context=[
        MLLMImage(...),
        "The Eiffel Tower is a wrought-iron lattice tower built in the late 19th century.",
        MLLMImage(...),
    ],
)

# Score a single test case
metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])
```
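As the example shows, `measure()` scores one test case at a time and populates the metric's `score` and `reason` attributes, while `evaluate()` accepts lists of test cases and metrics and runs them in bulk.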
There are six optional parameters when creating a `MultimodalContextualPrecisionMetric`:

- [Optional] `threshold`: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's multimodal GPT models to use, OR any custom MLLM model of type `DeepEvalBaseMLLM`. Defaulted to 'gpt-4o'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to `False`.
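As an illustration, the sketch below constructs the metric with all six parameters set explicitly. The parameter names come from the list above; the values shown are arbitrary examples, not recommended settings:

```python
from deepeval.metrics import MultimodalContextualPrecisionMetric

# Illustrative configuration; the values here are arbitrary examples.
metric = MultimodalContextualPrecisionMetric(
    threshold=0.7,        # require a score of at least 0.7 to pass
    model="gpt-4o",       # OpenAI multimodal model used as the judge
    include_reason=True,  # attach a natural-language reason to the score
    strict_mode=False,    # keep the continuous 0-1 score
    async_mode=True,      # run evaluation steps concurrently in measure()
    verbose_mode=False,   # suppress intermediate calculation logs
)
```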
## How Is It Calculated?
The `MultimodalContextualPrecisionMetric` score is calculated according to the following equation:

$$
\text{Multimodal Contextual Precision} = \frac{1}{\text{Number of Relevant Nodes}} \sum_{k=1}^{n} \left( \frac{\text{Number of Relevant Nodes Up to Position } k}{k} \right) \cdot r_k
$$

- *k* is the (i+1)th node in the `retrieval_context`
- *n* is the length of the `retrieval_context`
- *r<sub>k</sub>* is the binary relevance for the kth node in the `retrieval_context`. *r<sub>k</sub>* = 1 for nodes that are relevant, 0 if not.
The `MultimodalContextualPrecisionMetric` first uses an MLLM to determine, for each node in the `retrieval_context`, whether it is relevant to the `input` based on information in the `expected_output`, before calculating the weighted cumulative precision as the contextual precision score. The weighted cumulative precision (WCP) is used because it:

- **Emphasizes Top Results:** WCP places a stronger emphasis on the relevance of top-ranked results. This emphasis is important because MLLMs tend to give more attention to earlier nodes in the `retrieval_context` (which may cause downstream hallucination if nodes are ranked incorrectly).
- **Rewards Relevant Ordering:** WCP can handle varying degrees of relevance (e.g., "highly relevant", "somewhat relevant", "not relevant"). This is in contrast to metrics like precision, which treat all retrieved nodes as equally important (see the sketch after this list).
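To make the effect of ordering concrete, here is a minimal, self-contained sketch of weighted cumulative precision over binary relevance judgments. This is an illustration of the equation above, not `deepeval`'s internal implementation:

```python
# Illustrative sketch of the WCP equation above, NOT deepeval's internal code.
# `relevance` holds one binary judgment (1 = relevant, 0 = not) per node in
# retrieval_context, in ranked order.
def weighted_cumulative_precision(relevance: list[int]) -> float:
    relevant_total = sum(relevance)
    if relevant_total == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for k, r_k in enumerate(relevance, start=1):
        relevant_so_far += r_k
        # (relevant nodes up to position k / k) * r_k, per the equation
        score += (relevant_so_far / k) * r_k
    return score / relevant_total

print(weighted_cumulative_precision([1, 0, 1]))  # ~0.83: relevant node ranked below an irrelevant one
print(weighted_cumulative_precision([1, 1, 0]))  # 1.0: both relevant nodes ranked first
```

Note that the two rankings contain the same set of nodes; only the ordering differs, which is exactly what plain precision would fail to distinguish.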
A higher multimodal contextual precision score represents a greater ability of the retrieval system to correctly rank relevant nodes higher in the `retrieval_context`.