Text to Image
The Text to Image metric assesses the performance of image generation tasks by evaluating the quality of synthesized images based on semantic consistency and perceptual quality. deepeval's Text to Image metric is a self-explaining MLLM-Eval, meaning it outputs a reason for its metric score.
The Text to Image metric achieves scores comparable to human evaluations when GPT-4v is used as the evaluation model. This metric excels in artifact detection.
Required Arguments
To use the TextToImageMetric, you'll have to provide the following arguments when creating an MLLMTestCase:
- input
- actual_output
The input should contain exactly 0 images, and the output should contain exactly 1 image.
Example
from deepeval import evaluate
from deepeval.metrics import TextToImageMetric
from deepeval.test_case import MLLMTestCase, MLLMImage

# Replace this with your actual MLLM application output
actual_output = [MLLMImage(url="https://shoe-images.com/edited-shoes", local=False)]

metric = TextToImageMetric(
    threshold=0.7,
    include_reason=True,
)
test_case = MLLMTestCase(
    # The input is text only; the output holds exactly one image
    input=["Generate an image of a blue pair of shoes."],
    actual_output=actual_output,
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])
There are five optional parameters when creating a TextToImageMetric:
- [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
- [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- [Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to False.
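The interaction between threshold and strict_mode described above can be illustrated with a small sketch. This is not deepeval's internal code: the helper name apply_threshold and its return shape are hypothetical, and the sketch assumes that a score passes when it is greater than or equal to the threshold.

```python
from typing import Tuple


def apply_threshold(
    score: float, threshold: float = 0.5, strict_mode: bool = False
) -> Tuple[float, float, bool]:
    """Hypothetical helper: return (final_score, effective_threshold, passed)."""
    if strict_mode:
        # Strict mode overrides the configured threshold and sets it to 1 ...
        threshold = 1.0
        # ... and collapses the score to binary: 1 for perfection, 0 otherwise.
        score = 1.0 if score >= 1.0 else 0.0
    # Assumption: passing means the score reaches the threshold.
    return score, threshold, score >= threshold
```

Under these assumptions, a raw score of 0.8 passes a 0.7 threshold in the default mode, but becomes 0 and fails once strict_mode is enabled.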
How Is It Calculated?
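The TextToImageMetric score is calculated according to the following equation:
$$\text{score} = \sqrt{\min_i(\text{SC}_i) \cdot \min_j(\text{PQ}_j)}$$
Here, $\text{SC}_i$ and $\text{PQ}_j$ denote the individual Semantic Consistency and Perceptual Quality sub-scores. The TextToImageMetric score combines the Semantic Consistency (SC) and Perceptual Quality (PQ) sub-scores to provide a comprehensive evaluation of the synthesized image. The final overall score is derived by taking the square root of the product of the minimum SC and PQ scores.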
SC Scores
These scores assess aspects such as alignment with the prompt and resemblance to concepts. The minimum value among these sub-scores represents the SC score. During the SC evaluation, both the input conditions and the synthesized image are used.
PQ Scores
These scores evaluate the naturalness and absence of artifacts in the image. The minimum value among these sub-scores represents the PQ score. For the PQ evaluation, only the synthesized image is used to prevent confusion from the input conditions.
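The calculation described in this section, the square root of the product of the minimum SC sub-score and the minimum PQ sub-score, can be sketched in a few lines. This is a minimal illustration rather than deepeval's implementation: the function name text_to_image_score is hypothetical, and the sketch assumes the sub-scores have already been produced by the evaluation model and lie on a 0 to 1 scale.

```python
import math
from typing import Sequence


def text_to_image_score(sc_scores: Sequence[float], pq_scores: Sequence[float]) -> float:
    """Hypothetical helper: combine SC and PQ sub-scores into one overall score."""
    if not sc_scores or not pq_scores:
        raise ValueError("at least one SC and one PQ sub-score is required")
    sc = min(sc_scores)  # the weakest SC sub-score represents the SC score
    pq = min(pq_scores)  # the weakest PQ sub-score represents the PQ score
    # Geometric mean of the two: a low value on either side pulls the result down.
    return math.sqrt(sc * pq)


# Made-up sub-scores, for illustration only
overall = text_to_image_score(sc_scores=[0.9, 0.8], pq_scores=[0.2, 0.7])
```

With these made-up values the minimum SC is 0.8 and the minimum PQ is 0.2, so the overall score is roughly 0.4. Because the two minima are multiplied, an image that scores 0 on any single sub-score receives an overall score of 0 regardless of how it does elsewhere.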