Conversational G-Eval
The conversational G-Eval is an adapted version of `deepeval`'s popular `GEval` metric, but for evaluating entire conversations instead. It is currently the best way to define custom criteria for evaluating multi-turn conversations in `deepeval`. By defining a custom `ConversationalGEval`, you can easily determine whether your LLM chatbot consistently generates responses that are up to standard with your custom criteria throughout a conversation.
Required Arguments
To use the `ConversationalGEval` metric, you'll have to provide the following arguments when creating a `ConversationalTestCase`:

- `turns`

Additionally, each `LLMTestCase` in `turns` requires the following arguments:

- `input`
- `actual_output`

You'll also need to supply any additional arguments such as `expected_output` and `context` if your evaluation criteria depend on these parameters.
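For instance, here is a minimal sketch (with placeholder conversation content) of a `ConversationalTestCase` where one turn also carries an `expected_output`, which you would only include if your criteria or evaluation steps reference it:

```python
from deepeval.test_case import LLMTestCase, ConversationalTestCase

# Placeholder conversation: each turn pairs a user input with the chatbot's reply.
# expected_output is only needed if your criteria or evaluation_steps mention it.
convo_test_case = ConversationalTestCase(
    turns=[
        LLMTestCase(
            input="Hi, can I return this jacket?",
            actual_output="Of course! Could you share your order number, please?",
            expected_output="Ask for the order number before processing the return.",
        ),
        LLMTestCase(
            input="It's #12345.",
            actual_output="Thank you! Your return for order #12345 has been started.",
        ),
    ]
)
```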
Example
To create a custom metric that evaluates entire LLM conversations, simply instantiate a `ConversationalGEval` class and define your evaluation criteria in everyday language:
```python
from deepeval.test_case import LLMTestCase, LLMTestCaseParams, ConversationalTestCase
from deepeval.metrics import ConversationalGEval

convo_test_case = ConversationalTestCase(
    turns=[LLMTestCase(input="...", actual_output="...")]
)

professionalism_metric = ConversationalGEval(
    name="Professionalism",
    criteria="""Given the 'actual output' are generated responses from an
    LLM chatbot and 'input' are user queries to the chatbot, determine whether
    the chatbot has acted professionally throughout a conversation.""",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    # evaluation_steps=[
    #     "Check whether each LLM 'actual output' is professional with regard to the user 'input'",
    #     "Being professional means no profanity, no toxic language, and consistently saying 'please' or 'thank you'.",
    #     "Penalize heavily if exclamation marks are used in a rude manner."
    # ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

professionalism_metric.measure(convo_test_case)
print(professionalism_metric.score)
print(professionalism_metric.reason)
```
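If you'd rather evaluate in bulk (for example in CI), you can also pass the metric and test case to `deepeval`'s `evaluate()` function. The snippet below is a minimal sketch of that pattern, reusing `convo_test_case` and `professionalism_metric` from the example above:

```python
from deepeval import evaluate

# Run the metric against one or more conversational test cases in a single call;
# scores and reasons are reported for each test case when the run completes.
evaluate(test_cases=[convo_test_case], metrics=[professionalism_metric])
```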
There are three mandatory and six optional parameters required when instantiating a `ConversationalGEval` class:
- `name`: the name of the metric. This will not affect the evaluation.
- `criteria`: a description outlining the specific evaluation aspects for each test case.
- `evaluation_params`: a list of type `LLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
- [Optional] `evaluation_steps`: a list of strings outlining the exact steps the LLM should take for evaluation. If `evaluation_steps` is not provided, `ConversationalGEval` will generate a series of `evaluation_steps` on your behalf based on the provided `criteria`. You can only provide either `evaluation_steps` OR `criteria`, and not both.
- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to `False`.
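For example, here is a hypothetical instantiation (the metric name and criteria text are placeholders) that sets several of the optional parameters explicitly:

```python
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import ConversationalGEval

# The optional parameters below simply override the defaults listed above.
helpfulness_metric = ConversationalGEval(
    name="Helpfulness",
    criteria="Determine whether the chatbot's 'actual output' helpfully addresses each user 'input' across the conversation.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,      # stricter passing threshold than the 0.5 default
    model="gpt-4o",     # or any custom model of type DeepEvalBaseLLM
    async_mode=True,    # run LLM calls concurrently inside measure()
    verbose_mode=True,  # print intermediate evaluation steps to the console
)
```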
For accurate and valid results, only test case parameters that are mentioned in `criteria`/`evaluation_steps` should be included as a member of `evaluation_params`.
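For instance, if your criteria compare the chatbot's responses against a reference answer, mention 'expected output' in the criteria and include it in `evaluation_params`. The sketch below uses a hypothetical metric name:

```python
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import ConversationalGEval

# The criteria mention 'expected output', so EXPECTED_OUTPUT is listed in evaluation_params;
# parameters the criteria never reference are left out.
faithfulness_to_reference = ConversationalGEval(
    name="Faithfulness To Reference",
    criteria="Determine whether each 'actual output' stays consistent with the corresponding 'expected output' given the user 'input'.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```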
How Is It Calculated?
The `ConversationalGEval` metric is an adapted version of `GEval`, so like `GEval`, it is a two-step algorithm: it first generates a series of `evaluation_steps` using chain-of-thought (CoT) based on the given `criteria`, before using the generated `evaluation_steps` to determine the final score from the `evaluation_params` presented in the `LLMTestCase` of each turn. Unlike regular `GEval`, however, `ConversationalGEval` takes the entire conversation history into account during evaluation.
Similar to the original G-Eval paper, the `ConversationalGEval` metric uses the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation. This step was introduced in the paper to minimize bias in LLM scoring, and is automatically handled by `deepeval` (unless you're using a custom LLM).
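As a rough illustration of this weighted summation (a sketch, not `deepeval`'s internal implementation), the final score can be thought of as an expectation over the candidate scores, weighted by the probability the judge LLM assigns to each score token:

```python
# Hypothetical helper illustrating the probability-weighted scoring from the G-Eval paper.
# score_probs maps candidate scores (e.g. 1-10) to the probability of their output tokens.
def weighted_summation_score(score_probs: dict[int, float]) -> float:
    total_prob = sum(score_probs.values())
    return sum(score * prob for score, prob in score_probs.items()) / total_prob

# If the judge puts most probability mass on 8, but some on 7 and 9:
print(weighted_summation_score({7: 0.2, 8: 0.7, 9: 0.1}))  # 7.9
```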