Deep Acyclic Graph
The deep acyclic graph (DAG) metric in `deepeval` is a custom metric that lets you easily build deterministic decision trees for evaluation, using LLM-as-a-judge at each node. The `DAGMetric` is a custom, LLM-powered decision tree metric, and gives you more deterministic control than `GEval`.
Required Arguments
To use the `DAGMetric`, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`

You'll also need to supply any additional arguments such as `expected_output` and `tools_called` if your evaluation criteria depend on these parameters.
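For instance, a minimal sketch of such a test case might look like this (the values, and the `expected_output` in particular, are purely illustrative):

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize the meeting transcript.",
    actual_output="Intro: ...\nBody: ...\nConclusion: ...",
    # Only needed if your evaluation criteria reference it:
    expected_output="A summary with 'intro', 'body', and 'conclusion' headings.",
)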
Complete Walkthrough
In this walkthrough, we'll write a custom `DAGMetric` to see whether our LLM application has summarized meeting transcripts in the correct format. Let's say these are our criteria, in plain English:
- The summary of meeting transcripts should contain the "intro", "body", and "conclusion" headings.
- The summary of meeting transcripts should present the "intro", "body", and "conclusion" headings in the correct order.
Here's the example `LLMTestCase` representing the transcript to be evaluated for formatting correctness:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="""
    Alice: "Today's agenda: product update, blockers, and marketing timeline. Bob, updates?"
    Bob: "Core features are done, but we're optimizing performance for large datasets. Fixes by Friday, testing next week."
    Alice: "Charlie, does this timeline work for marketing?"
    Charlie: "We need finalized messaging by Monday."
    Alice: "Bob, can we provide a stable version by then?"
    Bob: "Yes, we'll share an early build."
    Charlie: "Great, we'll start preparing assets."
    Alice: "Plan: fixes by Friday, marketing prep Monday, sync next Wednesday. Thanks, everyone!"
    """,
    actual_output="""
    Intro:
    Alice outlined the agenda: product updates, blockers, and marketing alignment.

    Body:
    Bob reported performance issues being optimized, with fixes expected by Friday. Charlie requested finalized messaging by Monday for marketing preparation. Bob confirmed an early stable build would be ready.

    Conclusion:
    The team aligned on next steps: engineering finalizing fixes, marketing preparing content, and a follow-up sync scheduled for Wednesday.
    """,
)
Starting With G-Eval
Feel free to skip this section if you've already decided that `GEval` is not for you.

If you were to do this using `GEval`, your `evaluation_steps` might look something like this:
- The summary is completely wrong if it misses any of the headings: "intro", "body", "conclusion".
- If the summary has all the headings but they are in the wrong order, penalize it.
- If the summary has all the correct headings and they are in the right order, give it a perfect score.
Which in turn looks something like this in code:
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

metric = GEval(
    name="Format Correctness",
    evaluation_steps=[
        "The `actual_output` is completely wrong if it misses any of the headings: 'intro', 'body', 'conclusion'.",
        "If the `actual_output` has all the headings but they are in the wrong order, penalize it.",
        "If the `actual_output` has all the correct headings and they are in the right order, give it a perfect score.",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
However, this might not always give you the perfect score according to your criteria, and is not as deterministic as you might think. Instead, you can build a `DAGMetric` that gives deterministic scores based on the logic you've decided for your evaluation criteria.
Building Your Decision Tree
The `DAGMetric` requires you to first construct a decision tree that has directed edges and is acyclic in nature. Let's take this decision tree for example:
We can see that the `actual_output` of an `LLMTestCase` is first processed to extract all headings, before deciding whether they are all present. If any are missing, we give it a score of 0, heavily penalizing it, whereas if they are all there, we check the degree to which they are in the correct order. Based on this "degree of correct ordering", we can then decide what score to assign.
The `LLMTestCase` we're showing symbolizes that all nodes can access an `LLMTestCase` at any point in the DAG, but in this example only the first node, which extracts all the headings from the `actual_output`, needs the `LLMTestCase`.
We can see that our decision tree involves four types of nodes:

- `TaskNode`s: this node simply processes an `LLMTestCase` into the desired format for subsequent judgement.
- `BinaryJudgementNode`s: this node takes in a `criteria`, and outputs a verdict of `True`/`False` based on whether that criteria has been met.
- `NonBinaryJudgementNode`s: this node also takes in a `criteria`, but unlike the `BinaryJudgementNode`, it has the ability to output a verdict other than `True`/`False`.
- `VerdictNode`s: the `VerdictNode` is always a leaf node, and determines the final output score based on the evaluation path that was taken.
Putting everything into context, the `TaskNode` is the node that extracts summary headings from the `actual_output`, the `BinaryJudgementNode` is the node that determines if all headings are present, while the `NonBinaryJudgementNode` determines if they are in the correct order. The final score is determined by the four `VerdictNode`s.
Some might be skeptical whether this complexity is necessary, but in reality you'll quickly realize that the more processing you do, the more deterministic your evaluation gets. You can of course combine the correctness and ordering of the summary headings into one step, but as your criteria gets more complicated, your evaluation model is likely to hallucinate more and more.
Implementing DAG In Code
Here's how this decision tree would look in code:
from deepeval.metrics.dag import (
    DAG,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.test_case import LLMTestCaseParams

correct_order_node = NonBinaryJudgementNode(
    criteria="Are the summary headings in the correct order: 'intro' => 'body' => 'conclusion'?",
    children=[
        VerdictNode(verdict="Yes", score=10),
        VerdictNode(verdict="Two are out of order", score=4),
        VerdictNode(verdict="All out of order", score=2),
    ],
)

correct_headings_node = BinaryJudgementNode(
    criteria="Do the summary headings contain all three: 'intro', 'body', and 'conclusion'?",
    children=[VerdictNode(verdict=False, score=0), correct_order_node],
)

extract_headings_node = TaskNode(
    instructions="Extract all headings in `actual_output`",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    output_label="Summary headings",
    children=[correct_headings_node, correct_order_node],
)

# create the DAG
dag = DAG(root_node=extract_headings_node)
When creating your DAG, there are three important points to remember:
- There should only be an edge to a parent node if the current node depends on the output of the parent node.
- All nodes, except for `VerdictNode`s, can have access to an `LLMTestCase` at any point in time.
- All leaf nodes are `VerdictNode`s.
IMPORTANT: You'll see that in our example, `extract_headings_node` has `correct_order_node` as a child because `correct_order_node`'s `criteria` depends on the extracted summary headings from the `actual_output` of the `LLMTestCase`.
To make creating a `DAGMetric` easier, you should aim to start by sketching out all the criteria and the different paths your evaluation can take.
Create Your DAGMetric
Now that you have your DAG, all that's left to do is supply it when creating a `DAGMetric`:
from deepeval.metrics import DAGMetric
...
format_correctness = DAGMetric(name="Format Correctness", dag=dag)
format_correctness.measure(test_case)
print(format_correctness.score)
There are two required and six optional parameters when creating a `DAGMetric`:
- `name`: name of metric.
- `dag`: a `DeepAcyclicGraph` which represents your evaluation decision tree.
- [Optional] `threshold`: a float representing the minimum passing threshold. Defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables concurrent execution within the `measure()` method. Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the How Is It Calculated section. Defaulted to `False`.
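For instance, here's a sketch that spells out all six optional parameters at once, where the values are purely illustrative and `dag` is the decision tree built earlier:

from deepeval.metrics import DAGMetric

format_correctness = DAGMetric(
    name="Format Correctness",
    dag=dag,                # the decision tree built earlier
    threshold=0.6,          # minimum passing threshold
    model="gpt-4o",         # or any custom DeepEvalBaseLLM
    include_reason=True,    # include a reason for the score
    strict_mode=False,      # binary 0/1 scoring when True
    async_mode=True,        # concurrent execution inside measure()
    verbose_mode=True,      # print intermediate steps to console
)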
DAG Node Types
There are four node types that make up your deep acyclic graph. You'll be using these four node types to define a DAG, as follows:
from deepeval.metrics.dag import DeepAcyclicGraph
dag = DeepAcyclicGraph(root_node=...)
Here, `root_node` is of type `TaskNode`, `BinaryJudgementNode`, or `NonBinaryJudgementNode`. Let's go through each of them in more detail.
TaskNode
The `TaskNode` is designed specifically for processing data such as parameters from `LLMTestCase`s, or even an output from a parent `TaskNode`. This allows for the breakdown of text into more atomic units that are better suited for evaluation.
from typing import Optional, List
from deepeval.metrics.dag import BaseNode
from deepeval.test_case import LLMTestCaseParams
class TaskNode(BaseNode):
    instructions: str
    output_label: str
    children: List[BaseNode]
    evaluation_params: Optional[List[LLMTestCaseParams]] = None
There are three mandatory and one optional parameter when creating a `TaskNode`:
- `instructions`: a string specifying how to process parameters of an `LLMTestCase`, and/or outputs from a previous parent `TaskNode`.
- `output_label`: a string representing the final output. The children `BaseNode`s will use the `output_label` to reference the output of the current `TaskNode`.
- `children`: a list of `BaseNode`s. There must not be a `VerdictNode` in the list of children.
- [Optional] `evaluation_params`: a list of type `LLMTestCaseParams`. Include only the parameters that are relevant for processing.
For example, if you intend to break down the `actual_output` of an `LLMTestCase` into distinct sentences, the `output_label` would be something like "Extracted Sentences", which children `BaseNode`s can reference for subsequent judgement in your decision tree.
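To make this concrete, here's a minimal sketch of that sentence-extraction setup; the judgement node, its criteria, and the node names are illustrative placeholders:

from deepeval.metrics.dag import TaskNode, BinaryJudgementNode, VerdictNode
from deepeval.test_case import LLMTestCaseParams

# Illustrative child node that judges the TaskNode's output by
# referencing its output_label, "Extracted Sentences"
enough_sentences_node = BinaryJudgementNode(
    criteria="Is the number of extracted sentences greater than 3?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, score=10),
    ],
)

# Processes the `actual_output` into atomic units for judgement
extract_sentences_node = TaskNode(
    instructions="Extract all distinct sentences in the `actual_output`.",
    output_label="Extracted Sentences",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    children=[enough_sentences_node],
)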
BinaryJudgementNode
The `BinaryJudgementNode` determines whether the verdict is `True` or `False` based on the given `criteria`.
from typing import Optional, List
from deepeval.metrics.dag import BaseNode, VerdictNode
from deepeval.test_case import LLMTestCaseParams
class BinaryJudgementNode(BaseNode):
    criteria: str
    children: List[VerdictNode]
    evaluation_params: Optional[List[LLMTestCaseParams]] = None
There are two mandatory and one optional parameter when creating a `BinaryJudgementNode`:
- `criteria`: a yes/no question based on output from parent node(s) and optionally parameters from the `LLMTestCase`. You DON'T HAVE TO TELL IT to output `True` or `False`.
- `children`: a list of exactly two `VerdictNode`s, one with a `verdict` value of `True`, and the other with a value of `False`.
- [Optional] `evaluation_params`: a list of type `LLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
If you have a `TaskNode` as a parent node (which, by the way, is automatically set by `deepeval` when you supply the list of `children`), you can base your `criteria` on the output of the parent `TaskNode` by referencing its `output_label`.
For example, if the parent `TaskNode`'s `output_label` is "Extracted Sentences", you can simply set the `criteria` as: "Is the number of extracted sentences greater than 3?".
NonBinaryJudgementNode
The `NonBinaryJudgementNode` determines what the verdict is based on the given `criteria`.
from typing import Optional, List
from deepeval.metrics.dag import BaseNode, VerdictNode
from deepeval.test_case import LLMTestCaseParams
class NonBinaryJudgementNode(BaseNode):
    criteria: str
    children: List[VerdictNode]
    evaluation_params: Optional[List[LLMTestCaseParams]] = None
There are two mandatory and one optional parameter when creating a `NonBinaryJudgementNode`:
- `criteria`: an open-ended question based on output from parent node(s) and optionally parameters from the `LLMTestCase`. You DON'T HAVE TO TELL IT what to output.
- `children`: a list of `VerdictNode`s, where the `verdict` values determine the possible verdicts of the current `NonBinaryJudgementNode`.
- [Optional] `evaluation_params`: a list of type `LLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
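For instance, here's a hedged sketch where the children's `verdict` values define the full set of verdicts the node can output; the node name and criteria are illustrative:

from deepeval.metrics.dag import NonBinaryJudgementNode, VerdictNode
from deepeval.test_case import LLMTestCaseParams

# The three string verdicts below are the only verdicts this node can output
citation_quality_node = NonBinaryJudgementNode(
    criteria="How well does the `actual_output` cite its sources?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    children=[
        VerdictNode(verdict="Fully cited", score=10),
        VerdictNode(verdict="Partially cited", score=5),
        VerdictNode(verdict="Not cited", score=0),
    ],
)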
VerdictNode
The `VerdictNode` is always a leaf node and must not be the root node of your DAG. The `VerdictNode` contains no additional logic, and simply returns the determined score based on the specified verdict.
from typing import Union
from deepeval.metrics.dag import BaseNode
class VerdictNode(BaseNode):
    verdict: Union[str, bool]
    score: int
There are two mandatory parameters when creating a `VerdictNode`:

- `verdict`: a string OR boolean representing a possible outcome of the parent node. It must be a string if the parent is a `NonBinaryJudgementNode`, or a boolean if the parent is a `BinaryJudgementNode`.
- `score`: an integer between 0 and 10 that determines the final score of your `DAGMetric` based on the specified `verdict` value.
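For example, mirroring the walkthrough above (values are illustrative):

from deepeval.metrics.dag import VerdictNode

# Boolean verdicts pair with a BinaryJudgementNode parent
fail_node = VerdictNode(verdict=False, score=0)

# String verdicts pair with a NonBinaryJudgementNode parent
partial_order_node = VerdictNode(verdict="Two are out of order", score=4)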
How Is It Calculated?
The `DAGMetric` score is determined by traversing your custom decision tree in topological order, using your evaluation model along the way to perform the judgements that determine which path to take.
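If you'd like to inspect the traversal yourself, one option is to enable `verbose_mode` and run the metric through `deepeval`'s `evaluate` function. A sketch, assuming the `test_case` and `dag` from the walkthrough above:

from deepeval import evaluate
from deepeval.metrics import DAGMetric

# verbose_mode prints each node's judgement as the tree is traversed
format_correctness = DAGMetric(
    name="Format Correctness",
    dag=dag,
    verbose_mode=True,
)

evaluate(test_cases=[test_case], metrics=[format_correctness])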