Introduction
Quick Summary
`deepeval`'s `Synthesizer` offers a fast and easy way to generate high-quality goldens (inputs, expected outputs, and contexts) for your evaluation datasets in just a few lines of code. This is especially helpful if you don't have an evaluation dataset to start with.
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(...)
print(synthesizer.synthetic_goldens)
The `Synthesizer` uses an LLM to first generate a series of inputs, before evolving them to become more complex and realistic. These evolved inputs are then used to create a list of synthetic `Golden`s, which makes up your synthetic `EvaluationDataset`.
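For instance, once generation is complete, the synthetic goldens can be wrapped into an `EvaluationDataset` (covered in more detail later on this page):

```python
from deepeval.dataset import EvaluationDataset
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(document_paths=['example.pdf'])

# Collect the generated goldens into an evaluation dataset
dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
```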
`deepeval`'s `Synthesizer` uses the data evolution method to generate large volumes of data across various complexity levels, making synthetic data more realistic. This method was originally introduced by the developers of Evol-Instruct and WizardLM.
For those interested, here is a great article on how `deepeval`'s synthesizer was built.
Create Your First Synthesizer
To start generating goldens for your `EvaluationDataset`, begin by creating a `Synthesizer` object:
from deepeval.synthesizer import Synthesizer
synthesizer = Synthesizer()
There are five optional parameters when creating a `Synthesizer`:
- [Optional] `async_mode`: a boolean which, when set to `True`, enables concurrent generation of goldens. Defaulted to `True`.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use for generation, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to `gpt-4o`.
- [Optional] `filtration_config`: an instance of type `FiltrationConfig` that allows you to customize the degree to which goldens are filtered during generation. Defaulted to the default `FiltrationConfig` values.
- [Optional] `evolution_config`: an instance of type `EvolutionConfig` that allows you to customize the complexity of evolutions applied during generation. Defaulted to the default `EvolutionConfig` values.
- [Optional] `styling_config`: an instance of type `StylingConfig` that allows you to customize the styles and formats of generations. Defaulted to the default `StylingConfig` values.
The `filtration_config`, `evolution_config`, and `styling_config` parameters allow you to customize the goldens generated by your `Synthesizer`, as shown in the sketch below.
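For example, here is a sketch of a `Synthesizer` instantiated with several of these optional parameters (the specific values shown are arbitrary and purely illustrative):

```python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import FiltrationConfig, EvolutionConfig, StylingConfig

# Arbitrary example values -- tune these for your own use case
synthesizer = Synthesizer(
    async_mode=True,
    model="gpt-4o",
    filtration_config=FiltrationConfig(synthetic_input_quality_threshold=0.6),
    evolution_config=EvolutionConfig(num_evolutions=2),
    styling_config=StylingConfig(task="Answering questions about internal HR policies."),
)
```

Each of these config objects is covered in more detail in the sections below.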
Generate Your First Golden
Once you've created a `Synthesizer` object with the desired filtering parameters and models, you can begin generating goldens.
from deepeval.synthesizer import Synthesizer
...
synthesizer.generate_goldens_from_docs(
document_paths=['example.txt', 'example.docx', 'example.pdf'],
include_expected_output=True
)
print(synthesizer.synthetic_goldens)
In this example, we've used the `generate_goldens_from_docs` method, which is one of the three generation methods offered by `deepeval`'s `Synthesizer`. The three methods include:
- `generate_goldens_from_docs()`: useful for generating goldens to evaluate your LLM application based on contexts extracted from your knowledge base in the form of documents.
- `generate_goldens_from_contexts()`: useful for generating goldens to evaluate your LLM application based on a list of prepared contexts.
- `generate_goldens_from_scratch()`: useful for generating goldens to evaluate your LLM application without relying on contexts from a knowledge base.
You might have noticed that `generate_goldens_from_docs()` is a superset of `generate_goldens_from_contexts()`, and `generate_goldens_from_contexts()` is a superset of `generate_goldens_from_scratch()`.
This implies that if you want more control over context extraction, you should use `generate_goldens_from_contexts()`, but if you want `deepeval` to take care of context extraction as well, use `generate_goldens_from_docs()`.
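As a rough sketch, the other two methods can be called as follows. Note that the parameter names shown (`contexts` and `num_goldens`) are assumptions for illustration; check the method signatures in your installed version of `deepeval`:

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# Generate goldens grounded in contexts you've already prepared
# (each inner list is one context, made up of one or more text chunks)
synthesizer.generate_goldens_from_contexts(
    contexts=[
        ["The Earth revolves around the Sun.", "A full revolution takes roughly 365.25 days."],
        ["Water boils at 100°C under standard atmospheric pressure."],
    ]
)

# Generate goldens without relying on a knowledge base at all
synthesizer.generate_goldens_from_scratch(num_goldens=10)
```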
Once generation is complete, you can also convert your synthetically generated goldens into a DataFrame:
dataframe = synthesizer.to_pandas()
print(dataframe)
Here’s an example of what the resulting DataFrame might look like:
| input | actual_output | expected_output | context | retrieval_context | n_chunks_per_context | context_length | context_quality | synthetic_input_quality | evolutions | source_file |
|---|---|---|---|---|---|---|---|---|---|---|
| Who wrote the novel "1984"? | None | George Orwell | ["1984 is a dystopian novel published in 1949 by George Orwell."] | None | 1 | 60 | 0.5 | 0.6 | None | file1.txt |
| What is the boiling point of water in Celsius? | None | 100°C | ["Water boils at 100°C (212°F) under standard atmospheric pressure."] | None | 1 | 55 | 0.4 | 0.9 | None | file2.txt |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
And that's it! You now have access to a list of synthetic goldens generated using information from your knowledge base.
Save Your Synthetic Dataset
On Confident AI
To avoid losing any generated synthetic `Golden`s, you can push a dataset containing the generated goldens to Confident AI:
from deepeval.dataset import EvaluationDataset
...
dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
dataset.push(alias="My Generated Dataset")
This keeps your dataset in the cloud, where you can edit and version control it in one place. When you're ready to evaluate your LLM application using the generated goldens, simply pull the dataset from the cloud, much like you would pull a GitHub repo:
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
...
dataset = EvaluationDataset()
# Same alias as before
dataset.pull(alias="My Generated Dataset")
evaluate(dataset, metrics=[AnswerRelevancyMetric()])
Locally
Alternatively, you can use the `save_as()` method to save synthetic goldens locally:
synthesizer.save_as(
# Or 'csv'
file_type='json',
directory="./synthetic_data"
)
Customize Your Generations
`deepeval`'s `Synthesizer` generation pipeline is made up of several components, which you can easily customize to determine the quality and style of the resulting generated goldens.
You might find it useful to first learn about all the different components and steps that make up the `Synthesizer` generation pipeline.
Filtration Quality
You can customize the degree to which generated goldens are filtered away, to ensure the quality of synthetic inputs, by instantiating the `Synthesizer` with a `FiltrationConfig` instance.
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import FiltrationConfig
filtration_config = FiltrationConfig(
critic_model="gpt-4o",
synthetic_input_quality_threshold=0.5
)
synthesizer = Synthesizer(filtration_config=filtration_config)
There are three optional parameters when creating a `FiltrationConfig`:
- [Optional] `critic_model`: a string specifying which of OpenAI's GPT models to use to determine input `quality_score`s, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to `gpt-4o`.
- [Optional] `synthetic_input_quality_threshold`: a float representing the minimum quality threshold for synthetic input generation. Inputs with `quality_score`s lower than the `synthetic_input_quality_threshold` will be rejected. Defaulted to `0.5`.
- [Optional] `max_quality_retries`: an integer that specifies the number of times to retry synthetic input generation if it does not meet the required quality. Defaulted to `3`.
If the `quality_score` is still lower than the `synthetic_input_quality_threshold` after `max_quality_retries`, the golden with the highest `quality_score` will be used.
Evolution Complexity
You can customize the evolution types and depth applied by instantiating the `Synthesizer` with an `EvolutionConfig` instance. You should customize the `EvolutionConfig` to vary the complexity of the generated goldens.
from deepeval.synthesizer import Synthesizer, Evolution
from deepeval.synthesizer.config import EvolutionConfig
evolution_config = EvolutionConfig(
evolutions={
Evolution.REASONING: 1/4,
Evolution.MULTICONTEXT: 1/4,
Evolution.CONCRETIZING: 1/4,
Evolution.CONSTRAINED: 1/4
},
num_evolutions=4
)
synthesizer = Synthesizer(evolution_config=evolution_config)
There are two optional parameters when creating an `EvolutionConfig`:
- [Optional] `evolutions`: a dict with `Evolution` keys and sampling probability values, specifying the distribution of data evolutions to be used. Defaulted to all `Evolution`s with equal probability.
- [Optional] `num_evolutions`: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Defaulted to 1.
`Evolution` is an ENUM that specifies the different data evolution techniques you wish to employ to make synthetic `Golden`s more realistic. `deepeval`'s `Synthesizer` supports 7 types of evolutions, which are randomly sampled based on a defined distribution. You can apply multiple evolutions to each `Golden`, and later access the evolution sequence through the `Golden`'s additional metadata field.
from deepeval.synthesizer import Evolution
available_evolutions = {
Evolution.REASONING: 1/7,
Evolution.MULTICONTEXT: 1/7,
Evolution.CONCRETIZING: 1/7,
Evolution.CONSTRAINED: 1/7,
Evolution.COMPARATIVE: 1/7,
Evolution.HYPOTHETICAL: 1/7,
Evolution.IN_BREADTH: 1/7,
}
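Once generation is complete, you can inspect which evolutions were applied to each golden through its additional metadata. A minimal sketch, assuming the evolution sequence is stored under an `evolutions` key in `additional_metadata` (check the actual key in your installed version):

```python
...

for golden in synthesizer.synthetic_goldens:
    # Assumed key name -- inspect golden.additional_metadata to confirm
    applied_evolutions = (golden.additional_metadata or {}).get("evolutions")
    print(golden.input, applied_evolutions)
```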
Styling Options
You can customize the output style and format of any `input` and/or `expected_output` generated by instantiating the `Synthesizer` with a `StylingConfig` instance.
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig
styling_config = StylingConfig(
input_format="Questions in English that asks for data in database.",
expected_output_format="SQL query based on the given input",
task="Answering text-to-SQL-related queries by querying a database and returning the results to users"
scenario="Non-technical users trying to query a database using plain English.",
)
synthesizer = Synthesizer(styling_config=styling_config)
There are four optional parameters when creating a `StylingConfig`:
- [Optional] `input_format`: a string specifying the desired format of the generated `input`s in the synthesized goldens. Defaulted to `None`.
- [Optional] `expected_output_format`: a string specifying the desired format of the generated `expected_output`s in the synthesized goldens. Defaulted to `None`.
- [Optional] `task`: a string describing the task the LLM application you're trying to evaluate is designed to perform. Defaulted to `None`.
- [Optional] `scenario`: a string describing the scenario in which the LLM application you're trying to evaluate is used. Defaulted to `None`.
The `scenario`, `task`, `input_format`, and/or `expected_output_format` parameters, if provided at all, are used to enforce the styles and formats of any generated goldens.
How Does it Work?
`deepeval`'s `Synthesizer` generation pipeline consists of four main steps:
- Input Generation: Generate the `input`s of synthetic goldens, with or without provided contexts.
- Filtration: Filter away any initial synthetic goldens that don't meet the specified generation standards.
- Evolution: Evolve the filtered synthetic goldens to increase complexity and make them more realistic.
- Styling: Style the output formats of the `input`s and `expected_output`s of the evolved synthetic goldens.
This generation pipeline is the same for `generate_goldens_from_docs()`, `generate_goldens_from_contexts()`, and `generate_goldens_from_scratch()`.
Two steps are not mentioned above: the context construction step and the expected output generation step. The context construction step (which you can learn how it works here) happens before the initial input generation step; it isn't mentioned because it is only required if you're using the `generate_goldens_from_docs()` method.
As for the expected output generation step, it's omitted because it is a trivial one-step process that simply happens right before the final styling step.
Input Generation
In the initial input generation step, the `input`s of goldens are generated by an LLM, with or without provided contexts. Provided contexts, which can be in the form of a list of strings or a list of documents, allow generated goldens to be grounded in information present in your knowledge base.
Filtration
The position of this step might come as a surprise to many, but the filtration step happens this early in the pipeline because `deepeval` assumes that goldens that pass the initial filtration step will not degrade in quality upon further evolution and styling.
In the filtration step, the `input`s of generated goldens are subject to quality filtering. These synthetic `input`s are evaluated and assigned a quality score (0-1) by an LLM based on:

- Self-containment: The `input` is understandable and complete without needing additional external context or references.
- Clarity: The `input` clearly conveys its intent, specifying the requested information or action without ambiguity.
Any golden with a quality score below the `synthetic_input_quality_threshold` will be re-generated. If the quality score still does not meet the required `synthetic_input_quality_threshold` after the allowed `max_quality_retries`, the generation with the highest score is used. As a result, some generated `Golden`s in your final evaluation dataset may not meet the minimum input quality score, but you are guaranteed at least one golden regardless of its quality.
Click here to learn how to customize the `synthetic_input_quality_threshold` and `max_quality_retries` parameters.
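To make this retry logic concrete, here is a toy sketch of the keep-best-after-retries behavior described above. This is not `deepeval`'s actual implementation; `score_input` is a hypothetical stand-in for the LLM critic that scores self-containment and clarity:

```python
import random

def score_input(synthetic_input: str) -> float:
    # Hypothetical stand-in for the LLM critic's quality score (0-1)
    return random.random()

def generate_with_retries(generate, threshold: float = 0.5, max_retries: int = 3):
    best_input, best_score = None, -1.0
    for _ in range(max_retries + 1):
        candidate = generate()
        score = score_input(candidate)
        if score > best_score:
            best_input, best_score = candidate, score
        if score >= threshold:
            break  # quality threshold met, stop retrying
    # Even if the threshold was never met, the highest-scoring candidate is kept
    return best_input, best_score
```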
Evolution
In the evolution step, the `input`s of the filtered goldens are rewritten to be more complex and realistic, oftentimes indistinguishable from human-curated goldens. Each `input` is rewritten `num_evolutions` times, where each evolution is sampled from the `evolutions` distribution and adds an additional layer of complexity to the rewritten `input`.
Click here to learn how to customize the `evolutions` and `num_evolutions` parameters.
As an example, suppose `num_evolutions` is set to 2 and `evolutions` is a dictionary containing `Evolution.IN_BREADTH`, `Evolution.COMPARATIVE`, and `Evolution.REASONING`, with sampling probabilities of 0.4, 0.2, and 0.4, respectively. Each golden's `input` would then be evolved twice, with each evolution step sampled from that distribution (for example, an `IN_BREADTH` evolution followed by a `REASONING` evolution).
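A sketch of that configuration, using the `EvolutionConfig` parameters described earlier:

```python
from deepeval.synthesizer import Synthesizer, Evolution
from deepeval.synthesizer.config import EvolutionConfig

# Each of the 2 evolution steps samples one evolution type from this distribution
evolution_config = EvolutionConfig(
    evolutions={
        Evolution.IN_BREADTH: 0.4,
        Evolution.COMPARATIVE: 0.2,
        Evolution.REASONING: 0.4,
    },
    num_evolutions=2,
)
synthesizer = Synthesizer(evolution_config=evolution_config)
```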
Styling
This might be useful to you if, for example, you want to generate goldens in another language, or have the `expected_output`s be in SQL format for a text-to-SQL use case.
In the final styling step, the `input`s and `expected_output`s of each golden are rewritten into the desired formats and styles if required. This can be configured by setting the `scenario`, `task`, `input_format`, and `expected_output_format` parameters; `deepeval` will use what you have provided to style goldens tailored to your use case at the end of the generation pipeline, ensuring all synthetic data makes sense for you.
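For instance, here is a sketch of a `StylingConfig` for generating goldens in another language (the string values are purely illustrative):

```python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig

# Illustrative values for a Spanish-language customer support use case
styling_config = StylingConfig(
    input_format="Customer support questions written in Spanish.",
    expected_output_format="Helpful, concise answers written in Spanish.",
    task="Answering customer support questions about an e-commerce store.",
    scenario="Spanish-speaking customers asking about orders, returns, and shipping.",
)
synthesizer = Synthesizer(styling_config=styling_config)
```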
Click here to learn how to customize the format and style of the synthetic `input`s and `expected_output`s being generated.