Generate From Documents
If your application is a Retrieval-Augmented Generation (RAG) system, generating Goldens from documents can be particularly useful, especially if you already have access to the documents that make up your knowledge base. By simply providing these documents, the Synthesizer will automatically handle generating the relevant contexts needed for synthesizing test Goldens.
The only difference between the `generate_goldens_from_docs()` and `generate_goldens_from_contexts()` methods is that `generate_goldens_from_docs()` involves an additional context construction step.
Generate Your Goldens
Before you begin, you must install `chromadb==0.5.3` as an additional dependency when generating from documents. The use of a vector database allows for faster indexing and retrieval of chunks during context construction.

```bash
pip install chromadb==0.5.3
```
Then, to generate synthetic `Golden`s from documents, simply provide a list of document paths:

```python
from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)
```
There are one mandatory and three optional parameters when using the `generate_goldens_from_docs` method:

- `document_paths`: a list of strings representing the paths to the documents from which contexts will be extracted. Supported document types include `.txt`, `.docx`, and `.pdf`.
- [Optional] `include_expected_output`: a boolean which, when set to `True`, will additionally generate an `expected_output` for each synthetic `Golden`. Defaulted to `True`.
- [Optional] `max_goldens_per_context`: the maximum number of goldens to be generated per context. Defaulted to 2.
- [Optional] `context_construction_config`: an instance of type `ContextConstructionConfig` that allows you to customize the quality of contexts constructed from your documents. Defaulted to the default `ContextConstructionConfig` values.
The final maximum number of goldens to be generated is `max_goldens_per_context` multiplied by the `max_contexts_per_document` specified in the `context_construction_config`, and NOT simply `max_goldens_per_context`.
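For illustration, the arithmetic using the default values stated above works out as follows:

```python
# Upper bound on goldens generated per document, using the defaults
# documented above (max_goldens_per_context=2, max_contexts_per_document=3).
max_goldens_per_context = 2
max_contexts_per_document = 3

max_goldens_per_document = max_goldens_per_context * max_contexts_per_document
print(max_goldens_per_document)  # → 6
```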
Customize Context Construction
You can customize the quality of contexts constructed from documents by providing a `ContextConstructionConfig` instance to the `generate_goldens_from_docs()` method at generation time.
```python
from deepeval.synthesizer.config import ContextConstructionConfig
...

synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    context_construction_config=ContextConstructionConfig()
)
```
There are eight optional parameters when creating a `ContextConstructionConfig`:
- [Optional] `max_contexts_per_document`: the maximum number of contexts to be generated per document. Defaulted to 3.
- [Optional] `chunk_size`: specifies the size of text chunks (in characters) to be considered during document parsing. Defaulted to 1024.
- [Optional] `chunk_overlap`: an int that determines the overlap size between consecutive text chunks during document parsing. Defaulted to 0.
- [Optional] `context_quality_threshold`: a float representing the minimum quality threshold for context selection. If a context's quality is below this threshold, the context will be rejected. Defaulted to `0.5`.
- [Optional] `critic_model`: a string specifying which of OpenAI's GPT models to use to determine context `quality_score`s, OR any custom LLM model of type `DeepEvalBaseLLM`. Defaulted to `gpt-4o`.
- [Optional] `context_similarity_threshold`: a float representing the minimum similarity score required for context grouping. Contexts with similarity scores below this threshold will be rejected. Defaulted to `0.5`.
- [Optional] `max_retries`: an integer that specifies the number of times to retry context selection OR grouping if it does not meet the required quality OR similarity threshold. Defaulted to `3`.
- [Optional] `embedder`: a string specifying which of OpenAI's embedding models to use during document parsing and context grouping, OR any custom embedding model of type `DeepEvalBaseEmbeddingModel`. Defaulted to `'text-embedding-3-small'`.
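Putting these parameters together, a customized configuration might look like the following sketch. The parameter names are the ones documented above; the values shown are illustrative, not recommendations:

```python
from deepeval.synthesizer.config import ContextConstructionConfig

# Illustrative values only; every parameter is optional and falls back
# to the defaults documented above.
config = ContextConstructionConfig(
    max_contexts_per_document=5,
    chunk_size=512,
    chunk_overlap=64,
    context_quality_threshold=0.6,
    context_similarity_threshold=0.6,
    max_retries=2,
)

synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt'],
    context_construction_config=config,
)
```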
Unlike other customizations, where configurations to your `Synthesizer` generation pipeline are defined when instantiating a `Synthesizer`, customizing context construction happens at the generation level because context construction is unique to the `generate_goldens_from_docs()` method.
To learn how to customize all other aspects of your generation pipeline, such as output format and evolution complexity, click here.
How Does Context Construction Work?
The `generate_goldens_from_docs()` method has an additional context construction pipeline that precedes the golden generation pipeline. This is because, to generate goldens grounded in context, we first have to extract and construct groups of contexts from the provided documents.
The context construction pipeline consists of three main steps:
- Document Parsing: Split documents into smaller, manageable chunks.
- Context Selection: Select random chunks from the parsed, embedded documents.
- Context Grouping: Group chunks that are similar in semantics (using cosine similarity) to create groups of contexts that are meaningful enough for subsequent generation.
Click here to learn how to customize every parameter used in the context construction pipeline.
In summary, the documents are first split into chunks and embedded to form a collection of nodes. Random nodes are then selected, and for each selected node, similar nodes are retrieved and grouped together to create contexts. These contexts are then used to generate synthetic goldens as described in previous sections.
Document Parsing
In the initial document parsing step, each provided document is parsed using a token-based text splitter. This means the `chunk_size` and `chunk_overlap` parameters do not guarantee exact text chunk sizes. This approach ensures text chunks are meaningful and coherent, but might lead to variations in the expected size of each context.
These text chunks are then embedded by the `embedder` and stored in a vector database for subsequent selection and grouping.
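As a rough illustration of how `chunk_size` and `chunk_overlap` interact, here is a character-based splitter sketch (deepeval's actual splitter is token-based, so real chunk boundaries will differ):

```python
def split_into_chunks(text: str, chunk_size: int = 1024, chunk_overlap: int = 0) -> list[str]:
    """Greedy character-based splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share chunk_overlap
    characters of text."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Small values so the overlap is visible:
chunks = split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```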
The synthesizer will raise an error if `chunk_size` is too large to generate `n=max_contexts_per_document` unique contexts.
Context Selection
In the context selection step, random nodes are selected from the vector database that contains the previously indexed nodes. Each time a node is selected, it is subject to filtering. This is because chunked contexts can contain trivial or undesirable content, such as runs of whitespace or unwanted characters from document structures. Filtering ensures the subsequently generated goldens are meaningful, relevant, and coherent.
Each chunk is assigned a quality score (0-1) by an LLM (the `critic_model`) based on the following criteria:
- Clarity: How clear and understandable the information is.
- Depth: The level of detail and insight provided.
- Structure: How well-organized and logical the content is.
- Relevance: How closely the content relates to the main topic.
If the quality score is still lower than the `context_quality_threshold` after `max_retries`, the context with the highest quality score will be used. Although this means a context that failed the filtering process may end up being used, you are guaranteed to have a context available for grouping.
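The select-and-retry behavior described above can be sketched as follows. `score_quality` stands in for the LLM critic and is hypothetical; deepeval calls the configured `critic_model` instead:

```python
import random

def select_context(chunks, score_quality, quality_threshold=0.5, max_retries=3):
    """Pick a random chunk; retry up to max_retries times while its quality
    score is below the threshold. If every attempt fails, fall back to the
    best-scoring candidate seen, so a context is always returned."""
    best_chunk, best_score = None, -1.0
    for _ in range(max_retries):
        chunk = random.choice(chunks)
        score = score_quality(chunk)  # 0-1 score from the critic
        if score >= quality_threshold:
            return chunk
        if score > best_score:
            best_chunk, best_score = chunk, score
    return best_chunk

# Example with a toy critic that scores by length:
chunks = ["   ", "short", "a detailed, well-structured passage about the topic"]
picked = select_context(chunks, score_quality=lambda c: min(len(c) / 40, 1.0))
```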
The `critic_model` in the context construction pipeline can be different from the one used in the `FiltrationConfig` of the generation pipeline.
Context Grouping
In the final context grouping step, each previously selected node is grouped with other nodes whose cosine similarity score is higher than the `context_similarity_threshold`. This ensures that each context is coherent enough for subsequent generation to happen smoothly.
Similar to the context selection step, if the cosine similarity is still lower than the `context_similarity_threshold` after `max_retries`, the context with the highest similarity score will be used. Although this means a context that failed the filtering process may end up being used, you are guaranteed to have context groups for generation.
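Conceptually, the grouping step can be sketched with plain cosine similarity over toy embeddings (deepeval uses the configured `embedder` and the vector database instead):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def group_context(anchor, candidates, similarity_threshold=0.5):
    """Group the anchor embedding with every candidate whose cosine
    similarity to the anchor meets the threshold."""
    return [anchor] + [
        c for c in candidates
        if cosine_similarity(anchor, c) >= similarity_threshold
    ]

anchor = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0]]  # first is similar, second is orthogonal
group = group_context(anchor, candidates)
print(len(group))  # → 2 (the anchor plus the one similar candidate)
```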