
Generate From Documents

If your application is a Retrieval-Augmented Generation (RAG) system, generating Goldens from documents can be particularly useful, especially if you already have access to the documents that make up your knowledge base. By simply providing these documents, the Synthesizer will automatically handle generating the relevant contexts needed for synthesizing test Goldens.

DID YOU KNOW?

The only difference between the generate_goldens_from_docs() and generate_goldens_from_contexts() methods is that generate_goldens_from_docs() involves an additional context construction step.
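For illustration, here is how the two entry points compare (the example contexts and file name below are placeholders):

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()

# generate_goldens_from_contexts(): you supply the context groups yourself
synthesizer.generate_goldens_from_contexts(
    contexts=[
        ["Water boils at 100°C at sea level.", "Boiling point drops with altitude."],
        ["The Earth orbits the Sun.", "A year is one full orbit."],
    ]
)

# generate_goldens_from_docs(): context groups are first constructed from your documents
synthesizer.generate_goldens_from_docs(
    document_paths=["example.pdf"]
)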

Generate Your Goldens

Before you begin, you must install chromadb==0.5.3 as an additional dependency when generating from documents. The use of a vector database allows for faster indexing and retrieval of chunks during context construction.

pip install chromadb==0.5.3

Then, to generate synthetic Goldens from documents, simply provide a list of document paths:

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
)

There are one mandatory and three optional parameters when using the generate_goldens_from_docs method:

  • document_paths: a list of strings representing the paths to the documents from which contexts will be extracted. Supported document types include: .txt, .docx, and .pdf.
  • [Optional] include_expected_output: a boolean which, when set to True, will additionally generate an expected_output for each synthetic Golden. Defaulted to True.
  • [Optional] max_goldens_per_context: the maximum number of goldens to be generated per context. Defaulted to 2.
  • [Optional] context_construction_config: an instance of type ContextConstructionConfig that allows you to customize the quality of contexts constructed from your documents. Defaulted to the default ContextConstructionConfig values.
info

The final maximum number of goldens to be generated is max_goldens_per_context multiplied by the max_contexts_per_document specified in the context_construction_config, and NOT simply max_goldens_per_context.
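For example, with the default values documented above (max_goldens_per_context=2, max_contexts_per_document=3), a single document can yield up to 2 × 3 = 6 goldens:

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ContextConstructionConfig

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.pdf'],
    max_goldens_per_context=2,  # up to 2 goldens per context...
    context_construction_config=ContextConstructionConfig(
        max_contexts_per_document=3  # ...times up to 3 contexts per document = 6 goldens max
    ),
)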

Customize Context Construction

You can customize the quality of contexts constructed from documents by providing a ContextConstructionConfig instance to the generate_goldens_from_docs() method at generation time.

from deepeval.synthesizer.config import ContextConstructionConfig

...
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    context_construction_config=ContextConstructionConfig()
)

There are eight optional parameters when creating a ContextConstructionConfig:

  • [Optional] max_contexts_per_document: the maximum number of contexts to be generated per document. Defaulted to 3.
  • [Optional] chunk_size: specifies the size of text chunks (in characters) to be considered during document parsing. Defaulted to 1024.
  • [Optional] chunk_overlap: an int that determines the overlap size between consecutive text chunks during document parsing. Defaulted to 0.
  • [Optional] context_quality_threshold: a float representing the minimum quality threshold for context selection. If a context's quality score is below this threshold, the context will be rejected. Defaulted to 0.5.
  • [Optional] critic_model: a string specifying which of OpenAI's GPT models to use to determine context quality scores, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
  • [Optional] context_similarity_threshold: a float representing the minimum similarity score required for context grouping. Contexts with similarity scores below this threshold will be rejected. Defaulted to 0.5.
  • [Optional] max_retries: an integer that specifies the number of times to retry context selection OR grouping if it does not meet the required quality OR similarity threshold. Defaulted to 3.
  • [Optional] embedder: a string specifying which of OpenAI's embedding models to use during document parsing and context grouping, OR any custom embedding model of type DeepEvalBaseEmbeddingModel. Defaulted to 'text-embedding-3-small'.
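Putting these together, a fully specified configuration looks like this (the values shown are simply the documented defaults):

from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import ContextConstructionConfig

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    context_construction_config=ContextConstructionConfig(
        max_contexts_per_document=3,
        chunk_size=1024,
        chunk_overlap=0,
        context_quality_threshold=0.5,
        critic_model="gpt-4o",
        context_similarity_threshold=0.5,
        max_retries=3,
        embedder="text-embedding-3-small",
    ),
)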
caution

Unlike other customizations, where configurations to your Synthesizer generation pipeline are defined at the point of instantiating a Synthesizer, customizing context construction happens at the generation level because context construction is unique to the generate_goldens_from_docs() method.

To learn how to customize all other aspects of your generation pipeline, such as output format and evolution complexity, click here.

How Does Context Construction Work?

The generate_goldens_from_docs() method has an additional context construction pipeline that precedes the goldens generation pipeline. This is because to generate goldens grounded in context, we first have to extract and construct groups of contexts found in the provided documents.

The context construction pipeline consists of three main steps:

  • Document Parsing: Split documents into smaller, manageable chunks.
  • Context Selection: Select random chunks from the parsed, embedded documents.
  • Context Grouping: Group chunks that are similar in semantics (using cosine similarity) to create groups of contexts that are meaningful enough for subsequent generation.

To learn how to customize every parameter used in the context construction pipeline, click here.

info

In summary, the documents are first split into chunks and embedded to form a collection of nodes. Random nodes are then selected, and for each selected node, similar nodes are retrieved and grouped together to create contexts. These contexts are then used to generate synthetic goldens as described in previous sections.

Document Parsing

In the initial document parsing step, each provided document is parsed using a token-based text splitter. This means the chunk_size and chunk_overlap parameters do not guarantee exact text chunk sizes. This approach ensures text chunks are meaningful and coherent, but might lead to variations in the expected size of each context.

These text chunks are then embedded by the embedder and stored in a vector database for subsequent selection and grouping.
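As a rough illustration of what token-based splitting means in practice, here is a minimal sketch using tiktoken (this is not DeepEval's actual splitter, just the general idea):

import tiktoken

def token_chunks(text: str, chunk_size: int = 1024, chunk_overlap: int = 0) -> list[str]:
    # Split on token boundaries rather than character counts, so decoded
    # chunks vary slightly in character length but stay coherent.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - chunk_overlap
    return [
        enc.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), step)
    ]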

caution

The synthesizer will raise an error if chunk_size is too large to generate n=max_contexts_per_document unique contexts. For example, a 2,000-token document split with chunk_size=1024 and chunk_overlap=0 yields only two chunks, which cannot produce the default three unique contexts.

Context Selection

In the context selection step, random nodes are selected from the vector database containing the previously indexed nodes. Each selected node is subject to filtering: chunking can produce trivial or undesirable content, such as runs of whitespace or unwanted characters from document structures, so filtering is important to ensure subsequently generated goldens are meaningful, relevant, and coherent.

Each chunk is given a quality score (0-1) by an LLM (the critic_model) based on the following criteria:

  • Clarity: How clear and understandable the information is.
  • Depth: The level of detail and insight provided.
  • Structure: How well-organized and logical the content is.
  • Relevance: How closely the content relates to the main topic.

If the quality score is still lower than the context_quality_threshold after max_retries, the context with the highest quality score so far will be used. This means you might find contexts that failed the filtering process being used, but you are guaranteed to have a context available for grouping.
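The selection logic is roughly equivalent to the following sketch; the score_quality callable stands in for the critic_model's LLM call and is hypothetical:

import random
from typing import Callable

def select_context(
    chunks: list[str],
    score_quality: Callable[[str], float],  # hypothetical stand-in for the critic_model's 0-1 quality score
    quality_threshold: float = 0.5,
    max_retries: int = 3,
) -> str:
    # Track the best-scoring candidate seen so far, for the fallback case.
    best_chunk, best_score = None, -1.0
    for _ in range(max_retries):
        candidate = random.choice(chunks)
        score = score_quality(candidate)
        if score >= quality_threshold:
            return candidate  # passed the quality filter
        if score > best_score:
            best_chunk, best_score = candidate, score
    # After max_retries, fall back to the highest-scoring candidate.
    return best_chunk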

note

The critic_model in the context construction pipeline can be different from the one used in the FiltrationConfig of the generation pipeline.

Context Grouping

In the final context grouping step, each previously selected node is grouped with other nodes whose cosine similarity score is higher than the context_similarity_threshold. This ensures that each context is coherent, so that subsequent generation happens smoothly.

Similar to the context selection step, if the cosine similarity is still lower than the context_similarity_threshold after max_retries, the context with the highest similarity score will be used. This means you might find contexts that failed the filtering process being used, but you are guaranteed to have context groups available for generation.
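Cosine similarity between two embedding vectors is their dot product divided by the product of their norms. A grouping step built on it might look like this sketch (embeddings are assumed to come from the configured embedder; this is not DeepEval's actual implementation):

from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def group_context(
    anchor: list[float],
    others: list[list[float]],
    similarity_threshold: float = 0.5,
) -> list[list[float]]:
    # Keep only the nodes similar enough to the selected (anchor) node.
    return [e for e in others if cosine_similarity(anchor, e) >= similarity_threshold]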
