# Using Custom LLMs for Evaluation
All of `deepeval`'s metrics use LLMs for evaluation and currently default to OpenAI's GPT models. However, for users who don't wish to use OpenAI's GPT models and would instead prefer other providers such as Claude (Anthropic), Gemini (Google), Llama-3 (Meta), or Mistral, `deepeval` provides an easy way to use literally ANY custom LLM for evaluation.

This guide will show you how to create custom LLMs for evaluation in `deepeval`, and demonstrate various methods to enforce the valid JSON LLM outputs that are required for evaluation, using the following examples:
- Llama-3 8B from Hugging Face `transformers`
- Mistral-7B v0.3 from Hugging Face `transformers`
- Gemini 1.5 Flash from Vertex AI
- Claude-3 Opus from Anthropic
## Creating A Custom LLM

Here's a quick example of a custom Llama-3 8B model being used for evaluation in `deepeval`:
```python
import transformers
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

from deepeval.models import DeepEvalBaseLLM

class CustomLlama3_8B(DeepEvalBaseLLM):
    def __init__(self):
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
        )

        model_4bit = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct",
            device_map="auto",
            quantization_config=quantization_config,
        )
        tokenizer = AutoTokenizer.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct"
        )

        self.model = model_4bit
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        return pipeline(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Llama-3 8B"
```
There are SIX rules to follow when creating a custom LLM evaluation model (a bare-bones skeleton is shown after this list):

1. Inherit `DeepEvalBaseLLM`.
2. Implement the `get_model_name()` method, which simply returns a string representing your custom model name.
3. Implement the `load_model()` method, which is responsible for returning a model object.
4. Implement the `generate()` method with one and only one parameter of type string that acts as the prompt to your custom LLM.
5. The `generate()` method should return the generated string output from your custom LLM. Note that we called `pipeline(prompt)` to access the model generations in this particular example, but this could differ depending on the implementation of your custom model object.
6. Implement the `a_generate()` method, with the same function signature as `generate()`. Note that this is an async method. In this example, we called `self.generate(prompt)`, which simply reuses the synchronous `generate()` method. Although optional, you should implement an asynchronous version (if possible) to speed up evaluation.
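To make the six rules concrete, here is a minimal, provider-agnostic sketch that satisfies all of them. The `EchoModel` name and its trivial echo behavior are placeholders for illustration only; in practice you would call your own model client inside `generate()`:

```python
from deepeval.models import DeepEvalBaseLLM

class EchoModel(DeepEvalBaseLLM):  # Rule 1: inherit DeepEvalBaseLLM
    """Placeholder model that simply echoes the prompt back (illustration only)."""

    def load_model(self):
        # Rule 3: return a model object (here, a trivial stand-in callable)
        return lambda prompt: f"Echo: {prompt}"

    def generate(self, prompt: str) -> str:
        # Rules 4 & 5: accept a single string prompt, return a string
        model = self.load_model()
        return model(prompt)

    async def a_generate(self, prompt: str) -> str:
        # Rule 6: async version, here just reusing the sync method
        return self.generate(prompt)

    def get_model_name(self):
        # Rule 2: return a string representing the model name
        return "Echo Model"
```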
In later sections, you'll see an exception to rules 4 and 5, as the `generate()` and `a_generate()` methods can actually be rewritten to optimize custom LLM outputs that are essential for evaluation.

Then, instantiate the `CustomLlama3_8B` class and test the `generate()` (or `a_generate()`) method out:
```python
...

custom_llm = CustomLlama3_8B()
print(custom_llm.generate("Write me a joke"))
```
Finally, supply it to a metric to run evaluations using your custom LLM:
```python
from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=custom_llm)
metric.measure(...)
```
Congratulations 🎉! You can now evaluate using any custom LLM of your choice on all LLM evaluation metrics offered by `deepeval`.
## JSON Confinement for Custom LLMs

This section is also highly applicable if you're looking to benchmark your own LLM, as open-source LLMs often require JSON output confinement to produce valid answers for the public benchmarks supported by `deepeval`.
In the previous section, we learned how to create a custom LLM, but if you've ever used custom LLMs for evaluation in `deepeval`, you may have encountered the following error:
```
ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.
```
This error arises when the custom LLM used for evaluation is unable to generate valid JSONs during metric calculation, which stops the evaluation process altogether. For smaller and less powerful LLMs, prompt engineering alone is not sufficient to enforce JSON outputs, which happens to be the method used in `deepeval`'s metrics. As a result, it's vital to find a workaround for users not using OpenAI's GPT models for evaluation.
All of `deepeval`'s metrics require the evaluation model to generate valid JSONs to extract properties such as reasons, verdicts, statements, and other types of LLM-generated responses that are later used for calculating metric scores. When the generated JSONs required to extract these properties are invalid (e.g. missing brackets, incomplete string quotations, extra trailing commas, or mismatched keys), `deepeval` won't be able to use the necessary information required for metric calculation. Here's an example of an invalid JSON an open-source model like `mistralai/Mistral-7B-Instruct-v0.3` might output:
```json
{
    "reaso: "The actual output does not directly address the input",
}
```
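To see why a single malformed output like this halts a metric, note that it can't even be parsed by Python's standard `json` module. The snippet below is purely illustrative of the kind of parsing failure `deepeval` runs into before it raises the `ValueError` shown above:

```python
import json

# The same malformed output shown above (missing quote, trailing comma)
invalid_output = """{
    "reaso: "The actual output does not directly address the input",
}"""

try:
    json.loads(invalid_output)
except json.JSONDecodeError as e:
    # This is roughly the failure that surfaces as the ValueError above,
    # stopping the metric calculation altogether
    print(f"Invalid JSON: {e}")
```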
## Rewriting the `generate()` and `a_generate()` Method Signatures
In the previous section, we saw how the `generate()` and `a_generate()` methods must accept one argument of type `str` and return the corresponding LLM-generated `str`. To enforce JSON outputs generated by your custom LLM, the first step is to rewrite the `generate()` and `a_generate()` methods to accept an additional argument of type `BaseModel`, and to output a `BaseModel` instead of a `str`.

The `BaseModel` type is provided by the `pydantic` library, which is an extremely common typing library in Python.
```python
from pydantic import BaseModel
```
Continuing from the `CustomLlama3_8B` example, here is what the method signatures for the new `generate()` and `a_generate()` methods should look like:
```python
from pydantic import BaseModel

class CustomLlama3_8B(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        pass

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)
```
You might be wondering: how does changing the method signature help with enforcing JSON outputs?

It helps because in `deepeval`'s metrics, when there is a `schema: BaseModel` argument defined for the `generate()` and/or `a_generate()` method, `deepeval` will inject your generate methods with Pydantic schemas, which you can leverage to enforce JSON outputs. Let's see how we can do that.
## Reimplementing the `generate()` and `a_generate()` Methods
With the new method signatures, `deepeval` will now automatically inject your custom LLM with the required Pydantic schemas, which you can leverage to enforce JSON outputs for each LLM generation.
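The exact schemas `deepeval` injects are internal to its metrics, but conceptually they are ordinary Pydantic models. The `Statements` schema below is a hypothetical stand-in used purely to illustrate what your `generate()` method receives and what it is expected to return:

```python
from typing import List
from pydantic import BaseModel

# Hypothetical stand-in for a schema deepeval might inject into generate()
class Statements(BaseModel):
    statements: List[str]

# Your generate() method can read the JSON schema to constrain decoding...
print(Statements.schema())

# ...and must return an instance of that schema, e.g. built from parsed JSON
parsed = {"statements": ["The output addresses the input."]}
print(Statements(**parsed))
```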
There are many ways to leverage Pydantic schemas to confine LLMs to generate valid JSONs. Continuing with our `CustomLlama3_8B` example, we will be using the `lm-format-enforcer` library to confine JSON outputs using the provided Pydantic schema.
```bash
pip install lm-format-enforcer
```
```python
import json

import transformers
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)

from deepeval.models import DeepEvalBaseLLM

class CustomLlama3_8B(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Same as the previous example above
        model = self.load_model()
        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Create parser required for JSON confinement using lmformatenforcer
        parser = JsonSchemaParser(schema.schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            pipeline.tokenizer, parser
        )

        # Output and load valid JSON
        output_dict = pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        output = output_dict[0]["generated_text"][len(prompt):]
        json_result = json.loads(output)

        # Return valid JSON object according to the schema DeepEval supplied
        return schema(**json_result)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)
```
We're calling `self.generate(prompt, schema)` in the `a_generate()` method to keep things simple, but you should aim to implement a truly asynchronous version of your custom LLM and enforce JSON outputs the same way you would in the `generate()` method to keep evaluations fast (see the sketch below).
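If your model client has no native async API, one option (shown here as a sketch, not as part of `deepeval` itself) is to offload the blocking `generate()` call to a worker thread so the event loop can still run evaluations concurrently:

```python
import asyncio

from pydantic import BaseModel
from deepeval.models import DeepEvalBaseLLM

class CustomLlama3_8B(DeepEvalBaseLLM):
    ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Run the blocking, synchronous generate() in a worker thread so the
        # event loop isn't blocked during evaluation (requires Python 3.9+)
        return await asyncio.to_thread(self.generate, prompt, schema)
```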
Now, try running metrics with the new `generate()` and `a_generate()` methods:
```python
from deepeval.metrics import AnswerRelevancyMetric
...

custom_llm = CustomLlama3_8B()
metric = AnswerRelevancyMetric(model=custom_llm)
metric.measure(...)
```
Congratulations 🎉! You can now evaluate using any custom LLM of your choice on all LLM evaluation metrics offered by `deepeval`, without JSON errors (hopefully).

In the next section, we'll go through two JSON confinement libraries that cover a wide range of LLM interfaces.
## JSON Confinement Libraries

There are two JSON confinement libraries that you should know about, depending on the custom LLM you're using:

- `lm-format-enforcer`: The LM-Format-Enforcer is a versatile library designed to standardize the output formats of language models. It supports Python-based language models across various platforms, including popular frameworks such as `transformers`, `langchain`, `llamaindex`, llama.cpp, vLLM, Haystack, NVIDIA TensorRT-LLM, and ExLlamaV2. For comprehensive details about the package and advanced usage instructions, please visit the LM-Format-Enforcer GitHub page. The LM-Format-Enforcer combines a character-level parser with a tokenizer prefix tree. Unlike other libraries that strictly enforce output formats, this method enables LLMs to sequentially generate tokens that meet output format constraints, thereby enhancing the quality of the output.
- `instructor`: Instructor is a user-friendly Python library built on top of Pydantic. It enables straightforward confinement of your LLM's output by encapsulating your LLM client within an Instructor method. It simplifies the process of extracting structured data, such as JSON, from LLMs including GPT-3.5, GPT-4, GPT-4-Vision, and open-source models like Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python. For more information on advanced usage or integration with other models not covered here, please consult the documentation.
You may wish to use any JSON confinement library out there; we're just suggesting two that we have found useful when crafting this guide.

In the final section, we'll show several popular end-to-end examples of custom LLMs using either `lm-format-enforcer` or `instructor` for JSON confinement.
## More Examples

### Mistral-7B-Instruct-v0.3 through `transformers`

Begin by installing the `lm-format-enforcer` package:
```bash
pip install lm-format-enforcer
```
Here's a full example of a JSON-confined custom Mistral 7B model implemented through `transformers`:
```python
import json

import torch
import transformers
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

from deepeval.models import DeepEvalBaseLLM

class CustomMistral7B(DeepEvalBaseLLM):
    def __init__(self):
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
        )

        model_4bit = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.3",
            device_map="auto",
            quantization_config=quantization_config,
        )
        tokenizer = AutoTokenizer.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.3"
        )

        self.model = model_4bit
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        model = self.load_model()
        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Create parser required for JSON confinement using lmformatenforcer
        parser = JsonSchemaParser(schema.schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            pipeline.tokenizer, parser
        )

        # Output and load valid JSON
        output_dict = pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        output = output_dict[0]["generated_text"][len(prompt):]
        json_result = json.loads(output)

        # Return valid JSON object according to the schema DeepEval supplied
        return schema(**json_result)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Mistral-7B v0.3"
```
As with the `CustomLlama3_8B` example, you can:

- pass in a `quantization_config` parameter if your compute resources are limited
- use the `lm-format-enforcer` library for JSON confinement

This is because the `CustomMistral7B` model is implemented through Hugging Face `transformers` as well.
### `gemini-1.5-flash` through Vertex AI

Begin by installing the `instructor` package via pip:
```bash
pip install instructor
```
```python
from pydantic import BaseModel
import google.generativeai as genai
import instructor

from deepeval.models import DeepEvalBaseLLM

class CustomGeminiFlash(DeepEvalBaseLLM):
    def __init__(self):
        self.model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        client = self.load_model()
        instructor_client = instructor.from_gemini(
            client=client,
            mode=instructor.Mode.GEMINI_JSON,
        )
        resp = instructor_client.messages.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return resp

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Gemini 1.5 Flash"
```
The `instructor` client lets you create a structured response by defining a `response_model` parameter, which accepts a Pydantic `BaseModel` schema.
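To sanity-check the implementation outside of a metric, you can call `generate()` directly with a schema of your own. The `Joke` schema and prompt below are made up purely for this test:

```python
from pydantic import BaseModel

# Hypothetical schema used only to test the custom model directly
class Joke(BaseModel):
    setup: str
    punchline: str

custom_llm = CustomGeminiFlash()
joke = custom_llm.generate("Tell me a joke about evals.", schema=Joke)
print(joke.setup, joke.punchline)
```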
### `claude-3-opus` through Anthropic

Begin by installing the `instructor` package via pip:
```bash
pip install instructor
```
```python
import instructor
from pydantic import BaseModel
from anthropic import Anthropic

from deepeval.models import DeepEvalBaseLLM

class CustomClaudeOpus(DeepEvalBaseLLM):
    def __init__(self):
        self.model = Anthropic()

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        client = self.load_model()
        instructor_client = instructor.from_anthropic(client)
        resp = instructor_client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return resp

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Claude-3 Opus"
```
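As with every other example in this guide, the finished model is simply supplied to a metric through the `model` parameter, and `deepeval` handles the rest:

```python
from deepeval.metrics import AnswerRelevancyMetric
...

custom_llm = CustomClaudeOpus()
metric = AnswerRelevancyMetric(model=custom_llm)
metric.measure(...)
```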
### Others

For any additional implementations, please come and ask away in the DeepEval Discord server; we'll be happy to have you.