Using Custom LLMs for Evaluation

All of deepeval's metrics use LLMs for evaluation, and they currently default to OpenAI's GPT models. However, for users who don't wish to use OpenAI's GPT models and would instead prefer other providers such as Claude (Anthropic), Gemini (Google), Llama-3 (Meta), or Mistral, deepeval provides an easy way for anyone to use literally ANY custom LLM for evaluation.

This guide will show you how to create custom LLMs for evaluation in deepeval, and demonstrate various methods to enforce the valid JSON outputs required for evaluation, using the following examples:

  • Llama-3 8B from Hugging Face transformers
  • Mistral-7B v0.3 from Hugging Face transformers
  • Gemini 1.5 Flash from Vertex AI
  • Claude-3 Opus from Anthropic

Creating A Custom LLM

Here's a quick example of a custom Llama-3 8B model being used for evaluation in deepeval:

import transformers
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

from deepeval.models import DeepEvalBaseLLM


class CustomLlama3_8B(DeepEvalBaseLLM):
    def __init__(self):
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
        )

        model_4bit = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct",
            device_map="auto",
            quantization_config=quantization_config,
        )
        tokenizer = AutoTokenizer.from_pretrained(
            "meta-llama/Meta-Llama-3-8B-Instruct"
        )

        self.model = model_4bit
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        return pipeline(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Llama-3 8B"

There are SIX rules to follow when creating a custom LLM evaluation model:

  1. Inherit DeepEvalBaseLLM.
  2. Implement the get_model_name() method, which simply returns a string representing your custom model name.
  3. Implement the load_model() method, which will be responsible for returning a model object.
  4. Implement the generate() method with one and only one parameter of type string that acts as the prompt to your custom LLM.
  5. The generate() method should return the generated string output from your custom LLM. Note that we called pipeline(prompt) to access the model generations in this particular example, but this could be different depending on the implementation of your custom model object.
  6. Implement the a_generate() method, with the same function signature as generate(). Note that this is an async method. In this example, we called self.generate(prompt), which simply reuses the synchronous generate() method. However, you should implement a truly asynchronous version where possible to speed up evaluation.
caution

In later sections, you'll see an exception to rules 4. and 5., as the generate() and a_generate() methods can actually be rewritten to optimize custom LLM outputs that are essential for evaluation.

Then, instantiate the CustomLlama3_8B class and test the generate() (or a_generate()) method out:

...

custom_llm = CustomLlama3_8B()
print(custom_llm.generate("Write me a joke"))

Finally, supply it to a metric to run evaluations using your custom LLM:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=custom_llm)
metric.measure(...)
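
If you'd like to see this end to end, here's a minimal sketch of measuring a metric against a single test case using the custom model (the test case contents below are just placeholders):

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

custom_llm = CustomLlama3_8B()
metric = AnswerRelevancyMetric(model=custom_llm)

# Placeholder test case -- swap in your own application's input and output
test_case = LLMTestCase(
    input="What are your store's opening hours?",
    actual_output="We're open from 9am to 5pm, Monday through Friday.",
)

metric.measure(test_case)
print(metric.score, metric.reason)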

Congratulations 🎉! You can now evaluate using any custom LLM of your choice on all LLM evaluation metrics offered by deepeval.

JSON Confinement for Custom LLMs

tip

This section is also highly applicable if you're looking to benchmark your own LLM, as open-source LLMs often require JSON or output confinement to produce valid answers for the public benchmarks supported by deepeval.
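
For example, once your custom model confines its outputs (covered below), it can be plugged straight into a deepeval benchmark. The snippet below is a rough sketch; the exact benchmark classes, tasks, and parameters depend on the deepeval version you have installed:

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# Assumes CustomLlama3_8B already confines its outputs to valid answers
custom_llm = CustomLlama3_8B()

benchmark = MMLU(tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE], n_shots=3)
benchmark.evaluate(model=custom_llm)
print(benchmark.overall_score)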

In the previous section, we learnt how to create a custom LLM, but if you've ever used custom LLMs for evaluation in deepeval, you may have encountered the following error:

ValueError: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.

This error arises when the custom LLM used for evaluation is unable to generate valid JSON during metric calculation, which stops the evaluation process altogether. This happens because, for smaller and less powerful LLMs, prompt engineering alone (which is the approach deepeval's metrics rely on) is not sufficient to enforce JSON outputs. As a result, it's vital to have a workaround for users not using OpenAI's GPT models for evaluation.

info

All of deepeval's metrics require the evaluation model to generate valid JSONs in order to extract properties such as reasons, verdicts, statements, and other types of LLM-generated responses that are later used for calculating metric scores. When the generated JSONs required to extract these properties are invalid (e.g. missing brackets, incomplete string quotations, extra trailing commas, or mismatched keys), deepeval won't be able to access the information required for metric calculation. Here's an example of an invalid JSON an open-source model like mistralai/Mistral-7B-Instruct-v0.3 might output:

{
"reaso: "The actual output does directly not address the input",
}
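
For context, "valid" here means the output must parse cleanly into the Pydantic schema deepeval expects for that metric. As a purely illustrative sketch (the Reason model below is hypothetical, not deepeval's internal schema), a conforming generation would look like this:

import json
from pydantic import BaseModel

# Hypothetical schema, for illustration only -- deepeval defines its own
# internal schemas for each metric
class Reason(BaseModel):
    reason: str

# A valid generation parses cleanly into the schema
valid_output = '{"reason": "The actual output does not directly address the input"}'
reason = Reason(**json.loads(valid_output))
print(reason.reason)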

Rewriting the generate() and a_generate() Method Signatures

In the previous section, we saw how the generate() and a_generate() methods must accept one argument of type str and return the corresponding LLM generated str. To enforce JSON outputs generated by your custom LLM, the first step is to rewrite the generate() and a_generate() method to accept an additional argument of type BaseModel, and output a BaseModel instead of a str.

note

The BaseModel type is a type provided by the pydantic library, which is an extremely common typing library in Python.

from pydantic import BaseModel

Continuing from the CustomLlama3_8B example, here is what the method signature for the new generate() and a_generate() methods should look like:

from pydantic import BaseModel

class CustomLlama3_8B(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        pass

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

You might be wondering, how does changing the method signature help with enforcing JSON outputs?

It helps because, when a schema: BaseModel argument is defined for the generate() and/or a_generate() method, deepeval will pass the required Pydantic schema into your generate methods, which you can leverage to enforce JSON outputs. Let's see how we can do that.

Reimplementing the generate() and a_generate() Methods

With the new method signatures, deepeval will now automatically inject your custom LLM with the required Pydantic schemas, which you can leverage to enforce JSON outputs for each LLM generation.

There are many ways to leverage Pydantic schemas to confine LLMs to generating valid JSONs. Continuing with our CustomLlama3_8B example, we'll use the lm-format-enforcer library to confine JSON outputs using the provided Pydantic schema.

pip install lm-format-enforcer

import json
import transformers
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
build_transformers_prefix_allowed_tokens_fn,
)

from deepeval.models import DeepEvalBaseLLM


class CustomLlama3_8B(DeepEvalBaseLLM):
    ...

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Same as the previous example above
        model = self.load_model()
        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Create parser required for JSON confinement using lmformatenforcer
        parser = JsonSchemaParser(schema.schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            pipeline.tokenizer, parser
        )

        # Output and load valid JSON
        output_dict = pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        output = output_dict[0]["generated_text"][len(prompt) :]
        json_result = json.loads(output)

        # Return valid JSON object according to the schema DeepEval supplied
        return schema(**json_result)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

tip

We're calling self.generate(prompt, schema) in the a_generate() method to keep things simple, but you should aim to implement an asynchronous version of your custom LLM implementation and enforce JSON outputs the same way you would in the generate() method to keep evaluations fast.
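
If your model object doesn't expose a native async API, one lightweight option (a sketch, not the only approach) is to offload the synchronous generate() call to a worker thread so concurrent metric evaluations don't block each other:

import asyncio

class CustomLlama3_8B(DeepEvalBaseLLM):
    ...

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        # Run the blocking transformers pipeline in a thread (Python 3.9+)
        # so the event loop can schedule other evaluations concurrently
        return await asyncio.to_thread(self.generate, prompt, schema)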

Now, try running metrics with the new generate() and a_generate() methods:

from deepeval.metrics import AnswerRelevancyMetric
...

custom_llm = CustomLlama3_8B()
metric = AnswerRelevancyMetric(model=custom_llm)
metric.measure(...)
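
You can also run the metric over multiple test cases at once. Here's a sketch using deepeval's evaluate() function with a placeholder test case:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

custom_llm = CustomLlama3_8B()
metric = AnswerRelevancyMetric(model=custom_llm)

# Placeholder test case -- replace with your own inputs and outputs
test_case = LLMTestCase(
    input="Why did the chicken cross the road?",
    actual_output="To get to the other side.",
)

evaluate(test_cases=[test_case], metrics=[metric])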

Congratulations 🎉! You can now evaluate using any custom LLM of your choice on all LLM evaluation metrics offered by deepeval, without JSON errors (hopefully).

In the next section, we'll go through two JSON confinement libraries that cover a wide range of LLM interfaces.

JSON Confinement libraries

There are two JSON confinement libraries that you should know about depending on the custom LLM you're using:

  1. lm-format-enforcer: The LM-Format-Enforcer is a versatile library designed to standardize the output formats of language models. It supports Python-based language models across various platforms, including popular frameworks such as transformers, langchain, llamaindex, llama.cpp, vLLM, Haystack, NVIDIA TensorRT-LLM, and ExLlamaV2. For comprehensive details about the package and advanced usage instructions, please visit the lm-format-enforcer GitHub page. The LM-Format-Enforcer combines a character-level parser with a tokenizer prefix tree. Unlike other libraries that strictly enforce output formats, this method enables LLMs to sequentially generate tokens that meet output format constraints, thereby enhancing the quality of the output.

  2. instructor: Instructor is a user-friendly Python library built on top of Pydantic. It enables straightforward confinement of your LLM's output by encapsulating your LLM client within an Instructor method. It simplifies the process of extracting structured data, such as JSON, from LLMs including GPT-3.5, GPT-4, GPT-4-Vision, and open-source models like Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python. For more information on advanced usage or integration with other models not covered here, please consult the documentation.

note

You may wish to use any JSON confinement library out there; we're just suggesting two that we found useful when crafting this guide.

In the final section, we'll show several popular end-to-end examples of custom LLMs using either lm-format-enforcer or instructor for JSON confinement.

More Examples

Mistral-7B-Instruct-v0.3 through transformers

Begin by installing the lm-format-enforcer package:

pip install lm-format-enforcer

Here's a full example of a JSON confined custom Mistral 7B model implemented through transformers:

import json
import torch
import transformers
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

from deepeval.models import DeepEvalBaseLLM


class CustomMistral7B(DeepEvalBaseLLM):
    def __init__(self):
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
        )

        model_4bit = AutoModelForCausalLM.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.3",
            device_map="auto",
            quantization_config=quantization_config,
        )
        tokenizer = AutoTokenizer.from_pretrained(
            "mistralai/Mistral-7B-Instruct-v0.3"
        )

        self.model = model_4bit
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        model = self.load_model()
        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=self.tokenizer,
            use_cache=True,
            device_map="auto",
            max_length=2500,
            do_sample=True,
            top_k=5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Create parser required for JSON confinement using lmformatenforcer
        parser = JsonSchemaParser(schema.schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            pipeline.tokenizer, parser
        )

        # Output and load valid JSON
        output_dict = pipeline(prompt, prefix_allowed_tokens_fn=prefix_function)
        output = output_dict[0]["generated_text"][len(prompt) :]
        json_result = json.loads(output)

        # Return valid JSON object according to the schema DeepEval supplied
        return schema(**json_result)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Mistral-7B v0.3"
note

Similar to the CustomLlama3_8B example, you can:

  • pass in a quantization_config parameter if your compute resources are limited
  • use the lm-format-enforcer library for JSON confinement

This is because the CustomMistral7B model is implemented through HF transformers as well.

gemini-1.5-flash through Vertex AI

Begin by installing the instructor package via pip:

pip install instructor

from pydantic import BaseModel
import google.generativeai as genai
import instructor

from deepeval.models import DeepEvalBaseLLM


class CustomGeminiFlash(DeepEvalBaseLLM):
    def __init__(self):
        self.model = genai.GenerativeModel(model_name="models/gemini-1.5-flash")

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        client = self.load_model()
        instructor_client = instructor.from_gemini(
            client=client,
            mode=instructor.Mode.GEMINI_JSON,
        )
        resp = instructor_client.messages.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return resp

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Gemini 1.5 Flash"
info

The instructor client allows you to create a structured response by defining a response_model parameter, which accepts a Pydantic BaseModel schema.

claude-3-opus through Anthropic

Begin by installing the instructor package via pip:

pip install instructor

import instructor
from pydantic import BaseModel
from anthropic import Anthropic

from deepeval.models import DeepEvalBaseLLM


class CustomClaudeOpus(DeepEvalBaseLLM):
    def __init__(self):
        self.model = Anthropic()

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        client = self.load_model()
        instructor_client = instructor.from_anthropic(client)
        resp = instructor_client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            response_model=schema,
        )
        return resp

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Claude-3 Opus"

Others

For any additional implementations, please come and ask away in the DeepEval Discord server; we'll be happy to have you.