Metrics

Quick Summary

In deepeval, a metric serves as a standard of measurement for evaluating the performance of an LLM output based on a specific criterion of interest. Essentially, while the metric acts as the ruler, a test case represents the thing you're trying to measure. deepeval offers a range of default metrics for you to quickly get started with, such as:

  • G-Eval
  • Summarization
  • Faithfulness
  • Answer Relevancy
  • Contextual Relevancy
  • Contextual Precision
  • Contextual Recall
  • Ragas
  • Hallucination
  • Toxicity
  • Bias

deepeval also offers you a straightforward way to develop your own custom evaluation metrics. All metrics are measured on a test case. Visit the test cases section to learn how to apply any metric on test cases for evaluation.

Types of Metrics

A custom metric is a type of metric you can easily create by implementing abstract methods and properties of base classes provided by deepeval. They are extremely versatile and seamlessly integrate with Confident AI without requiring any additional setup. As you'll see later, a custom metric can either be an LLM-Eval (LLM-evaluated) or a classic metric. A classic metric is a type of metric whose criteria aren't evaluated using an LLM.
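
To make this concrete, here is a rough sketch of what a classic (non-LLM) custom metric could look like. Everything below is illustrative: SimpleTestCase and LengthRatioMetric are hypothetical names, and a real implementation would inherit deepeval's actual base classes as described in the custom metrics documentation.

```python
# Illustrative sketch only: a "classic" metric whose criteria are computed
# without any LLM. The interface mimics deepeval's conventions but is a
# simplified stand-in, not deepeval source code.

class SimpleTestCase:
    """Minimal stand-in for an LLM test case."""
    def __init__(self, input: str, actual_output: str):
        self.input = input
        self.actual_output = actual_output

class LengthRatioMetric:
    """A classic metric: scores how concise the output is relative to the input."""
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score = None

    def measure(self, test_case: SimpleTestCase) -> float:
        # Score in [0, 1]: 1.0 when the output is no longer than the input.
        ratio = len(test_case.input) / max(len(test_case.actual_output), 1)
        self.score = min(ratio, 1.0)
        return self.score

    def is_successful(self) -> bool:
        # Mirrors deepeval's convention: success when score >= threshold.
        return self.score is not None and self.score >= self.threshold

metric = LengthRatioMetric(threshold=0.5)
tc = SimpleTestCase(input="What is 2 + 2?", actual_output="4")
print(metric.measure(tc))      # 1.0 (output is shorter than the input)
print(metric.is_successful())  # True
```

The same measure/score/is_successful shape applies whether the criteria are computed classically, as here, or by prompting an LLM.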

deepeval also offers default metrics. Most default metrics offered by deepeval are LLM-Evals, which means they are evaluated using LLMs. This is deliberate because LLM-Evals are versatile in nature and align better with human expectations when compared to traditional model-based approaches.

deepeval's LLM-Evals are a step up from other implementations because they:

  • are extra reliable as LLMs are only used for extremely specific tasks during evaluation to greatly reduce stochasticity and flakiness in scores.
  • provide a comprehensive reason for the scores computed.

All of deepeval's default metrics output a score between 0 and 1, and require a threshold argument to instantiate. A default metric is only successful if the evaluation score is equal to or greater than threshold.

info

All GPT models from OpenAI are available for LLM-Evals (metrics that use LLMs for evaluation). You can switch between models by providing a string corresponding to OpenAI's model names via the optional model argument when instantiating an LLM-Eval.

Using OpenAI

To use OpenAI for deepeval's LLM-Evals (metrics evaluated using an LLM), supply your OPENAI_API_KEY in the CLI:

export OPENAI_API_KEY=<your-openai-api-key>

Alternatively, if you're working in a notebook environment (Jupyter or Colab), set your OPENAI_API_KEY in a cell:

%env OPENAI_API_KEY=<your-openai-api-key>

note

Please do not include quotation marks when setting your OPENAI_API_KEY if you're working in a notebook environment.

Using Azure OpenAI

deepeval also allows you to use Azure OpenAI for metrics that are evaluated using an LLM. Run the following command in the CLI to configure your deepeval environment to use Azure OpenAI for all LLM-based metrics.

deepeval set-azure-openai --openai-endpoint=<endpoint> \
--openai-api-key=<api_key> \
--deployment-name=<deployment_name> \
--openai-api-version=<openai_api_version> \
--model-version=<model_version>

Note that the model-version is optional. If you ever wish to stop using Azure OpenAI and move back to regular OpenAI, simply run:

deepeval unset-azure-openai

Using a Custom LLM

deepeval allows you to use ANY custom LLM for evaluation. This includes LLMs from langchain's chat_model module, Hugging Face's transformers library, or even LLMs in GGML format.

caution

We CANNOT guarantee that evaluations will work as expected when using a custom model. This is because evaluation requires high levels of reasoning and the ability to follow instructions such as outputting responses in valid JSON formats.

If you find yourself running into JSON errors and would like to ignore them, use the -i and -c flags during deepeval test run:

deepeval test run test_example.py -i -c

The -i flag ignores errors while the -c flag utilizes the local deepeval cache, so for a partially successful test run you don't have to rerun test cases that didn't error.

Azure OpenAI Example

Here is an example of creating a custom Azure OpenAI model through langchain's AzureChatOpenAI module for evaluation:

from langchain_openai import AzureChatOpenAI
from deepeval.models.base_model import DeepEvalBaseLLM

class AzureOpenAI(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom Azure OpenAI Model"

# Replace these with real values
custom_model = AzureChatOpenAI(
    openai_api_version=openai_api_version,
    azure_deployment=azure_deployment,
    azure_endpoint=azure_endpoint,
    openai_api_key=openai_api_key,
)
azure_openai = AzureOpenAI(model=custom_model)
print(azure_openai.generate("Write me a joke"))

When creating a custom LLM evaluation model you should ALWAYS:

  • inherit DeepEvalBaseLLM.
  • implement the get_model_name() method, which simply returns a string representing your custom model name.
  • implement the load_model() method, which will be responsible for returning a model object.
  • implement the generate() method with one and only one parameter of type string that acts as the prompt to your custom LLM. It should return the final output string of your custom LLM. Note that we called chat_model.invoke(prompt).content to access the model generations in this particular example, but this could be different depending on the implementation of your custom model object.
  • implement the a_generate() method, with the same function signature as generate(). Note that this is an async method. In this example, we called await chat_model.ainvoke(prompt), which is an asynchronous wrapper provided by LangChain's chat models.

tip

The a_generate() method is what deepeval uses to generate LLM outputs when you execute metrics / run evaluations asynchronously.

If your custom model object does not have an asynchronous interface, simply reuse the same code from generate() (scroll down to the Mistral7B example for more details). However, this would make a_generate() a blocking process, regardless of whether you've turned on async_mode for a metric or not.

Lastly, to use it as the evaluation model for an LLM-Eval:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=azure_openai)

note

While the Azure OpenAI command configures deepeval to use Azure OpenAI globally for all LLM-Evals, a custom LLM has to be set each time you instantiate a metric. Remember to provide your custom LLM instance through the model parameter for metrics you wish to use it for.

Mistral 7B Example

Here is an example of creating a custom Mistral 7B model through Hugging Face's transformers library for evaluation:

from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM

class Mistral7B(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        model = self.load_model()

        device = "cuda"  # the device to load the model onto

        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(device)
        model.to(device)

        generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
        return self.tokenizer.batch_decode(generated_ids)[0]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "Mistral 7B"

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

mistral_7b = Mistral7B(model=model, tokenizer=tokenizer)
print(mistral_7b.generate("Write me a joke"))

Note that for this particular implementation, we initialized our Mistral7B model with an additional tokenizer parameter, as this is required in the decoding step of the generate() method.

info

You'll notice we simply reused generate() in a_generate(), since unfortunately there's no asynchronous interface for Hugging Face's transformers library. This makes all metric executions a synchronous, blocking process.

However, you can try offloading the generation process to a separate thread instead:

import asyncio

class Mistral7B(DeepEvalBaseLLM):
    # ... (existing code) ...

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

Some additional considerations and reasons why you should be extra careful with this implementation:

  • Running the generation in a separate thread may not fully utilize GPU resources if the model is GPU-based.
  • There could be potential performance implications of frequently switching between threads.
  • You'd need to ensure thread safety if multiple async generations are happening concurrently and sharing resources.

Lastly, to use your custom Mistral7B model for evaluation:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=mistral_7b)

tip

You need to specify the custom evaluation model you created via the model argument when creating a metric.

Google VertexAI Example

Here is an example of creating a custom Google Gemini model through langchain's ChatVertexAI module for evaluation:

from langchain_google_vertexai import (
    ChatVertexAI,
    HarmBlockThreshold,
    HarmCategory,
)
from deepeval.models.base_model import DeepEvalBaseLLM

class GoogleVertexAI(DeepEvalBaseLLM):
    """Class to implement Vertex AI for DeepEval"""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Vertex AI Model"

# Initialize safety filters for the Vertex model
# This is important to ensure no evaluation responses are blocked
safety_settings = {
    HarmCategory.HARM_CATEGORY_UNSPECIFIED: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
}

# TODO: Add values for project and location below
custom_model_gemini = ChatVertexAI(
    model_name="gemini-1.0-pro-002",
    safety_settings=safety_settings,
    project="<project-id>",
    location="<region>",  # example: us-central1
)

# Initialize the wrapper class
vertexai_gemini = GoogleVertexAI(model=custom_model_gemini)
print(vertexai_gemini.generate("Write me a joke"))

To use it as the evaluation model for an LLM-Eval:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=vertexai_gemini)

AWS Bedrock Example

Here is an example of creating a custom AWS Bedrock model through the langchain_community.chat_models module for evaluation:

from langchain_community.chat_models import BedrockChat
from deepeval.models.base_model import DeepEvalBaseLLM

class AWSBedrock(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom AWS Bedrock Model"

# Replace these with real values
custom_model = BedrockChat(
    credentials_profile_name=<your-profile-name>,  # e.g. "default"
    region_name=<your-region-name>,  # e.g. "us-east-1"
    endpoint_url=<your-bedrock-endpoint>,  # e.g. "https://bedrock-runtime.us-east-1.amazonaws.com"
    model_id=<your-model-id>,  # e.g. "anthropic.claude-v2"
    model_kwargs={"temperature": 0.4},
)

aws_bedrock = AWSBedrock(model=custom_model)
print(aws_bedrock.generate("Write me a joke"))

Finally, supply the newly created aws_bedrock model to LLM-Evals:

from deepeval.metrics import AnswerRelevancyMetric
...

metric = AnswerRelevancyMetric(model=aws_bedrock)

Measuring a Metric

All metrics in deepeval, including custom metrics that you create:

  • can be executed via the metric.measure() method
  • can have their score accessed via metric.score, which ranges from 0 to 1
  • can have their score reason accessed via metric.reason
  • can have their status accessed via metric.is_successful()
  • can be used to evaluate test cases or entire datasets, with or without Pytest
  • have a threshold that acts as the cutoff for success: metric.is_successful() is only true if metric.score is above/below threshold, depending on the metric
  • have a strict_mode property, which enforces a binary metric.score of either 0 or 1
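
The strict_mode behavior can be sketched in plain Python. The helper below is purely illustrative of the convention described above, not deepeval's implementation:

```python
def apply_strict_mode(score: float, threshold: float) -> float:
    """Collapse a continuous 0-1 score to a binary one, as strict_mode does."""
    return 1.0 if score >= threshold else 0.0

print(apply_strict_mode(0.72, threshold=0.5))  # 1.0: passing scores become 1
print(apply_strict_mode(0.31, threshold=0.5))  # 0.0: failing scores become 0
```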

In addition, all metrics in deepeval execute asynchronously by default. You can configure this behavior via the async_mode parameter when instantiating a metric.

tip

Visit an individual metric page to learn how they are calculated, and what is required when creating an LLMTestCase in order to execute it.

Here's a quick example.

export OPENAI_API_KEY=<your-openai-api-key>

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Initialize a test case
test_case = LLMTestCase(
    input="...",
    actual_output="...",
    retrieval_context=["..."]
)

# Initialize metric with threshold
metric = AnswerRelevancyMetric(threshold=0.5)

Using this metric, you can either execute it directly as a standalone to get its score and reason:

...

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Or you can assert a test case using assert_test() via deepeval test run:

test_file.py
from deepeval import assert_test
...

def test_answer_relevancy():
    assert_test(test_case, [metric])

deepeval test run test_file.py

Or using the evaluate function:

from deepeval import evaluate
...

evaluate([test_case], [metric])

Measuring Metrics in Async

When a metric's async_mode=True (which is the default value for all metrics), invocations of metric.measure() will execute its internal algorithms concurrently. However, it's important to note that while operations INSIDE measure() execute concurrently, the metric.measure() call itself still blocks the main thread.

info

Let's take the FaithfulnessMetric algorithm for example:

  1. Extract all factual claims made in the actual_output
  2. Extract all factual truths found in the retrieval_context
  3. Compare extracted claims and truths to generate a final score and reason.

from deepeval.metrics import FaithfulnessMetric
...

metric = FaithfulnessMetric(async_mode=True)
metric.measure(test_case)
print("Metric finished!")

When async_mode=True, steps 1 and 2 execute concurrently (i.e. at the same time) since they are independent of each other, while async_mode=False causes steps 1 and 2 to execute sequentially instead (i.e. one after the other).

In both cases, "Metric finished!" will wait for metric.measure() to finish running before printing, but setting async_mode to True would make the print statement appear earlier, as async_mode=True allows metric.measure() to run faster.
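
The timing difference can be sketched with plain asyncio. The extract_claims and extract_truths coroutines below are hypothetical stand-ins for a metric's independent internal steps, not deepeval APIs:

```python
import asyncio
import time

# Hypothetical stand-ins for two independent internal steps of a metric.
async def extract_claims():
    await asyncio.sleep(0.2)  # simulate an LLM call
    return ["claim"]

async def extract_truths():
    await asyncio.sleep(0.2)  # simulate an LLM call
    return ["truth"]

async def measure_concurrent():
    # async_mode=True analogue: independent steps run at the same time.
    return await asyncio.gather(extract_claims(), extract_truths())

async def measure_sequential():
    # async_mode=False analogue: steps run one after the other.
    claims = await extract_claims()
    truths = await extract_truths()
    return claims, truths

start = time.perf_counter()
asyncio.run(measure_concurrent())
concurrent_time = time.perf_counter() - start  # roughly 0.2s

start = time.perf_counter()
asyncio.run(measure_sequential())
sequential_time = time.perf_counter() - start  # roughly 0.4s

print(f"concurrent: {concurrent_time:.2f}s, sequential: {sequential_time:.2f}s")
```

Either way, the caller waits for both steps to finish before continuing, which mirrors why metric.measure() still blocks the main thread even in async mode.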

To measure multiple metrics at once and NOT block the main thread, use the asynchronous a_measure() method instead.

import asyncio
...

# Remember to use async
async def long_running_function():
    # These will all run at the same time
    await asyncio.gather(
        metric1.a_measure(test_case),
        metric2.a_measure(test_case),
        metric3.a_measure(test_case),
        metric4.a_measure(test_case)
    )
    print("Metrics finished!")

asyncio.run(long_running_function())