Custom Metrics

deepeval allows you to implement your own evaluator (for example, your own GPT evaluator) by creating a custom metric. All custom metrics are automatically integrated with the deepeval ecosystem, which includes Confident AI.

Required Arguments

To use a custom metric, you'll have to provide the following arguments when creating an LLMTestCase:

  • input
  • actual_output

You'll also need to supply any additional arguments, such as expected_output and context, if your custom metric's measure() method depends on these parameters.
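For instance, if your metric compares actual_output against a target answer, the test case must include expected_output. Here's a minimal sketch (the values are placeholders for illustration):

from deepeval.test_case import LLMTestCase

# Placeholder values; supply whichever extra arguments your
# metric's measure() method reads from the test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris",
    context=["France is a country in Western Europe. Its capital is Paris."],
)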

Implementation

To create a custom metric, you'll need to inherit deepeval's BaseMetric class and implement abstract methods and properties such as measure(), is_successful(), and __name__. Here's an example:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

# Inherit BaseMetric
class LatencyMetric(BaseMetric):
    # By default, this metric fails if latency exceeds 10 seconds
    def __init__(self, max_seconds: float = 10):
        self.threshold = max_seconds

    def measure(self, test_case: LLMTestCase):
        # Set self.success and self.score in the "measure" method
        self.success = test_case.latency <= self.threshold
        if self.success:
            self.score = 1
        else:
            self.score = 0
            # You can also optionally set a reason for the score returned.
            # This is particularly useful for a score computed using LLMs
            self.reason = "Too slow!"
        return self.score

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Latency"

Notice that a few things have happened:

  • self.threshold was set in __init__(), and this can be either a minimum or maximum threshold
  • self.success, self.score, and self.reason were set in measure()
  • measure() takes in an LLMTestCase
  • is_successful() simply returns the success status
  • __name__ is a property that returns a string representing the metric name

To avoid unexpected errors, we recommend setting each class attribute in the appropriate method as outlined above. Also note that self.reason is optional: it should be a string explaining the rationale behind an LLM-computed score, which only applies if you're using an LLM as the evaluator in measure() and have implemented a way to generate a scoring rationale.
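For example, here is a rough sketch of an LLM-evaluated custom metric that sets self.reason. The generate_score_and_reason() helper is a hypothetical stand-in for whatever call your evaluator makes to an LLM:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

def generate_score_and_reason(actual: str, expected: str):
    # Hypothetical stand-in for an LLM call: in practice, you would
    # prompt a model to grade `actual` against `expected` and return
    # a score between 0 and 1 plus a short rationale string
    if actual.strip() == expected.strip():
        return 1.0, "Output matches the expected answer exactly."
    return 0.0, "Output does not match the expected answer."

class CorrectnessMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase):
        # Both the score and the rationale come from the evaluator
        self.score, self.reason = generate_score_and_reason(
            test_case.actual_output, test_case.expected_output
        )
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Correctness"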

After creating a custom metric, you can use it in the same way as all of deepeval's metrics:

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
...

# Note that we pass in latency since the measure method requires it for evaluation
latency_metric = LatencyMetric(max_seconds=10.0)
test_case = LLMTestCase(input="...", actual_output="...", latency=8.3)
evaluate([test_case], [latency_metric])
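Since measure() is just a method on your class, you can also run the metric as a standalone for a quick sanity check, without going through evaluate():

# Reusing the metric and test case from above
latency_metric.measure(test_case)
print(latency_metric.score)            # 1
print(latency_metric.is_successful())  # True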