Building Custom Metrics

In deepeval, anyone can easily build their own custom LLM evaluation metric that is automatically integrated with deepeval's ecosystem, which includes:

  • Running your custom metric in CI/CD pipelines.
  • Taking advantage of deepeval's capabilities such as metric caching and multi-processing.
  • Having custom metric results automatically sent to Confident AI.

Here are a few reasons why you might want to build your own LLM evaluation metric:

  • You want greater control over the evaluation criteria used (and you think GEval is insufficient).
  • You don't want to use an LLM for evaluation (since all metrics in deepeval are powered by LLMs).
  • You wish to combine several deepeval metrics (e.g., it makes a lot of sense to have a metric that checks for both answer relevancy and faithfulness).
info

There are many ways one can implement an LLM evaluation metric. Here is a great article on everything you need to know about scoring LLM evaluation metrics.

Rules To Follow When Creating A Custom Metric

1. Inherit the BaseMetric class

To begin, create a class that inherits from deepeval's BaseMetric class:

from deepeval.metrics import BaseMetric

class CustomMetric(BaseMetric):
    ...

This is important because the BaseMetric class is what allows deepeval to recognize your custom metric during evaluation.

2. Implement the __init__() method

The BaseMetric class gives your custom metric a few properties that you can configure and that will be displayed post-evaluation, either locally or on Confident AI.

An example is the threshold property, which determines whether the LLMTestCase being evaluated has passed or not. Although the threshold property is all you need to make a custom metric functional, here are some additional properties for those who want even more customizability:

  • evaluation_model: a str specifying the name of the evaluation model used.
  • include_reason: a bool specifying whether to include a reason alongside the metric score. This won't be needed if you don't plan on using an LLM for evaluation.
  • strict_mode: a bool specifying whether to pass the metric only if there is a perfect score.
  • async_mode: a bool specifying whether to execute the metric asynchronously.
tip

Don't read too much into the advanced properties for now; we'll go over how they can be useful in later sections of this guide.

The __init__() method is a great place to set these properties:

from typing import Optional

from deepeval.metrics import BaseMetric

class CustomMetric(BaseMetric):
    def __init__(
        self,
        threshold: float = 0.5,
        # Optional
        evaluation_model: Optional[str] = None,
        include_reason: bool = True,
        strict_mode: bool = True,
        async_mode: bool = True
    ):
        self.threshold = threshold
        # Optional
        self.evaluation_model = evaluation_model
        self.include_reason = include_reason
        self.strict_mode = strict_mode
        self.async_mode = async_mode

3. Implement the measure() and a_measure() methods

The measure() and a_measure() methods are where all the evaluation happens. In deepeval, evaluation is the process of applying a metric to an LLMTestCase to generate a score, and optionally a reason for the score (if you're using an LLM), based on the scoring algorithm.

The a_measure() method is simply the asynchronous implementation of the measure() method, and so they should both use the same scoring algorithm.

info

The a_measure() method allows deepeval to run your custom metric asynchronously. Take the assert_test function for example:

from deepeval import assert_test

def test_multiple_metrics():
    ...
    assert_test(test_case, [metric1, metric2], run_async=True)

When you run assert_test() with run_async=True (which is the default behavior), deepeval calls the a_measure() method which allows all metrics to run concurrently in a non-blocking way.

Both measure() and a_measure() MUST:

  • accept an LLMTestCase as argument
  • set self.score
  • set self.success

You can also optionally set self.reason in the measure methods (if you're using an LLM for evaluation), or wrap everything in a try block to catch any exceptions and record them on self.error. Here's a hypothetical example:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    def measure(self, test_case: LLMTestCase) -> float:
        # Although not required, we recommend catching errors
        # in a try block
        try:
            self.score = generate_hypothetical_score(test_case)
            if self.include_reason:
                self.reason = generate_hypothetical_reason(test_case)
            self.success = self.score >= self.threshold
            return self.score
        except Exception as e:
            # Set metric error and re-raise it
            self.error = str(e)
            raise

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Although not required, we recommend catching errors
        # in a try block
        try:
            self.score = await async_generate_hypothetical_score(test_case)
            if self.include_reason:
                self.reason = await async_generate_hypothetical_reason(test_case)
            self.success = self.score >= self.threshold
            return self.score
        except Exception as e:
            # Set metric error and re-raise it
            self.error = str(e)
            raise
tip

Oftentimes, the blocking part of an LLM evaluation metric stems from the API calls made to your LLM provider (such as OpenAI's API endpoints), so ultimately you'll have to ensure that LLM inference can indeed be made asynchronous.
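For example, if your scoring algorithm calls OpenAI, a sync/async pair of generation helpers might look like the following minimal sketch (assuming the openai v1 SDK; the helper names, model, and prompt are hypothetical placeholders, not part of deepeval):

from openai import OpenAI, AsyncOpenAI

# Hypothetical helpers for illustration only
client = OpenAI()
async_client = AsyncOpenAI()

PROMPT = "Return only a score between 0 and 1 for this output: {output}"

def generate_hypothetical_score(test_case) -> float:
    # Blocking call to the chat completions endpoint
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(output=test_case.actual_output)}],
    )
    return float(response.choices[0].message.content)

async def async_generate_hypothetical_score(test_case) -> float:
    # Non-blocking call, so a_measure() won't block other metrics
    response = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(output=test_case.actual_output)}],
    )
    return float(response.choices[0].message.content)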

If you've explored all your options and realize there is no asynchronous implementation of your LLM call (e.g., if you're using an open-source model from Hugging Face's transformers library), simply reuse the measure() method in a_measure():

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

You can also click here to find an example of offloading LLM inference to a separate thread as a workaround, although it might not work for all use cases.
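A minimal sketch of that thread-offloading workaround (assuming Python 3.9+ for asyncio.to_thread) could look like this:

import asyncio

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Offload the blocking measure() call to a worker thread so the
        # event loop can keep evaluating other metrics concurrently
        return await asyncio.to_thread(self.measure, test_case)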

4. Implement the is_successful() method

Under the hood, deepeval calls the is_successful() method to determine the status of your metric for a given LLMTestCase. We recommend copying and pasting the code below directly as your is_successful() implementation:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        return self.success

5. Name Your Custom Metric

Probably the easiest step; all that's left is to name your custom metric:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CustomMetric(BaseMetric):
    ...

    @property
    def __name__(self):
        return "My Custom Metric"

Congratulations 🎉! You've just learnt how to build a custom metric that is 100% integrated with deepeval's ecosystem. In the following sections, we'll go through a few real-life examples.

Building a Custom Non-LLM Eval

An LLM-Eval is an LLM evaluation metric that is scored using an LLM, and so a non-LLM eval is simply a metric that is not scored using an LLM. In this example, we'll demonstrate how to use the ROUGE score instead:

from deepeval.scorer import Scorer
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class RougeMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.scorer = Scorer()

    def measure(self, test_case: LLMTestCase):
        self.score = self.scorer.rouge_score(
            prediction=test_case.actual_output,
            target=test_case.expected_output,
            score_type="rouge1"
        )
        self.success = self.score >= self.threshold
        return self.score

    # Async implementation of measure(). If an async version of the
    # scoring method does not exist, just reuse the synchronous measure().
    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Rouge Metric"
note

Although you're free to implement your own ROUGE scorer, you'll notice that, while not documented, deepeval additionally offers a scorer module for more traditional NLP scoring methods, which can be found here.

Be sure to run pip install rouge-score if rouge-score is not already installed in your environment.

You can now run this custom metric as a standalone in a few lines of code:

...

#####################
### Example Usage ###
#####################
test_case = LLMTestCase(input="...", actual_output="...", expected_output="...")
metric = RougeMetric()

metric.measure(test_case)
print(metric.is_successful())
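Since the custom metric follows deepeval's BaseMetric interface, it should also work with bulk evaluation, for example through the evaluate() function (a sketch, assuming the usual test_cases/metrics keyword arguments):

from deepeval import evaluate
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(input="...", actual_output="...", expected_output="...")

# Evaluate one or more test cases against the custom metric in bulk
evaluate(test_cases=[test_case], metrics=[RougeMetric()])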

Building a Custom Composite Metric

In this example, we'll be combining two default deepeval metrics into our custom metric, which is why we're calling it a "composite" metric.

We'll be combining the AnswerRelevancyMetric and FaithfulnessMetric, since we rarely see a user who cares about one but not the other.

from typing import Optional

from deepeval.metrics import BaseMetric, AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

class FaithfulRelevancyMetric(BaseMetric):
    def __init__(
        self,
        threshold: float = 0.5,
        evaluation_model: Optional[str] = "gpt-4-turbo",
        include_reason: bool = True,
        async_mode: bool = True,
        strict_mode: bool = False,
    ):
        self.threshold = 1 if strict_mode else threshold
        self.evaluation_model = evaluation_model
        self.include_reason = include_reason
        self.async_mode = async_mode
        self.strict_mode = strict_mode

    def measure(self, test_case: LLMTestCase):
        try:
            relevancy_metric, faithfulness_metric = self.initialize_metrics()
            # Remember, deepeval's default metrics follow the same pattern as your custom metric!
            relevancy_metric.measure(test_case)
            faithfulness_metric.measure(test_case)

            # Custom logic to set score, reason, and success
            self.set_score_reason_success(relevancy_metric, faithfulness_metric)
            return self.score
        except Exception as e:
            # Set and re-raise error
            self.error = str(e)
            raise

    async def a_measure(self, test_case: LLMTestCase):
        try:
            relevancy_metric, faithfulness_metric = self.initialize_metrics()
            # Here, we use the a_measure() method instead so evaluation
            # doesn't block other metrics running concurrently
            await relevancy_metric.a_measure(test_case)
            await faithfulness_metric.a_measure(test_case)

            # Custom logic to set score, reason, and success
            self.set_score_reason_success(relevancy_metric, faithfulness_metric)
            return self.score
        except Exception as e:
            # Set and re-raise error
            self.error = str(e)
            raise

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        return self.success

    @property
    def __name__(self):
        return "Composite Relevancy Faithfulness Metric"

    ######################
    ### Helper methods ###
    ######################
    def initialize_metrics(self):
        relevancy_metric = AnswerRelevancyMetric(
            threshold=self.threshold,
            model=self.evaluation_model,
            include_reason=self.include_reason,
            async_mode=self.async_mode,
            strict_mode=self.strict_mode
        )
        faithfulness_metric = FaithfulnessMetric(
            threshold=self.threshold,
            model=self.evaluation_model,
            include_reason=self.include_reason,
            async_mode=self.async_mode,
            strict_mode=self.strict_mode
        )
        return relevancy_metric, faithfulness_metric

    def set_score_reason_success(
        self,
        relevancy_metric: BaseMetric,
        faithfulness_metric: BaseMetric
    ):
        # Get scores and reasons for both
        relevancy_score = relevancy_metric.score
        relevancy_reason = relevancy_metric.reason
        faithfulness_score = faithfulness_metric.score
        faithfulness_reason = faithfulness_metric.reason

        # Custom logic to set score
        composite_score = min(relevancy_score, faithfulness_score)
        self.score = 0 if self.strict_mode and composite_score < self.threshold else composite_score

        # Custom logic to set reason
        if self.include_reason:
            self.reason = relevancy_reason + "\n" + faithfulness_reason

        # Custom logic to set success
        self.success = self.score >= self.threshold

Now go ahead and try to use it:

test_llm.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
...

def test_llm():
    metric = FaithfulRelevancyMetric()
    test_case = LLMTestCase(...)
    assert_test(test_case, [metric])

Then, run the test file in the CLI:

deepeval test run test_llm.py