LLM Application Monitoring in Production

Quick Summary

deepeval allows you to live monitor LLM responses in production, which you can enable real-time evaluations on. By monitoring responses, you can leverage our hosted evaluation infrastructure to identify unsatisfactory LLM responses, have human annotators send feedback to such responses, and improve your evaluation dataset over time.

Monitoring Live Responses

To monitor LLM responses, use the deepeval.track(...) method in your LLM application to start tracking responses.

import deepeval

# At the end of your LLM call,
# usually in your backend API handler
deepeval.track(
    event_name="Chatbot",
    model="gpt-4",
    input="input",
    response="response"
)

There are four mandatory and ten optional parameters when using the track() function to monitor responses in production:

event_name: type str specifying the type of event tracked.
model: type str specifying the name of the LLM model used.
input: type str.
response: type str.
[Optional] retrieval_context: type list[str] that indicates the context that were retrieved in your RAG pipeline.
[Optional] additional_data: type dict[str, Union[str, dict, Link]]. See below for further details.
[Optional] hyperparameters: type dict[str, Union[str, int, float]]. You can provide a dictionary to specify any additional hyperparamters used to generate the response.
[Optional] distinct_id: type str to identify end users using your LLM application.
[Optional] conversation_id: type str to group together multiple messages under a single conversation thread.
[Optional] completion_time: type float that indicates how many seconds it took your LLM application to complete.
[Optional] token_usage: type float
[Optional] token_cost: type float
[Optional] fail_silently: type bool. You should set this to False in development to check if tracking is working properly. Defaulted to False.
[Optional] raise_expection: type bool. You should set this to False in production if you don't want to raise expections in production. Defaulted to True.

caution

Please do NOT provide placeholder values for optional parameters. Leave it blank instead.

The track() function returns an event_id upon a successful API request to Confident's servers, which you can later use to send human feedback regarding a particular LLM response you've tracked.

import deepeval

event_id = deepeval.track(...)

Congratulations! With a few lines of code, deepeval will now automatically log all LLM responses in production to Confident AI.

Tracking Custom Hyperparameters

In addition to tracking which model was used to generate each response, you can also associate any custom hyperparameters you wish to each response you're tracking.

import deepeval

deepeval.track(
    ...
    model="gpt-4",
    hyperparameters={
        "prompt template": "...",
        "temperature": 1,
        "chunk size": 500
    }
)

info

Logging hyperparameters allows you to more easily filter and search for different responses on Confident AI.

Tracking Additional Custom Data

Similar to hyperparameters, you can easily associate custom additional data for each response.

import deepeval
from deepeval.events.api import Link

deepeval.track(
    ...
    additional_data={
        "Example Text": "...",
        # the Link class allows you to access the link directly on Confident AI
        "Example Link": Link(value="https://www.youtube.com"),
        "Example list of Links": [Link(value="https://www.instagram.com")],
        "Example JSON": {"Any Key": "Any Value"}
    },
)

Note that you can log either text, a Link, list of Links, or any custom dict (as shown in the "Example Json"). Similar to hyperparameter, you can also filter and search for different responses based on the provided custom data.

Did you know?

Although you can technically log a link as a native string, the Link(value="...") class allows you to access said link directly on Confident AI. You should also aim to have your Links start with 'https://...'.

Sending Human Feedback

deepeval allows you to send human feedback for a particular LLM response by specifying an event_id returned through the track() function.

There are two types of feedback:

User provided feedback. A user feedback, is feedback provided by your end users. If you're building a discord bot for example, this will likely be the users interacting with your discord bot on a daily basis.
Reviewer provided feedback. A reviewer feedback, is feedback provided by your LLM quality assurance personnel, whoever they might be. These might be data annotators and labellers, or even domain experts that checking the quality of LLM responses in production.

Using the event_id returned from track(), here's how you can send feedback to Confident:

import deepeval

event_id = deepeval.track(...)

deepeval.send_feedback(
    event_id=event_id,
    provider="user",
    rating=6,
    explanation="...",
    expected_response="..."
)

There are three mandatory and four optional parameters when using the send_feedback() function:

event_id: a string representing the event_id returned from the deepeval.track() function.
provider: a string of either "user" OR "reviewer" to specify the type of feedback provider.
rating: an integer ranging from 1 - 5, inclusive.
[Optional] explanation: a string that serves as the explanation for the given rating.
[Optional] expected_response: a string representing what the ideal response is.
[Optional] fail_silently: a boolean which when set to True will neither print nor raise any exceptions on error. Defaulted to False.
[Optional] raise_expection: a boolean which when set to True will not raise any expections on error. Defaulted to True.

Responses on Confident AI

Confident offers an observatory to view responses and identify ones where you want to augment your evaluation dataset with.

If you're building an LLM chatbot, you can also view entire conversation threads via the conversation_id.

Enable Real-Time Evals

To monitor how your LLM application is performing over time, and be alerted of any unsatisfactory LLM responses in production, head to the "projects details" section via the left navigation drawer to turn on the metrics you wish to enable in production. Confident AI will automatically run evaluations for enabled metrics for all incoming events.

Quick Summary​

Monitoring Live Responses​

Tracking Custom Hyperparameters​

Tracking Additional Custom Data​

Sending Human Feedback​

Responses on Confident AI​

Enable Real-Time Evals​