
Online LLM Evaluations

To monitor how your LLM application performs over time and be alerted of any unsatisfactory LLM responses in production, head to the home page via the left navigation drawer and turn on the metrics you wish to run in production. This is especially helpful for identifying failing responses, serving as a preliminary filter for unsatisfactory responses that require further review.

info

Confident AI will automatically run evaluations for the enabled metrics for all incoming responses.
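
For these evaluations to run, your application needs to log its responses for monitoring. Below is a minimal sketch using deepeval's monitor() helper; the event name, model, and values are illustrative, and the exact parameters accepted may vary with your deepeval version.

```python
import deepeval

# Log a single production response so Confident AI can run the enabled
# online metrics on it (sketch; assumes your Confident AI API key is
# already configured, e.g. via `deepeval login`).
deepeval.monitor(
    event_name="RAG chatbot",  # illustrative name for your LLM use case
    model="gpt-4",             # model that produced the response
    input="What is your refund policy?",
    response="You can request a refund within 30 days of purchase.",
    # Retrieval context is optional but lets retrieval-based metrics run.
    retrieval_context=[
        "Refunds are accepted within 30 days of purchase."
    ],
)
```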

Confident AI supports several default real-time evaluation metrics.

Additionally, Confident AI supports G-Eval metrics for ANY custom use case.

Creating A Custom Metric

To run real-time evaluations using a custom G-Eval metric, first create your custom metric by clicking the Create Custom Metric button.


Here you can define custom criteria, thresholds, strict mode, and reasoning settings, and select which of the evaluation parameters logged during monitoring should be used for evaluation (these are analogous to the LLMTestCaseParams used in G-Eval). Please note that expected_output and context cannot be selected, as these parameters are unavailable for online evaluations.
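
For reference, the settings configured here mirror what you would pass to a G-Eval metric in deepeval when evaluating offline. The sketch below is illustrative (the metric name, criteria, and threshold are made up); it simply shows how the selectable evaluation parameters map onto LLMTestCaseParams, with expected_output and context excluded.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Offline equivalent of a custom metric configured in the UI
# (sketch; name, criteria, and threshold are illustrative).
professionalism = GEval(
    name="Professionalism",
    criteria="Determine whether the response is professional and polite.",
    # Only parameters logged during monitoring can be used online,
    # so EXPECTED_OUTPUT and CONTEXT are not selectable there.
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.5,
    strict_mode=False,
)
```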


Viewing Evaluations

You can view your real-time evaluation results on the graph on the home page. Each data point represents the average score for a given metric across all responses monitored on that day.
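
Concretely, each plotted point is a simple daily mean. The short sketch below illustrates that aggregation with made-up scores and an illustrative metric name.

```python
from collections import defaultdict
from statistics import mean

# Illustrative only: (date, metric, score) records for monitored responses.
scores = [
    ("2024-05-01", "Answer Relevancy", 0.9),
    ("2024-05-01", "Answer Relevancy", 0.7),
    ("2024-05-02", "Answer Relevancy", 0.8),
]

# One data point per metric per monitoring day: the average score
# across all responses evaluated on that day.
daily = defaultdict(list)
for day, metric, score in scores:
    daily[(day, metric)].append(score)

for (day, metric), values in sorted(daily.items()):
    print(day, metric, round(mean(values), 2))
```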
