Online LLM Evaluations
To monitor how your LLM application performs over time, and to be alerted to unsatisfactory LLM responses in production, head to the home page via the left navigation drawer and turn on the metrics you wish to enable in production. This is especially helpful for identifying failing responses, serving as a preliminary filter for unsatisfactory responses that require further investigation.
Confident AI will automatically run evaluations for the enabled metrics for all incoming responses.
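For context, responses are typically logged to Confident AI at inference time using the deepeval Python client. The sketch below assumes a `deepeval.monitor()` call with illustrative field values; the exact arguments (event name, model string, retrieval context) are assumptions and may vary with your SDK version.

```python
import deepeval

# Log a single production response to Confident AI for monitoring.
# Once metrics are enabled on the home page, Confident AI evaluates
# each monitored response against them automatically.
deepeval.monitor(
    event_name="RAG Chatbot",  # illustrative event name
    model="gpt-4",             # model that produced the response
    input="What is your refund policy?",
    response="You can request a refund within 30 days of purchase.",
    retrieval_context=[        # optional: retrieved chunks, used by
        "Refunds are accepted within 30 days of purchase."
    ],                         # retrieval-related metrics
)
```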
Confident AI supports multiple default real-time evaluation metrics, including:
- Answer Relevancy
- Faithfulness
- Retrieval Quality
Additionally, Confident AI supports G-Eval metrics for ANY custom use case.
Creating A Custom Metric
To run real-time evaluations using a custom G-Eval metric, first create your custom metric by clicking the "Create Custom Metric" button.
Here you can define custom criteria, thresholds, strict modes, and reasoning settings, and select which of the evaluation parameters logged during monitoring should be used for evaluation (these are analogous to the `LLMTestCaseParams` used in G-Eval). Please note that this setup does not include `expected_output` and `context`, as these parameters are unavailable for online evaluations.
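As a point of reference, the settings configured in the dashboard mirror those of a G-Eval metric defined locally with deepeval. The sketch below is illustrative; the metric name, criteria, threshold, and parameter choices are assumptions, not a prescribed configuration.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# A locally defined G-Eval metric with settings analogous to the
# custom metric form: criteria, threshold, strict mode, and the
# evaluation parameters to judge (no expected_output or context,
# since those are unavailable for online evaluations).
answer_quality = GEval(
    name="Answer Quality",  # illustrative metric name
    criteria="Assess whether the actual output directly and accurately addresses the input.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.5,
    strict_mode=False,
)
```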
Viewing Evaluations
Easily view your real-time evaluation results on the graph provided on the home page. Each data point represents a metric's average score across all responses monitored on that day.