Tutorial Introduction

DeepEval is the open-source LLM evaluation framework, and in this complete end-to-end tutorial we'll show you exactly how to use DeepEval to improve your LLM application one step at a time. We'll walk you through how to evaluate and test your LLM application all the way from the initial development stages to post-production.

For LLM evaluation in development, we'll cover:

  • How to choose your LLM evaluation metrics and use them in deepeval
  • How to run evaluations in deepeval to quantify LLM application performance (see the sketch after this list)
  • How to use evaluation results to identify system hyperparameters (such as LLMs and prompts) to iterate on
  • How to make your evaluation results more robust by scaling them out to cover more edge cases
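
To give you a concrete sense of what this looks like, here's a minimal sketch of choosing a metric and running an evaluation in deepeval. The test case contents and the threshold are placeholders for illustration, not values from this tutorial.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A single test case: the user's input and your LLM application's actual output
test_case = LLMTestCase(
    input="Should I take ibuprofen or paracetamol for a headache?",
    actual_output="Both can help with headaches; paracetamol is usually gentler on the stomach.",
)

# A metric chosen based on your evaluation criteria (here, answer relevancy)
metric = AnswerRelevancyMetric(threshold=0.7)

# Run the evaluation to quantify performance against the chosen metric
evaluate(test_cases=[test_case], metrics=[metric])
```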

Once your LLM is ready for deployment, for LLM evaluation in production, we'll cover:

  • How to continuously evaluate your LLM application in production (post-deployment, online evaluation)
  • How to use evaluation data in production to A/B test different system hyperparameters (such as LLMs and prompts), as sketched after this list
  • How to use production data to improve your development evaluation workflow over time
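
To A/B test hyperparameters, your evaluation results need to be tied to the hyperparameter values that produced the outputs. Below is a rough sketch of what logging those values alongside an evaluation run might look like; the `hyperparameters` argument and the specific values shown are illustrative assumptions, so check deepeval's documentation for the exact interface.

```python
# Sketch: associating an evaluation run with the hyperparameters that produced
# the outputs, so different configurations can be compared (A/B tested) later.
# NOTE: the `hyperparameters` argument and the values below are illustrative
# assumptions, not a definitive API reference.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Can I take antihistamines with alcohol?",
    actual_output="It's best to avoid alcohol while taking antihistamines.",
)

evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={  # hypothetical configuration to compare against others
        "model": "gpt-4o-mini",
        "system prompt": "You are a helpful medical assistant.",
        "temperature": 0.7,
    },
)
```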
tip

Just because your LLM application is in production doesn't mean you can skip LLM evaluation during development, and the same is true the other way around.

Quick Terminologies

Before diving into the tutorial, let's go over the terminology commonly used in LLM evaluation:

  • Hyperparameters: this refers to the parameters that make up your LLM system. Some examples include system prompts, user prompts, models used for generation, temperature, chunk size (for RAG), etc.
  • System Prompt: this refers to the prompt that sets the overarching instructions that define how your LLM should behave across all interactions.
  • Generation model: this refers to the model used to generate LLM responses based on some input; it is also the LLM being evaluated. We'll refer to it simply as the model throughout this tutorial.
  • Evaluation model: this refers to the LLM used for evaluation, NOT the LLM to be evaluated (see the sketch after this list).
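
To make the generation model vs. evaluation model distinction concrete, here's a rough sketch. The OpenAI call, model names, and prompts are illustrative assumptions; the evaluation model is set via the metric's `model` parameter.

```python
from openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

client = OpenAI()

# Generation model: produces the chatbot's response (the LLM being evaluated)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # generation model (hypothetical choice)
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},  # system prompt
        {"role": "user", "content": "What are common side effects of ibuprofen?"},
    ],
)
actual_output = response.choices[0].message.content

# Evaluation model: judges the response, NOT the one being evaluated
metric = AnswerRelevancyMetric(model="gpt-4o")  # evaluation model (hypothetical choice)

test_case = LLMTestCase(
    input="What are common side effects of ibuprofen?",
    actual_output=actual_output,
)
metric.measure(test_case)
print(metric.score, metric.reason)
```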

What Will We Be Evaluating?

We'll be evaluating an LLM medical chatbot in this tutorial for demonstration purposes, and as you'll see in later sections, we'll start by building one ourselves before showing how you can use evaluation results to iterate on your system prompt and model.

note

Your use case is most likely not a medical chatbot, or even chatbot related, but that's OK. The concept is the same for all use cases - you pick your criteria, use the metrics deepeval offers based on those criteria, and iterate based on the results of these evaluations.

Who Is This Tutorial For?

If you're building applications powered by LLMs, this tutorial is for you. Why? Because LLMs are prone to errors, and this tutorial will teach you exactly how to improve your LLM systems through a systematic, evaluation-guided, data-first approach.