Iterating on Hyperparameters

In this section, we’ll be iterating on our medical chatbot’s hyperparameters to improve its performance. Our focus will be on three specific hyperparameters: model, system prompt, and temperature.

note

Although we’re primarily focusing on these 3 hyperparameters, there are many other hyperparameters to optimize, such as chunk_size, embedding_model, and top-k for the retriever, among others.

Quick Recap

Before we begin, let's briefly recap what we covered in the last section. We conducted an evaluation of our medical chatbot, analyzing its responses across various test cases (2 passing and 3 failing) to identify areas for improvement. This evaluation highlighted several key shortcomings:

Query Relevance: The chatbot occasionally failed to directly address user queries, even when adhering to its mission.
Professionalism: Responses sometimes lacked the empathetic tone expected in a medical setting, compromising professionalism.
Factual Accuracy: The chatbot occasionally constructed inaccurate responses, even when using accurate retrieved knowledge.

As a reminder in our medical chatbot example, we utilized the following model, temperature, and system prompt configurations:

MODEL = "gpt-3.5"
TEMPERATURE = 1.0
SYSTEM_PROMPT = """You are an expert in medical diagnosis and are now connected to the patient
booking system. The first step is to create the appointment and record the symptoms. Ask for specific
symptoms! Then after the symptoms have been created, make a diagnosis. Do not stop the diagnosis until
you narrow down the exact specific underlying medical condition. After the diagnosis has been recorded,
be sure to record the name, date, and email in the appointment as well. Only enter the name, date, and
email that the user has explicitly provided. Update the symptoms and diagnosis in the appointment.
Once all information has been recorded, confirm the appointment."""

info

For most use cases, using a more advanced model and a lower temperature typically leads to more accurate and consistent results.

Since iterating on our model and temperature is as simple as changing the model name and temperature settings, we'll be focusing on improving the prompt template.

Improving Your System Prompt

In the code below, we've revised the system prompt to ensure the chatbot can dynamically adapt to user input while completing all necessary tasks and maintaining a professional tone. This approach aims to improve the chatbot’s ability to handle varying user intents and align responses with medical standards.

SYSTEM_PROMPT = """You are a highly skilled medical professional assisting with patient diagnosis and appointment scheduling.
Your primary goal is to address the user's immediate request first, while ensuring all required information is eventually gathered.
Follow these steps to handle the interaction effectively:
  1. Identify the user's intent. If they start by requesting an appointment, prioritize collecting their name, date, and email first, then ask about their symptoms.
  2. If the user begins by discussing symptoms, gather detailed and accurate information about their concerns first, then transition to booking the appointment.
  3. Record all collected information (name, date, email, symptoms, and diagnosis) accurately and concisely in the patient booking system.
  4. Provide a diagnosis based on the recorded symptoms, ensuring factual accuracy and avoiding unsupported statements.
  5. Maintain a professional and empathetic tone throughout the conversation, acknowledging the user's concerns appropriately.
  6. Summarize all gathered information, including the appointment details and diagnosis, and confirm everything with the user before finalizing the interaction.
"""

Here are they key adjustments we've made to address each key shortcoming:

Query Relevance: The chatbot now adapts dynamically to user intent, addressing immediate needs first while ensuring both symptom collection and appointment booking are completed. This ensures responses are tailored and relevant to the user's input.
Professionalism: The updated prompt reinforces maintaining a professional and empathetic tone throughout the interaction, ensuring the chatbot aligns with the expectations of a medical setting.
Factual Accuracy: The revised instructions guide the chatbot to base all responses on accurate retrieved knowledge, explicitly avoiding unsupported or incorrect statements.

Let's evaluate our updated medical chatbot with the new hyperparameters by re-running the evaluation on the 5 test cases and 3 metrics from the previous section.

Re-Evaluating Your LLM Application

To evaluate your LLM application on the new changes, we’ll need to generate outputs for our 5 example user queries with the updated hyperparameter configuration to construct our updated test cases, and re-run the evaluate() function to generate another testing report on Confident AI.

from deepeval import evaluate

...
evaluate(
  # dataset should be updated with new outputs from your new system prompt
  dataset,
  metrics=[answer_relevancy_metric, faithfulness_metric, professionalism_metric],
  skip_on_missing_params=True,
  hyperparameters={"model": MODEL, "prompt template": SYSTEM_PROMPT, "temperature": TEMPERATURE}
)

caution

Although we've pre-computed the actual outputs and retrieval context, you will need to populate these fields at evaluation time by running the function that calls your LLM application (e.g. chatbot.generate(..)).

Let's take a look at our new evaluation results. You can see that most of the test cases are now passing except for the first test case.

Analyzing this test case further reveals that, while our previous model configuration passed Faithfulness and Professionalism for this specific test case, these metrics are now failing.

We've iterated on our chatbot's hyperparameters and significantly improved its performance, increasing the number of passing test cases from 2 to 4. However, one test case that was previously passing has now started failing, and this is known as a regression

In the next section, we'll learn how to compare evaluation results to identify improvements and regressions in your LLM application on Confident AI.

tip

The ability to identify any regressions is incredibly helpful in safeguarding against breaking changes in your LLM application.

Quick Recap​

Improving Your System Prompt​

Re-Evaluating Your LLM Application​

Quick Recap

Improving Your System Prompt

Re-Evaluating Your LLM Application