
Improving your Hyperparameters

If you've successfully run your evaluation, you'll be taken to the automatically generated test report on Confident AI. There, we'll analyze the results and use those insights to refine our hyperparameters.

note

If you haven't done so already, please log in to Confident AI and go back to the previous section to re-run your evaluations:

deepeval login

Analyzing your Test Report

Once your evaluation is complete, you'll be redirected to the following page on Confident AI. Here, you'll find a detailed testing report for the test cases we defined and evaluated in the previous section. Each test case includes its status (pass or fail), input, and actual output. Clicking on a test case will reveal additional parameters and metric scores.

Let's inspect the test report:

We can see that Faithfulness is the most significant issue: our QA system often responds speculatively to questions that go beyond the knowledge base. Answer Relevancy is less of a concern, but failures still occur in some cases.

Improving your Hyperparameters

Let's refine our hyperparameters by adjusting the prompt. The testing report shows that failures aren't caused by insufficient relevant context but rather by the required information being outside the knowledge base itself. As a result, increasing top-k won’t help.

This was our original prompt template that was defined in the introduction section:

prompt_template = """You are a helpful QA Agent designed to answer user questions
about a company's products and services. Your goal is to provide accurate, relevant,
and well-structured responses based on the information retrieved from the company's
knowledge base.
"""

Let's revise this prompt template and add a few notes to ensure it doesn't speculate:

prompt_template = """You are a helpful QA Agent designed to answer user questions
about a company's products and services. Your goal is to provide accurate, relevant,
and well-structured responses based on the information retrieved from the company's
knowledge base.

- Always ensure your answers are factually correct and free from speculation.
- If the requested information is not available in the knowledge base, state that you do not have sufficient information rather than making assumptions.
- Keep responses concise, clear, and user-friendly.
"""

Additionally, let's switch the model to gpt-4o to ensure that our QA agent truly adheres to these instructions, as Answer Relevancy scores weren't perfect even though the original prompt already asked for relevant responses.
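
How you apply the new model and prompt depends on how your QA agent was wired up in the earlier sections. The snippet below is only a minimal sketch: set_model and set_prompt_template are hypothetical helpers standing in for whatever configuration mechanism your agent actually exposes.

model = "gpt-4o"

# Sketch only: `set_model` and `set_prompt_template` are hypothetical helpers;
# replace them with however your QA agent accepts its model and prompt template.
qa_agent.set_model(model)
qa_agent.set_prompt_template(prompt_template)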

Re-running Evaluations with Improved Hyperparameters

With these updated hyperparameters, let's re-run our evaluations to see whether the changes lead to improvements.

from deepeval.test_case import LLMTestCase
from deepeval import evaluate

for golden in dataset.goldens:
    # Compute the actual output and retrieval context for each golden
    actual_output = qa_agent.generate(golden.input)  # Replace with logic to compute actual output
    retrieval_context = qa_agent.get_retrieval_context()  # Replace with logic to compute retrieval context

    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context
        )
    )

evaluate(
    dataset,
    metrics=[answer_relevancy_metric, faithfulness_metric],
    hyperparameters={"model": model, "prompt template": prompt_template, "top-k": top_k}
)

The test report below shows that most of the test cases are now passing, although a few are still failing.

You'll need to keep iterating on your hyperparameters and repeating the steps on this page until all the test cases pass. After that, you can introduce additional metrics and synthetic data into your evaluation pipeline.
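
If you'd like to compare several candidate configurations in a single pass, one pattern is to loop over the candidates and log each combination through the hyperparameters argument, so the runs are easy to compare side by side on Confident AI. The sketch below assumes the same qa_agent, dataset, metric objects, and top_k from the previous sections, plus the hypothetical set_model helper from above; the model names are just placeholders for whichever variants you want to try.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase

for candidate_model in ["gpt-4o", "gpt-4o-mini"]:
    qa_agent.set_model(candidate_model)  # hypothetical helper, as above

    # Rebuild the test cases with the candidate configuration
    test_cases = []
    for golden in dataset.goldens:
        actual_output = qa_agent.generate(golden.input)
        retrieval_context = qa_agent.get_retrieval_context()
        test_cases.append(
            LLMTestCase(
                input=golden.input,
                actual_output=actual_output,
                retrieval_context=retrieval_context
            )
        )

    # Logging the hyperparameters per run lets Confident AI associate each
    # test run with the exact configuration that produced it
    evaluate(
        test_cases,
        metrics=[answer_relevancy_metric, faithfulness_metric],
        hyperparameters={
            "model": candidate_model,
            "prompt template": prompt_template,
            "top-k": top_k
        }
    )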