Tutorial: Fine-tuning an Evaluation Metric in 10 Minutes

Written By
Sarmad Qadri

Evaluating AI systems effectively can be challenging, especially when you need tailored metrics to assess nuanced applications. With AutoEval, you can quickly evaluate application datasets, define evaluation criteria, and fine-tune evaluation models to design your own metric.

In this guide, we'll:

  • upload application trace data for a customer service chatbot

  • evaluate performance with pretrained metrics

  • generate labels representing our own "correctness" criteria

  • train a model that learns our criteria and produces a "correctness" score

All code samples shown are available in the AutoEval Getting Started notebook.

Let's get started!

What is AutoEval?

AutoEval is a full-stack developer platform to debug, evaluate, and improve LLM applications. It provides out-of-the-box metrics and lets you fine-tune custom evaluators, set up guardrails, and monitor app performance:

  • Evaluate datasets with built-in metrics like Faithfulness and Relevance.

  • Label datasets with customizable prompts for tailored assessments.

  • Fine-tune models to adapt to specific evaluation criteria.

Whether you're optimizing an AI system for customer support or developing domain-specific AI applications, AutoEval provides a robust mechanism for measuring performance.

The AutoEval launch post has more context on what and why. Keep reading for more about the how.

Setup and Installation

The AutoEval platform is accessible both via the UI console at https://lastmileai.dev and via API (Python, Node, REST). In this guide we'll use the Python SDK.

Account Setup

You'll need a LastMile AI token to get started.

  1. Sign up for a free LastMile AI account

  2. Generate an API key in the dashboard

Install lastmile

Install the lastmile package via pip

pip install lastmile --upgrade

Set API Key

Export the API key you generated as an environment variable, and import it in your code.

export LASTMILE_API_TOKEN="your_api_key"
import os

api_token = os.environ.get("LASTMILE_API_TOKEN")

if not api_token:
    print("Error: Please set your API key in the environment variable LASTMILE_API_KEY")
else:
    print("✓ API key successfully configured!")

Connect your application data

In this example, we have a Q&A application - an AI customer service chatbot for an airline. We are going to measure the quality of the interactions across a set of evaluation criteria. But first, we need to provide that data to AutoEval.

Note: This particular dataset has been synthetically generated and is available for download here.

All we need is data with some combination of the following columns:

  • input: Input to the application (e.g. a user question)

  • output: The response generated by the application (e.g. LLM generation)

  • ground_truth: Factual data, either the ideal correct response, or context used to respond (e.g. data retrieved from a vector DB in a RAG application)

In our chatbot example, each interaction includes the customer's query, the assistant's response, and ground truth containing the accurate airline policies and procedures.
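
For reference, here's a minimal sketch of what this data can look like as a pandas DataFrame (the rows below are illustrative only, not drawn from the actual dataset). A DataFrame in this shape can be passed to the evaluation API via the data parameter instead of uploading a CSV, as shown later in this guide.

import pandas as pd

# Illustrative rows only -- the real data is the synthetic CSV linked above.
example_data = pd.DataFrame(
    {
        "input": [
            "Can I bring a carry-on and a personal item?",
            "How early should I arrive for an international flight?",
        ],
        "output": [
            "Yes, each passenger may bring one carry-on bag and one personal item.",
            "We recommend arriving at least 3 hours before an international departure.",
        ],
        "ground_truth": [
            "Passengers are allowed one carry-on bag and one personal item free of charge.",
            "International travelers should arrive at the airport 3 hours before departure.",
        ],
    }
)
print(example_data)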

Upload dataset via API

from lastmile.lib.auto_eval import AutoEval

client = AutoEval(api_token=api_token)

dataset_csv_path = "data/strikeair_customer_support/strikeair_customer_support_1024.csv"

dataset_id = client.upload_dataset(
    file_path=dataset_csv_path,
    name="Customer Service App Data",
    description="Dataset containing airline customer chatbot questions and responses"
)

print(f"Dataset created with ID: {dataset_id}")

Upload dataset via UI

You can also use the AutoEval console to do everything you can do via API.

To upload, navigate to the Dataset Library and click + New Dataset.

Evaluate with built-in Metrics

This dataset covers a good distribution of interactions that passengers have with an airline chatbot:

  • Flight status inquiries

  • Baggage policy questions

  • Check-in procedures

  • Meal options and special requests

We are going to evaluate how well the assistant's responses align with the ground truth (correctness), as well as a few other composite metrics.

Let's start by computing some key metrics that come out-of-the-box with AutoEval:

  • Faithfulness - a proxy for hallucination detection that measures, for a given input, how faithful the response is to the provided context or ground truth

  • Answer relevance - how relevant the response is to the input.

The platform has other useful metrics you can get started with, including summarization score, answer correctness, and toxicity.

from lastmile.lib.auto_eval import BuiltinMetrics
default_metrics = [
    BuiltinMetrics.FAITHFULNESS,
    BuiltinMetrics.RELEVANCE,
]

print(f"Evaluation job kicked off")
evaluation_results = client.evaluate_data(
    dataset_id=dataset_id,
    # Alternatively, you can specify 'data' to be a pandas DataFrame
    # data=dataframe,
    metrics=default_metrics,
)

print("Evaluation Results:")
evaluation_results.head(15)

From a cursory look at this data, we can already see that the chatbot's responses are mostly relevant, but in some cases it is hallucinating (faithfulness is very low).
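
To dig into those hallucinations, you can filter the results for low faithfulness scores. The snippet below is a minimal sketch: the score column name ("Faithfulness_score") is an assumption, so check evaluation_results.columns for the exact names returned by your SDK version.

# Assumed column name -- verify against evaluation_results.columns
score_col = "Faithfulness_score"

if score_col in evaluation_results.columns:
    # Keep only the rows where the response scored poorly on faithfulness
    low_faithfulness = evaluation_results[evaluation_results[score_col] < 0.5]
    print(f"{len(low_faithfulness)} responses scored below 0.5 on faithfulness")
    print(low_faithfulness.head())
else:
    print(f"Column '{score_col}' not found. Available columns: {list(evaluation_results.columns)}")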

