Tutorial: Fine-tuning an Evaluation Metric in 10 Minutes
Evaluating AI systems effectively can be challenging, especially when you need tailored metrics to assess nuanced applications. With AutoEval, you can quickly evaluate application datasets, define evaluation criteria, and fine-tune evaluation models to design your own metric.
In this guide, we'll
upload application trace data for a customer service chatbot
evaluate performance with pretrained metrics
generate labels representing our own "correctness" criteria
train a model that learns our criteria and produces a "correctness" score
All code samples shown are available in the AutoEval Getting Started notebook.
Let's get started!
AutoEval is a full-stack developer platform to debug, evaluate and improve LLM applications. It provides out-of-the-box metrics and lets you fine-tune custom evaluators, set up guardrails, and monitor app performance:
Evaluate datasets with built-in metrics like Faithfulness and Relevance.
Label datasets with customizable prompts for tailored assessments.
Fine-tune models to adapt to specific evaluation criteria.
Whether you're optimizing an AI system for customer support or developing domain-specific AI applications, AutoEval provides a robust mechanism for measuring performance.
The AutoEval launch post has more context on what and why. Keep reading for more about the how.
The AutoEval platform is accessible both via the UI console at https://lastmileai.dev and via API (Python, Node, REST). In this guide we'll use the Python SDK.
Account Setup
You'll need a LastMile AI token to get started.
Sign up for a free LastMile AI account
Generate an API key in the dashboard
Install lastmile
Install the lastmile package via pip
pip install lastmile --upgrade
Set API Key
Export the API key you generated as an environment variable, and import it in your code.
export LASTMILE_API_TOKEN="your_api_key"
import os
api_token = os.environ.get("LASTMILE_API_TOKEN")
if not api_token:
print("Error: Please set your API key in the environment variable LASTMILE_API_KEY")
else:
print("✓ API key successfully configured!")
In this example, we have a Q&A application - an AI customer service chatbot for an airline. We are going to measure the quality of the interactions across a set of evaluation criteria. But first, we need to provide that data to AutoEval.
Note: This particular dataset has been synthetically generated and is available for download here.
All we need is data with some combination of the following columns:
input: Input to the application (e.g. a user question)
output: The response generated by the application (e.g. LLM generation)
ground_truth: Factual data, either the ideal correct response, or context used to respond (e.g. data retrieved from a vector DB in a RAG application)
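For illustration, here is a minimal sketch of what a few rows of such a dataset might look like, built as a pandas DataFrame and saved to CSV. The column names match the schema above; the example rows and the output file name are hypothetical.
import pandas as pd

# Hypothetical example rows illustrating the expected schema
example_df = pd.DataFrame(
    {
        "input": [
            "What is your carry-on baggage size limit?",
            "Can I change my flight after check-in?",
        ],
        "output": [
            "Carry-on bags must not exceed 22 x 14 x 9 inches.",
            "Yes, flight changes are allowed up to 2 hours before departure for a fee.",
        ],
        "ground_truth": [
            "Carry-on baggage must fit within 22 x 14 x 9 inches, including handles and wheels.",
            "Flight changes are permitted up to 2 hours before departure; a change fee may apply.",
        ],
    }
)
example_df.to_csv("example_customer_support.csv", index=False)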
In our chatbot example, each interaction includes the customer's query, the assistant's response, and the ground truth containing accurate airline policies and procedures.
Upload dataset via API
from lastmile.lib.auto_eval import AutoEval
client = AutoEval(api_token=api_token)
dataset_csv_path = "data/strikeair_customer_support/strikeair_customer_support_1024.csv"
dataset_id = client.upload_dataset(
file_path=dataset_csv_path,
name="Customer Service App Data",
description="Dataset containing airline customer chatbot questions and responses"
)
print(f"Dataset created with ID: {dataset_id}")
Upload dataset via UI
You can also use the AutoEval console to do everything you can do via API.
To upload, navigate to the Dataset Library and click + New Dataset
This dataset covers a good distribution of interactions that passengers have with an airline chatbot:
Flight status inquiries
Baggage policy questions
Check-in procedures
Meal options and special requests
We are going to evaluate how well the assistant's responses align with the ground truth (correctness), as well as a few other composite metrics.
Let's start by computing some key metrics that come out-of-the-box with AutoEval:
Faithfulness - a proxy for hallucination detection, which measures, for a given input, how faithful the response is to the provided context or ground truth
Answer relevance - how relevant the response is to the input.
The platform has other useful metrics you can get started with, including summarization score, answer correctness, and toxicity.
from lastmile.lib.auto_eval import BuiltinMetrics
default_metrics = [
BuiltinMetrics.FAITHFULNESS,
BuiltinMetrics.RELEVANCE,
]
print(f"Evaluation job kicked off")
evaluation_results = client.evaluate_data(
dataset_id=dataset_id,
# Alternatively, you can specify 'data' to be a pandas DataFrame
# data=dataframe,
metrics=default_metrics,
)
print("Evaluation Results:")
evaluation_results.head(15)
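As the comment in the snippet above notes, evaluate_data can also take an in-memory pandas DataFrame via the data parameter instead of a dataset ID. Here's a minimal sketch of that variant, assuming the DataFrame uses the same input/output/ground_truth columns and reusing the dataset_csv_path, client, and default_metrics defined earlier.
import pandas as pd

# Load the same CSV into a DataFrame and evaluate it directly,
# without uploading it as a dataset first.
dataframe = pd.read_csv(dataset_csv_path)

inline_results = client.evaluate_data(
    data=dataframe,
    metrics=default_metrics,
)
inline_results.head(15)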
From a cursory look at this data, we can already see that the chatbot's responses are mostly relevant, but in some cases it is hallucinating (faithfulness is very low).
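To dig in beyond a cursory look, you can sort the results by the faithfulness score and inspect the lowest-scoring rows. This is a sketch only: the exact name of the score column returned by evaluate_data is an assumption here, so verify it against evaluation_results.columns in your own run.
# Assumed score column name; check evaluation_results.columns to confirm.
score_column = "Faithfulness"

if score_column in evaluation_results.columns:
    # Lowest faithfulness scores are the most likely hallucinations.
    lowest = evaluation_results.sort_values(score_column).head(10)
    print("Lowest-faithfulness rows (likely hallucinations):")
    print(lowest)
else:
    print(f"Column '{score_column}' not found; available columns: {list(evaluation_results.columns)}")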