Tutorial: Fine-tuning an Evaluation Metric in 10 Minutes
Evaluating AI systems effectively can be challenging, especially when you need tailored metrics to assess nuanced applications. With AutoEval, you can quickly evaluate application datasets, define evaluation criteria, and fine-tune your own evaluation metrics.
In this practical guide, we'll:
upload application trace data for a customer service chatbot
evaluate performance with pretrained metrics
generate labels representing our own "correctness" criteria
train a model that learns our criteria and produces a "correctness" score
All code samples shown are available in the AutoEval Getting Started notebook.
Let's get started!
AutoEval is a full-stack developer platform to debug, evaluate, and improve LLM applications. It provides out-of-the-box metrics and lets you fine-tune custom evaluators, set up guardrails, and monitor app performance:
Evaluate datasets with built-in metrics like Faithfulness and Relevance.
Label datasets with customizable prompts for tailored assessments.
Fine-tune models to adapt to specific evaluation criteria.
Whether you're optimizing an AI system for customer support or developing domain-specific AI applications, AutoEval provides a robust mechanism for measuring performance.
The AutoEval launch post has more context on what and why. This post covers the how.
The AutoEval platform is accessible both via the UI console at https://lastmileai.dev and via API (Python, Node, REST). In this guide we'll use the Python SDK.
Account Setup
You'll need a LastMile AI token to get started.
Sign up for a free LastMile AI account
Generate an API key in the dashboard
Install lastmile
Install the lastmile package via pip. In this guide we use Python, but you can also use Node.js (TypeScript) with the lastmile package on npm.
pip install lastmile --upgrade
Set API Key
Export the API key you generated as an environment variable, and import it in your code.
export LASTMILE_API_TOKEN="your_api_key"
import os

api_token = os.environ.get("LASTMILE_API_TOKEN")
if not api_token:
    print("Error: Please set your API key in the environment variable LASTMILE_API_TOKEN")
else:
    print("✓ API key successfully configured!")
In this example, we have a Q&A application - an airline's AI customer service chatbot. We are going to measure the quality of the interactions across a set of evaluation criteria.
But first, we need to provide that data to AutoEval.
Note: This particular dataset has been synthetically generated and is available for download here.
All we need is data with some combination of the following columns:
input: Input to the application (e.g. a user question)
output: The response generated by the application (e.g. an LLM generation)
ground_truth: Factual data, either the ideal correct response or the context used to respond (e.g. data retrieved from a vector DB in a RAG application)
In our chatbot example, these interactions include the customer's query, the assistant's response, and the ground truth containing accurate airline policies and procedures.
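For reference, here is a minimal sketch of what a few rows of such a dataset might look like, built with pandas. The rows and file name below are made up for illustration; only the three column names matter.
import pandas as pd

# Hypothetical example rows -- only the column names (input, output, ground_truth) matter.
rows = [
    {
        "input": "What is the baggage allowance in economy class?",
        "output": "Economy passengers may check one bag of up to 23 kg free of charge.",
        "ground_truth": "Economy fares include one checked bag up to 23 kg at no extra cost.",
    },
    {
        "input": "Can I change my flight online?",
        "output": "Yes, you can change your flight from the Manage Booking page.",
        "ground_truth": "Flight changes can be made online via Manage Booking up to 2 hours before departure.",
    },
]

pd.DataFrame(rows).to_csv("my_customer_support_data.csv", index=False)  # ready to upload to AutoEval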
Upload dataset via API
from lastmile.lib.auto_eval import AutoEval
client = AutoEval(api_token=api_token)
dataset_csv_path = "data/strikeair_customer_support/strikeair_customer_support_1024.csv"
dataset_id = client.upload_dataset(
    file_path=dataset_csv_path,
    name="Customer Service App Data",
    description="Dataset containing airline customer chatbot questions and responses",
)
print(f"Dataset created with ID: {dataset_id}")
Upload dataset via UI
You can also use the AutoEval Console to do everything you can do via the API.
To upload, navigate to the Dataset Library and click + New Dataset.
This dataset covers a good distribution of interactions that passengers have with an airline chatbot:
Flight status inquiries
Baggage policy questions
Check-in procedures
Meal options and special requests
We are going to evaluate how well the assistant's responses align with ground truth (correctness/faithfulness), as well as a few more composite metrics.
Let's start by computing some key metrics that come out-of-the-box with AutoEval:
Faithfulness - a proxy for hallucination detection, which measures, for a given input, how faithful the response is to the provided context or ground truth.
Answer relevance - how relevant the response is to the input.
The platform has other useful metrics you can get started with, including summarization score, answer correctness, and toxicity.
from lastmile.lib.auto_eval import BuiltinMetrics
default_metrics = [
    BuiltinMetrics.FAITHFULNESS,
    BuiltinMetrics.RELEVANCE,
]

print("Evaluation starting...")
evaluation_results = client.evaluate_data(
    # Alternatively, you can specify 'data' to be a pandas DataFrame
    # data=dataframe,
    dataset_id=dataset_id,
    metrics=default_metrics,
)
print("Evaluation Results:")
evaluation_results.head(15)
From a cursory look at this data, we can already see that the chatbot's responses are mostly relevant, but in some cases it is hallucinating (faithfulness is very low). We can use this data to pinpoint examples where the application has low faithfulness and debug further.
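For example, you can surface the worst offenders by filtering the results on the faithfulness score. A minimal sketch, assuming the results DataFrame exposes the score in a column named "Faithfulness" (check the actual column names that evaluate_data returns):
# Assumption: evaluation_results has a numeric "Faithfulness" column;
# adjust the name to match the actual output of evaluate_data.
low_faithfulness = evaluation_results[evaluation_results["Faithfulness"] < 0.5]
print(f"{len(low_faithfulness)} rows with faithfulness below 0.5")
low_faithfulness.sort_values("Faithfulness").head(10)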
While the default metrics provide a good starting point for evaluating our application, we usually have some application-specific criteria in mind. We can prompt an LLM with our evaluation criteria and ask it to label each row as good (1) or bad (0) based on those criteria.
AutoEval provides a synthetic data labeling service that uses LLMs to help with labeling. If you do this via the UI console, you can also provide human feedback on the labels to better steer the LLM's decision making.
1. Define an evaluation prompt
Create a prompt template to label the interactions:
prompt_template = """
You are an evaluator model tasked with assessing an airline's AI customer service agent
based on the following criteria. Return a label of 1 or 0 according to these rules:
Label 1:
- If the input is a customer support query and the output provides accurate
and helpful information using details from the ground_truth.
- If the input is not related to customer support and the output politely
informs the user that it cannot assist with that request.
Label 0:
- If the output contains information not present in the ground_truth (not factual).
- If the output does not appropriately address the user's request as per the above guidelines.
- If the output is irrelevant, incorrect, or unhelpful in the context of the input and ground_truth.
Ground Truth:
{ground_truth}
Input:
{input}
Output:
{output}
Label:
"""
2. Label the dataset
You can run labeling from the AutoEval Console:
Open the dataset to label from https://lastmileai.dev/datasets.
Click on Generate LLM Judge Labels
Fill in your evaluation prompt, click Start Labeling, and follow the steps.
Alternatively, you can run labeling from the SDK:
job_id = client.label_dataset(
    dataset_id=dataset_id,
    prompt_template=prompt_template,
    wait_for_completion=False,
)
print(f"Labeling job started with ID: {job_id}")
client.wait_for_label_dataset_job(job_id)
print(f"Labeling Job with ID: {job_id} Completed")
Now, if you download the dataset, you'll see a label column populated for each row:
dataset = client.download_dataset(dataset_id=dataset_id)
dataset.head(10)
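It's also worth glancing at the label distribution before fine-tuning, since a heavily skewed split can make the fine-tuned metric less informative. A quick check, assuming the column is named label as shown above:
# Count how many rows were labeled 1 (good) vs. 0 (bad).
print(dataset["label"].value_counts())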
Now that we have labeled data, we can use it to fine-tune a small language model to learn our custom evaluation criteria from the label distribution.
LastMile has developed 400M-parameter evaluator models (alBERTa 🍁) and has also adapted the state-of-the-art ModernBERT model for eval tasks. These models can be trained quickly and cheaply, and can run inference in under 300 ms. The result is a high-quality model, customized to your evaluation criteria, that provides fast, reliable metrics for your application.
You can fine-tune from the Model Console at https://lastmileai.dev/models, or via the SDK:
fine_tune_job_id = client.fine_tune_model(
    train_dataset_id=dataset_id,
    test_dataset_id=None,  # Optional -- if you have test data, upload it as a dataset first
    model_name="Fine-Tuned Evaluator",
)
client.wait_for_fine_tune_job(fine_tune_job_id)
print("Fine-tuning completed!")
Once the fine-tuning job completes (it only takes a few minutes), the model is automatically deployed for you to run inference.
You can also find the model in the Model Console and open its Fine-tune Info tab to see training metrics. The Overview tab also has a playground you can try.
Finally, let's use this fine-tuned model to compute evaluation metrics!
from lastmile.lib.auto_eval import Metric
metric = Metric(name="Fine-Tuned Evaluator") # This is the name of the model you specified in fine_tune_model
print(f"After fine-tuning, the model is deployed to the inference server, so ensure it is online...")
fine_tuned_metric = client.wait_for_metric_online(metric)
print(f"Fine-tuned model available as metric with ID: {fine_tuned_metric.id}")
# Run evaluation using our fine-tuned model
results = client.evaluate_dataset(dataset_id, fine_tuned_metric)
# Display results
results.head(10)
There should be a high correlation between the labels and the scores (high scores should correspond to positive labels).
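A quick way to check this is to correlate the LLM-judge labels with the fine-tuned metric's scores. A sketch, assuming the results DataFrame has a score column named after the metric ("Fine-Tuned Evaluator") and the labeled dataset has a "label" column; adjust the column names to match your actual output:
import pandas as pd

# Assumption: 'results' and 'dataset' are row-aligned, 'results' has a
# "Fine-Tuned Evaluator" score column, and 'dataset' has a "label" column.
scores = pd.to_numeric(results["Fine-Tuned Evaluator"], errors="coerce")
labels = pd.to_numeric(dataset["label"], errors="coerce")
print(f"Correlation between labels and fine-tuned scores: {scores.corr(labels):.3f}")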
You can learn more about fine-tuning eval metrics in the platform docs.
AutoEval simplifies the process of creating customized evaluation metrics, saving you time and effort. Even more importantly, it gives you the tools and metrics you need to systematically improve your AI applications and shorten your time to production.
In this guide, we showed how to:
Evaluate datasets using built-in metrics.
Label datasets with custom evaluation criteria.
Fine-tune models for domain-specific evaluation metrics.
Ready to try AutoEval? Visit the LastMile AI platform and start building better evaluations today!