AutoEval Fine-tuning Benchmarks
AutoEval is a cutting-edge platform designed to help developers debug, evaluate, and enhance AI applications. It stands out as the only service that enables fine-tuning of small language models specifically for evaluation tasks. This unique capability allows developers to create custom metrics tailored precisely to their AI applications, so they can evaluate performance efficiently.
In this post, we present benchmarks of fine-tuning alBERTa models for specific evaluation metrics. These benchmarks showcase the value of fine-tuning and give some insight into how much data is required to achieve highly accurate custom metrics.
We measure two key dimensions of the fine-tuned models: accuracy and speed (inference time), and demonstrate that you can achieve better accuracy at 10x the speed. For example, for faithfulness, we can achieve 91-96% accuracy with dramatically faster and cheaper inference than an LLM Judge using gpt-4o:
At its core, AutoEval solves a critical challenge in generative AI development: ensuring that evaluation metrics accurately reflect the performance of an AI application. With these metrics, developers can iterate quickly and deploy their applications with confidence.
AutoEval is designed to assess the quality of results from complex AI applications, such as Retrieval-Augmented Generation (RAG) pipelines and agent systems. It evaluates whether these applications are performing well using a comprehensive suite of metrics. These metrics include Faithfulness, Relevance, and Summarization, each targeting a specific aspect of the system's performance.
You can get started with AutoEval from https://lastmileai.dev.
Rigorous benchmarking provides confidence in the reliability and accuracy of AutoEval’s fine-tuned models, ensuring they deliver consistent, domain-specific insights for evaluating AI applications.
Performance Validation: It ensures that the fine-tuned models meet the desired predictive performance standards for their intended evaluation tasks.
Efficiency Assessment: It evaluates the efficiency of pre-trained models and tracks the impact of fine-tuning on adapting these models to specific evaluation criteria.
Baseline Comparison: By comparing the performance of fine-tuned models against baseline models, benchmarking helps identify improvements or regressions in the evaluation capabilities.
Overview
We present performance and efficiency benchmarks for three key metrics: Faithfulness, Summarization and Relevance. This evaluation involves two primary components:
Baseline Models:
For each criterion, baseline models are selected based on the availability of competitive solutions. The purpose of this comparison is to establish a performance baseline for each metric, assessing how AutoEval's baseline models stand against existing industry benchmarks. These baselines include:
LLMs typically used for LLM Judge evals, such as GPT-4o
LastMile's alBERTa model (AutoEval's pre-trained model)
Fine-Tuned Models:
Models fine-tuned on AutoEval that are trained for each metric using curated datasets. This step explores the impact of fine-tuning on the performance of evaluative models. Specifically, it measures how fine-tuning influences predictive accuracy and task-specific alignment compared to both the baseline AutoEval models and external competitors.
Benchmarking Setup
We evaluate performance across multiple metrics using well-known benchmarking datasets. For each evaluation metric, the process is designed to analyze performance across Baseline Models and Fine-Tuned Models described above.
The number of training and testing samples varies for each dataset based on available data, and fine-tuning datasets are sampled at multiple sizes. The aim is to investigate whether the size of fine-tuning data influences the models' predictive accuracy and efficiency.
Datasets and benchmarking code are available upon request.
Evaluation Metrics
Accuracy: Measures the correctness of pairwise comparisons for each metric using the test (holdout) data.
Inference Efficiency: This metric measures the CPU wall-clock time required to process the test dataset, with the total time expressed in seconds. Specifically, for the fine-tuned models and the LastMile baseline, the evaluation function processes the test data in batches of 25 samples, passing each batch to the API for inference. The other baseline models take the data as a dataframe and return the results.
Inference time is measured as CPU wall-clock time, using time.time() around the call that returns all results. This provides a plausible estimate of the total elapsed time (including network latency) for inference, capturing the time taken to handle all test examples end-to-end.
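For reference, the timing harness is conceptually similar to the sketch below. The evaluate_batch function is a placeholder for the actual API call (not the real AutoEval SDK interface), and the batch size of 25 matches the setup described above.

```python
import time

BATCH_SIZE = 25  # batch size used for the fine-tuned and LastMile models

def timed_inference(test_rows, evaluate_batch):
    """Run batched inference and return (predictions, elapsed wall-clock seconds).

    `evaluate_batch` is a placeholder for the API call that scores a list of rows;
    the real AutoEval client interface may differ.
    """
    predictions = []
    start = time.time()  # CPU wall-clock start
    for i in range(0, len(test_rows), BATCH_SIZE):
        batch = test_rows[i : i + BATCH_SIZE]
        predictions.extend(evaluate_batch(batch))  # includes network latency
    elapsed = time.time() - start  # end-to-end elapsed time in seconds
    return predictions, elapsed
```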
Faithfulness
Faithfulness is measured by assessing the probability (0 → 1) of how well an LLM-generated output adheres to the given context or ground truth. The key question addressed is: To what extent does the generated output remain faithful to the provided information without introducing inaccuracies or hallucinations?
Dataset Details
Source: HaluEval Benchmark
Fine-Tuning Data
Sampling Sizes: Fine-tuning datasets are sampled at sizes of [100, 200, 400, 1000].
Required fields:
input - e.g. user query
ground_truth - context used to generate output
output - LLM response
label - 0 = Hallucination (Not Faithful), 1 = Factual (Faithful)
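For concreteness, a single fine-tuning record with these fields might look like the example below. The values are invented for illustration (not drawn from HaluEval), and the exact file format AutoEval accepts may differ.

```python
# Illustrative faithfulness fine-tuning record; field names from the post, values invented.
example_row = {
    "input": "When was the Eiffel Tower completed?",            # user query
    "ground_truth": "The Eiffel Tower was completed in 1889.",  # context used to generate output
    "output": "The Eiffel Tower was finished in 1925.",         # LLM response
    "label": 0,  # 0 = Hallucination (Not Faithful), 1 = Factual (Faithful)
}
```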
Test Data
Size: A fixed set of 1000 samples is used for benchmarking both baseline and fine-tuned models to maintain consistency.
Evaluation: The test dataset is used to benchmark model performance for baselines and fine-tuned models.
Benchmarking models
Baseline models:
LastMile_Faithfulness (alBERTa)
Phoenix Arize - GPT-4o and GPT-4o-mini
Fine-tuned models (alBERTa): LastMile_100, LastMile_200, LastMile_400, LastMile_1000
alBERTa fine-tuned with 100, 200, 400, and 1000 training samples, respectively.
Results
Accuracy: We observe that fine-tuning with even a small amount of data (100 samples), yields > 90% metric accuracy, surpassing gpt-4o. Fine-tuning with 1000 samples yields 96% accuracy, dramatically better than 88% with gpt-4o.
An interesting observation is that increasing the number of samples from 100 to 200 does not result in a significant improvement in the model's performance. This may occur for a few reasons:
Challenging Samples and distribution shift: Introducing new training samples that are more difficult or complex to learn may initially degrade the model's performance. These challenging samples can cause shifts from the existing data distribution, making it harder for the model to generalize effectively.
Need for Similar Samples: If the new samples are challenging, the model may require additional, similar samples to stabilize and refine its parameters.
Efficiency: Across all fine-tuned models, inference speed is 10x better than gpt-4o-mini, and 13x better than gpt-4o. This is despite the fine-tuned models being served on CPU.
Relevance
The relevance metric measures the semantic similarity between an input string and an output string, for example:
Answer/Retrieved Context relevancy.
Question/Answer relevancy.
For benchmarking, we measure whether the model correctly predicts pairwise comparisons of relevance evaluation values for (input, output, label) triples. Each input is paired with an output labeled as relevant (1) or irrelevant (0).
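One plausible way to compute this pairwise accuracy is sketched below; score is a stand-in for any relevance scorer (a fine-tuned alBERTa model, an LLM judge, or cosine similarity) rather than a specific AutoEval API.

```python
def pairwise_accuracy(triples, score):
    """Fraction of (question, relevant_answer, irrelevant_answer) triples ranked correctly.

    `score(input_text, output_text)` is a placeholder for any relevance scorer that
    returns a higher value for more relevant outputs.
    """
    correct = 0
    for question, relevant_answer, irrelevant_answer in triples:
        if score(question, relevant_answer) > score(question, irrelevant_answer):
            correct += 1
    return correct / len(triples)
```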
Dataset Details
Source: WikiQA dataset – WikiQA-test.
Data Overview:
6165 question-answer pairs.
293 rows labeled as relevant.
A subset of 279 pairs formed by pairing questions with answers of differing labels (relevant/irrelevant).
Labels: Positive labels indicate the relevancy of output to the given input.
Fine-Tuning Data
Sampling Sizes: Fine-tuning datasets are sampled at sizes of [120, 190, 250].
Distribution: A balanced 50% positive and 50% negative label distribution ensures that models are trained on equally weighted relevant and irrelevant examples (a minimal sampling sketch follows this list).
Objective: Investigate how the size of the fine-tuning dataset influences model performance.
Required fields:
input - one string to compare (e.g. input query)
output - second string to compare (e.g. generated response)
label - irrelevant (0), relevant (1)
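As a minimal sketch, such a balanced 50/50 sample could be drawn as follows, assuming the pairs live in a pandas DataFrame with the label column described above. This is illustrative only, not AutoEval's actual sampling code, and wikiqa_pairs is a hypothetical variable name.

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Sample n rows with an equal number of relevant (1) and irrelevant (0) labels."""
    pos = df[df["label"] == 1].sample(n // 2, random_state=seed)
    neg = df[df["label"] == 0].sample(n - n // 2, random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle

# e.g. balanced_sample(wikiqa_pairs, 120) for the smallest fine-tuning set
```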
Test Data
Size: 250 question-answer pairs.
Fine-tuned Batch Evaluation: Testing is conducted in batches of 25 samples to monitor performance trends across the dataset. Client-side batching was used.
Benchmarking models
Baseline models:
LastMile_Relevance (sentence-transformers cosine similarity; a minimal sketch of this baseline appears after this list)
Phoenix Arize - GPT-4o, GPT-3.5-turbo
RAGAS - GPT-4o, GPT-4o-mini
Fine-tuned models (alBERTa): LastMile_120, LastMile_190, LastMile_250
alBERTa fine-tuned with 120, 190 and 250 training samples, respectively.
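For reference, a minimal version of a sentence-transformers cosine-similarity relevance baseline is sketched below. The post does not specify which embedding model LastMile_Relevance uses, so the model name here is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is an assumed choice; the post does not name the embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def relevance_score(input_text: str, output_text: str) -> float:
    """Cosine similarity between input and output embeddings, used as a relevance proxy."""
    embeddings = model.encode([input_text, output_text], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Thresholding the score (e.g. at 0.5) turns it into a binary relevant/irrelevant prediction.
```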
Results
Accuracy: We observe that the model fine-tuned on 250 samples performs comparably to GPT-4o on RAGAS. However, GPT-4o on Phoenix achieves better accuracy than GPT-4o on RAGAS, which could be due to differences in prompting strategies. The dataset is also limited in size, which may affect the accuracy measurements.
Efficiency: As with faithfulness, the fine-tuned relevance models are significantly faster to run, despite being served on CPU. The discrepancy in gpt-4o wall-clock time across different evaluation libraries could be due to how those models are invoked, or other nuances of network load for API-mediated inference.
Summarization Score
Summarization Score evaluates the accuracy of AI-generated summaries by comparing document-summary pairs against human annotations. The task involves pairwise comparisons of document, correct_summary, and wrong_summary triples to assess whether predicted summaries align with ground truth annotations.
Dataset Details
Source: CNN/Daily Mail dataset
Data Overview:
Contains 1100 triples of document, correct_summary, and wrong_summary.
Focused on pairwise comparisons to determine alignment with human-annotated summaries.
Fine-Tuning Data
Sampling Sizes: Fine-tuning datasets are sampled at sizes of [100, 200, 400, 700].
Required fields:
output - LLM generated summary
ground_truth - source document (e.g. context used to generate output)
label - poor summary (0), good summary (1)
Test Data
Size: 690 labeled document-summary pairs.
Evaluation: The test dataset is used to benchmark the performance of both baseline and fine-tuned models.
Benchmarking models
Baseline models:
LastMile_Summarization (alBERTa)
Phoenix Arize - GPT-4o and GPT-4o-mini
Fine-tuned models (alBERTa): LastMile_100, LastMile_200, LastMile_400, LastMile_700
alBERTa fine-tuned with 100, 200, 400, and 700 training samples, respectively.
Results
Accuracy: We observe that fine-tuning yields only a marginal benefit over AutoEval's baseline Summarization score metric. Overall, using alBERTa for the summarization score metric significantly outperforms gpt-4o and gpt-4o-mini.
Efficiency: alBERTa's wall-clock inference efficiency continues to dramatically outperform LLMs. Since alBERTa is a 400M-parameter model, this is expected; however, the results are particularly impressive given that inference runs on CPU.
The benchmarking results for AutoEval demonstrate the impact of using small language models for common evaluation tasks.
Significant Performance Improvements through Fine-Tuning:
Fine-tuned models outperformed their baseline counterparts across all metrics, but the effect was most pronounced for Faithfulness.
Effectiveness of Balanced Training Data:
Ensuring an equal distribution of positive and negative samples during fine-tuning proved critical in maintaining model consistency.
This balanced approach minimized bias and enhanced the models’ ability to generalize across diverse inputs.
Efficiency of LastMile Models:
AutoEval’s fine-tuned LastMile models exhibited significantly lower inference times compared to baseline and competitor models, making them a practical choice for real-world applications.
This efficiency ensures that high-performance evaluation tasks can be executed without compromising speed.
Final thoughts
AutoEval's fine-tuning capability provides developers with the tools to create domain-specific evaluation metrics that are both accurate and efficient. This helps ensure that compound AI applications can be rigorously evaluated while maintaining practical usability.