AutoEval Fine-tuning Benchmarks
AutoEval is a cutting-edge platform designed to help developers debug, evaluate, and enhance AI applications. It stands out as the only service that enables fine-tuning of small language models specifically for evaluation tasks. This unique capability allows developers to create custom metrics tailored precisely to their AI applications, so they can evaluate performance efficiently.
In this post, we present benchmarks of fine-tuning alBERTa models for specific evaluation metrics. These benchmarks showcase the value of fine-tuning and give some insight into how much data is required to achieve highly accurate custom metrics.
We measure two key dimensions of the fine-tuned models: accuracy and speed (inference time), and demonstrate that you can achieve better accuracy at 10x the speed. For example, for faithfulness, we can achieve 91-96% accuracy with dramatically faster and cheaper inference than an LLM Judge using gpt-4o:
At its core, AutoEval solves a critical challenge in generative AI development: ensuring that evaluation metrics accurately reflect the performance of an AI application. With these metrics, developers can iterate quickly and deploy their applications with confidence.
AutoEval is designed to assess the quality of results from complex AI applications, such as Retrieval-Augmented Generation (RAG) pipelines and agent systems. It evaluates whether these applications are performing well using a comprehensive suite of metrics. These metrics include Faithfulness, Relevance, and Summarization, each targeting a specific aspect of the system's performance.
You can get started with AutoEval from https://lastmileai.dev.
Rigorous benchmarking provides confidence in the reliability and accuracy of AutoEval’s fine-tuned models, ensuring they deliver consistent, domain-specific insights for evaluating AI applications.
Performance Validation: It ensures that the fine-tuned models meet the desired predictive performance standards for their intended evaluation tasks.
Efficiency Assessment: It evaluates the efficiency of pre-trained models and tracks the impact of fine-tuning on adapting these models to specific evaluation criteria.
Baseline Comparison: By comparing the performance of fine-tuned models against baseline models, benchmarking helps identify improvements or regressions in the evaluation capabilities.
Overview
We present performance and efficiency benchmarks for three key metrics: Faithfulness, Summarization and Relevance. This evaluation involves two primary components:
Baseline Models:
For each criterion, baseline models are selected based on the availability of competitive solutions. The purpose of this comparison is to establish a performance baseline for each metric, assessing how AutoEval's baseline models stand against existing industry benchmarks. These baselines include:
LLMs typically used for LLM Judge evals, such as GPT-4o
LastMile's alBERTa model (AutoEval's pre-trained model)
Fine-Tuned Models:
Models fine-tuned on AutoEval that are trained for each metric using curated datasets. This step explores the impact of fine-tuning on the performance of evaluative models. Specifically, it measures how fine-tuning influences predictive accuracy and task-specific alignment compared to both the baseline AutoEval models and external competitors.
Benchmarking Setup
We evaluate performance across multiple metrics using well-known benchmarking datasets. For each evaluation metric, the process is designed to analyze performance across Baseline Models and Fine-Tuned Models described above.
The number of training and testing samples varies for each dataset based on available data, and fine-tuning datasets are sampled at multiple sizes. The aim is to investigate whether the size of fine-tuning data influences the models' predictive accuracy and efficiency.
Datasets and benchmarking code are available upon request.
Evaluation Metrics
Accuracy: Measures the correctness of pairwise comparisons for each metric using the test (holdout) data.
Inference Efficiency: This metric measures the CPU wall-clock time required to process the test dataset, with the total time expressed in seconds. Specifically, for the fine-tuned models and the LastMile baseline, the evaluation function processes the test data in batches of 25 samples, passing each batch to the API for inference. The other baseline models take the data as a dataframe and return the results.
Inference time is measured as CPU wall-clock time, using time.time() around the call that returns all results. This provides a plausible estimate of the total elapsed time (including network latency) for inference, capturing the time taken to handle all test examples end-to-end.
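For reference, the timing harness is conceptually similar to the sketch below. The evaluate_batch function is a placeholder for the actual API call (not the real AutoEval SDK interface), and the batch size of 25 matches the setup described above.

```python
import time

BATCH_SIZE = 25  # batch size used for the fine-tuned and LastMile models

def timed_inference(test_rows, evaluate_batch):
    """Run batched inference and return (predictions, elapsed wall-clock seconds).

    `evaluate_batch` is a placeholder for the API call that scores a list of rows;
    the real AutoEval client interface may differ.
    """
    predictions = []
    start = time.time()  # CPU wall-clock start
    for i in range(0, len(test_rows), BATCH_SIZE):
        batch = test_rows[i : i + BATCH_SIZE]
        predictions.extend(evaluate_batch(batch))  # includes network latency
    elapsed = time.time() - start  # end-to-end elapsed time in seconds
    return predictions, elapsed
```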
Faithfulness
Faithfulness is measured by assessing the probability (0 → 1) of how well an LLM-generated output adheres to the given context or ground truth. The key question addressed is: To what extent does the generated output remain faithful to the provided information without introducing inaccuracies or hallucinations?
Dataset Details
Source: HaluEval Benchmark
Fine-Tuning Data
Sampling Sizes: Fine-tuning datasets are sampled at sizes of [100, 200, 400, 1000].
Required fields:
input - e.g. user query
ground_truth - context used to generate output
output - LLM response
label - 0 = Hallucination (Not Faithful), 1 = Factual (Faithful)
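For concreteness, a single fine-tuning record with these fields might look like the example below. The values are invented for illustration (not drawn from HaluEval), and the exact file format AutoEval accepts may differ.

```python
# Illustrative faithfulness fine-tuning record; field names from the post, values invented.
example_row = {
    "input": "When was the Eiffel Tower completed?",            # user query
    "ground_truth": "The Eiffel Tower was completed in 1889.",  # context used to generate output
    "output": "The Eiffel Tower was finished in 1925.",         # LLM response
    "label": 0,  # 0 = Hallucination (Not Faithful), 1 = Factual (Faithful)
}
```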
Test Data
Size: A fixed set of 1000 samples is used for benchmarking both baseline and fine-tuned models to maintain consistency.
Evaluation: The test dataset is used to benchmark model performance for baselines and fine-tuned models.
Benchmarking models
Baseline models:
LastMile_Faithfulness (alBERTa)
Phoenix Arize - GPT-4o and GPT-4o-mini
Fine-tuned models (alBERTa): LastMile_100, LastMile_200, LastMile_400, LastMile_1000
alBERTa fine-tuned with 100, 200, 400, and 1000 training samples, respectively.
Results
Accuracy: We observe that fine-tuning with even a small amount of data (100 samples), yields > 90% metric accuracy, surpassing gpt-4o. Fine-tuning with 1000 samples yields 96% accuracy, dramatically better than 88% with gpt-4o.
An interesting observation is that increasing the number of samples from 100 to 200 does not result in a significant improvement in the model's performance. This may occur for a few reasons:
Challenging Samples and distribution shift: Introducing new training samples that are more difficult or complex to learn may initially degrade the model's performance. These challenging samples can cause shifts from the existing data distribution, making it harder for the model to generalize effectively.
Need for Similar Samples: If the new samples are challenging, the model may require additional, similar samples to stabilize and refine its parameters.
Efficiency: Across all fine-tuned models, inference speed is 10x better than gpt-4o-mini, and 13x better than gpt-4o. This is despite the fine-tuned models being served on CPU.
Relevance
The relevance metric measures the semantic similarity between an input string and an output string, for example:
Answer/Retrieved Context relevancy.
Question/Answer relevancy.
For benchmarking, we measure whether the model correctly predicts pairwise comparisons of relevance evaluation values for (input, output, label) triples. Each input is paired with an output labeled as relevant (1) or irrelevant (0).
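One plausible way to compute this pairwise accuracy is sketched below; score is a stand-in for any relevance scorer (a fine-tuned alBERTa model, an LLM judge, or cosine similarity) rather than a specific AutoEval API.

```python
def pairwise_accuracy(triples, score):
    """Fraction of (question, relevant_answer, irrelevant_answer) triples ranked correctly.

    `score(input_text, output_text)` is a placeholder for any relevance scorer that
    returns a higher value for more relevant outputs.
    """
    correct = 0
    for question, relevant_answer, irrelevant_answer in triples:
        if score(question, relevant_answer) > score(question, irrelevant_answer):
            correct += 1
    return correct / len(triples)
```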
Dataset Details
Source: WikiQA dataset – WikiQA-test.
Data Overview:
6165 question-answer pairs.
293 rows labeled as relevant.
A subset of 279 pairs formed by pairing questions with answers of differing labels (relevant/irrelevant).
Labels: Positive labels indicate the relevancy of output to the given input.
Fine-Tuning Data
Sampling Sizes: Fine-tuning datasets are sampled at sizes of [120, 190, 250].
Distribution: A balanced 50% positive and 50% negative label distribution ensures that models are trained on equally weighted relevant and irrelevant examples (a minimal sampling sketch follows this list).
Objective: Investigate how the size of the fine-tuning dataset influences model performance.
Required fields:
input - one string to compare (e.g. input query)
output - second string to compare (e.g. generated response)
label - irrelevant (0), relevant (1)
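As a minimal sketch, such a balanced 50/50 sample could be drawn as follows, assuming the pairs live in a pandas DataFrame with the label column described above. This is illustrative only, not AutoEval's actual sampling code, and wikiqa_pairs is a hypothetical variable name.

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Sample n rows with an equal number of relevant (1) and irrelevant (0) labels."""
    pos = df[df["label"] == 1].sample(n // 2, random_state=seed)
    neg = df[df["label"] == 0].sample(n - n // 2, random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle

# e.g. balanced_sample(wikiqa_pairs, 120) for the smallest fine-tuning set
```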
Test Data
Size: 250 question-answer pairs.
Fine-tuned Batch Evaluation: Testing is conducted in batches of 25 samples to monitor performance trends across the dataset. Client-side batching was used.
Benchmarking models
Baseline models:
LastMile_Relevance (sentence-transformers cosine similarity; a minimal sketch of this baseline appears after this list)
Phoenix Arize - GPT-4o, GPT-3.5-turbo
RAGAS - GPT-4o, GPT-4o-mini
Fine-tuned models (alBERTa): LastMile_120, LastMile_190, LastMile_250
alBERTa fine-tuned with 120, 190 and 250 training samples, respectively.
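For reference, a minimal version of a sentence-transformers cosine-similarity relevance baseline is sketched below. The post does not specify which embedding model LastMile_Relevance uses, so the model name here is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is an assumed choice; the post does not name the embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def relevance_score(input_text: str, output_text: str) -> float:
    """Cosine similarity between input and output embeddings, used as a relevance proxy."""
    embeddings = model.encode([input_text, output_text], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Thresholding the score (e.g. at 0.5) turns it into a binary relevant/irrelevant prediction.
```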
Results
Accuracy: We observe that the model fine-tuned on 250 samples performs comparably to GPT-4o on RAGAS. However, GPT-4o on Phoenix achieves better accuracy than GPT-4o on RAGAS, which could be due to differences in prompting strategies. The dataset is also limited in size, which may affect the accuracy measurements.
Efficiency: As with faithfulness, the fine-tuned relevance models are significantly faster to run, despite being served on CPU. The discrepancy in gpt-4o wall-clock time across different evaluation libraries could be due to how those models are invoked, or other nuances of network load for API-mediated inference.
Summarization Score
Summarization Score evaluates the accuracy of AI-generated summaries by comparing document-summary pairs against human annotations. The task involves pairwise comparisons of document, correct_summary, and wrong_summary triples to assess whether predicted summaries align with ground truth annotations.
Dataset Details
Source: CNN/Daily Mail dataset
Data Overview:
Contains 1100 triples of document, correct_summary, and wrong_summary.
Focused on pairwise comparisons to determine alignment with human-annotated summaries.
Fine-Tuning Data
Sampling Sizes: Fine-tuning datasets are sampled at sizes of [100, 200, 400, 700].
Required fields:
output - LLM generated summary
ground_truth - source document (e.g. context used to generate output)
label - poor summary (0), good summary (1)
Test Data
Size: 690 labeled document-summary pairs.
Evaluation: The test dataset is used to benchmark the performance of both baseline and fine-tuned models.
Benchmarking models
Baseline models:
LastMile_Summarization (alBERTa)
Phoenix Arize - GPT-4o and GPT-4o-mini
Fine-tuned models (alBERTa): LastMile_100, LastMile_200, LastMile_400, LastMile_700
alBERTa fine-tuned with 100, 200, 400, and 700 training samples, respectively.
Results
Accuracy: We observe that fine-tuning yields only a marginal benefit over AutoEval's baseline Summarization score metric. Overall, using alBERTa for the summarization score metric significantly outperforms gpt-4o and gpt-4o-mini.
Efficiency: alBERTa's wall-clock inference efficiency continues to dramatically outperform LLMs. Since alBERTa is a 400M-parameter model, this is expected; however, the results are particularly impressive given that inference runs on CPU.
The benchmarking results for AutoEval demonstrate the impact of using small language models for common evaluation tasks.
Significant Performance Improvements through Fine-Tuning:
Fine-tuned models outperformed their baseline counterparts across all metrics, but the effect was most pronounced for Faithfulness.
Effectiveness of Balanced Training Data:
Ensuring an equal distribution of positive and negative samples during fine-tuning proved critical in maintaining model consistency.
This balanced approach minimized bias and enhanced the models’ ability to generalize across diverse inputs.
Efficiency of LastMile Models:
AutoEval’s fine-tuned LastMile models exhibited significantly lower inference times compared to baseline and competitor models, making them a practical choice for real-world applications.
This efficiency ensures that high-performance evaluation tasks can be executed without compromising speed.
Final thoughts
AutoEval's fine-tuning capability provides developers with the tools to create domain-specific evaluation metrics that are both accurate and efficient. This helps ensure that compound AI applications can be rigorously evaluated while maintaining practical usability.