Introducing AutoEval: the first evaluator model fine-tuning service

Written By
Sarmad Qadri

AutoEval is a platform for debugging, evaluating and improving AI applications. It is the only service that enables fine-tuning small language models for evaluation tasks, allowing developers to train custom metrics that are tailor-made for their AI application.

We are proud to launch AutoEval, the industry’s first Evaluator Fine-Tuning platform that enables developers to design custom metrics by training evaluator models. It solves a key problem in generative AI application development: having metrics that accurately represent your application's performance, which in turn enable rapid iteration and confidence in production.

AutoEval is already in use at several Fortune 500 enterprises across the finance, energy and media sectors to tailor their AI application development with evaluation metrics that best represent their domain and use-case.

AutoEval brings together a number of key innovations, including:

  • alBERTa 🍁: A 400M parameter small language model (SLM) that is optimized for evaluation tasks and designed for scale.

  • Efficient fine-tuning: Training infrastructure that lets you efficiently train as many evaluators as you need.

  • High-performance inference: Generate scores in < 300ms on CPU, enabling online use-cases, including guardrails.

  • LLM Judge++: Generate high-quality labels for application traces using LLM-as-a-judge evaluation with human-in-the-loop techniques.

Improving AI applications with Eval-Driven Development

Enterprise developers are using AutoEval to evaluate and improve compound AI applications like RAG pipelines and multi-agent systems. Understanding whether an AI application is performing well is deceptively difficult, and often requires defining key metrics:

  • Is the response faithful to the data provided? (hallucination detection)

  • Is the input relevant to the application's purpose? (input guardrail)

  • Does the response adhere to the company's brand voice? (custom tone evaluator)

Like test-driven development (TDD) in software, eval-driven development (EDD) in AI starts by defining success criteria upfront. With metrics in place, developers can experiment and iterate quickly and know if their application is getting better or worse over time.
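
For example, success criteria can be pinned down as threshold checks that run on every change, much like unit tests. Here is a minimal sketch, assuming a placeholder evaluate helper and illustrative thresholds rather than any specific AutoEval API:

```python
# Hypothetical eval-driven development check. The thresholds and the
# `evaluate` helper are illustrative placeholders, not part of the AutoEval API.
FAITHFULNESS_THRESHOLD = 0.8
RELEVANCE_THRESHOLD = 0.7


def evaluate(metric: str, example: dict) -> float:
    """Stand-in for whatever scoring backend you use; returns a 0-1 score."""
    raise NotImplementedError


def check_for_regressions(eval_set: list[dict]) -> None:
    # Fail the run if average scores drop below the agreed thresholds,
    # the same way a failing unit test blocks a merge.
    faithfulness = sum(evaluate("faithfulness", ex) for ex in eval_set) / len(eval_set)
    relevance = sum(evaluate("relevance", ex) for ex in eval_set) / len(eval_set)
    assert faithfulness >= FAITHFULNESS_THRESHOLD, f"faithfulness regressed: {faithfulness:.2f}"
    assert relevance >= RELEVANCE_THRESHOLD, f"relevance regressed: {relevance:.2f}"
```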

But defining these metrics is one of the most difficult steps in AI engineering today. AutoEval simplifies this into 3 steps:

1. Out-of-the-box metrics

Start with predefined metrics for common AI applications, such as faithfulness (hallucination detection), relevance, and toxicity.
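
Conceptually, scoring a trace with these predefined metrics looks something like the sketch below; the client object and its evaluate call are assumptions made for illustration, not the documented SDK surface.

```python
# Illustrative only: the client object and its `evaluate` call are assumptions
# made for this sketch, not the documented AutoEval SDK surface.
from dataclasses import dataclass


@dataclass
class Trace:
    query: str     # user input
    context: str   # retrieved documents shown to the model
    response: str  # model output to be scored


def score_with_builtin_metrics(client, trace: Trace) -> dict:
    # Each predefined metric returns a 0-1 score for the trace.
    return {
        "faithfulness": client.evaluate(metric="faithfulness",
                                        context=trace.context, output=trace.response),
        "relevance": client.evaluate(metric="relevance",
                                     query=trace.query, output=trace.response),
        "toxicity": client.evaluate(metric="toxicity", output=trace.response),
    }
```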

2. LLM Judge++

Ground truth data is often scarce. AutoEval helps you generate high-quality labels for application trace data: you specify your evaluation criteria in natural language (a prompt) and hand-label a few examples to teach the system your criteria.
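
A rough sketch of that flow, with an assumed judge.label call standing in for the real API:

```python
# Sketch of the LLM Judge++ labeling flow. The `judge.label` call and its
# arguments are assumptions for illustration, not the actual API.
criteria = (
    "Score 1 if the response only uses facts present in the provided context, "
    "0 if it introduces unsupported claims."
)

# A few hand-labeled examples calibrate the judge to your criteria.
seed_examples = [
    {"context": "Refunds are accepted within 30 days.",
     "response": "You can get a refund within 30 days.", "label": 1},
    {"context": "Refunds are accepted within 30 days.",
     "response": "Refunds are available for 90 days.", "label": 0},
]


def label_traces(judge, traces: list) -> list:
    # The judge applies the natural-language criteria, calibrated by the seed
    # examples, to produce labels for the unlabeled application traces.
    return [judge.label(trace, criteria=criteria, examples=seed_examples)
            for trace in traces]
```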

3. Fine-tune an alBERTa 🍁 evaluator SLM

Use the AutoEval fine-tuning service to train a small language model using the labeled dataset to learn your evaluation criteria. You can fine-tune as many evaluators as you want – one for each evaluation metric you need for your application.

You can then run the fine-tuned model as an offline eval, or as an online guardrail.
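
Putting the pieces together, the workflow might look like this sketch; the finetune, wait, and score methods are assumed names used to illustrate the steps, not the real fine-tuning API.

```python
# Sketch only: `finetune`, `wait`, and `score` are assumed method names used to
# illustrate the workflow, not the real AutoEval fine-tuning API.
def train_evaluator(client, labeled_dataset_id: str):
    # Launch a fine-tuning job on the labeled dataset from step 2; the result
    # is a custom evaluator for a single metric (e.g. brand tone).
    job = client.finetune(base_model="alBERTa",
                          dataset_id=labeled_dataset_id,
                          metric_name="brand_tone")
    return job.wait()


def online_guardrail(evaluator, response: str, threshold: float = 0.5) -> bool:
    # Online use: block any response whose 0-1 score falls below the threshold.
    return evaluator.score(output=response) >= threshold
```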

Some example evaluators that developers have trained using AutoEval:

  • A custom response quality metric that measures succinctness, clarity, and accuracy.

  • A custom correctness metric for tool use (function calling) in a multi-agent system.

  • A custom brand tone metric to measure LLM response adherence to the company's brand tone rubric.

Meet alBERTa 🍁: The engine powering AutoEval

alBERTa is a 400M parameter BERT model that delivers scalable, high-performance evaluation. It has been trained for Natural Language Inference (NLI) tasks and optimized for evaluation workloads. Designed as a probabilistic classifier, it generates precise 0-1 scores, making it ideal for computing metrics that drive your application’s success.
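
In practice, an NLI-style evaluation frames the source data as the premise and the model response as the hypothesis, with the classifier returning the probability that the response is supported. A rough sketch, where the score signature is an assumption:

```python
# Rough sketch of the NLI framing: source data is the premise, the model
# response is the hypothesis, and the classifier returns the probability (0-1)
# that the response is supported. The `score` signature is an assumption.
def is_faithful(evaluator, context: str, response: str, threshold: float = 0.7) -> bool:
    p_supported = evaluator.score(premise=context, hypothesis=response)
    return p_supported >= threshold
```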

Key value props:

  • Compact: 400M parameters, perfect for efficient deployment

  • Blazing-fast: inference in under 300ms on CPU

  • Customizable: fine-tune for any evaluation task

  • Enterprise-ready: fully self-hostable for secure VPC deployment

Enterprise VPC deployment-ready

AutoEval is already in use at several Fortune 500 enterprises, and can be deployed into a Virtual Private Cloud (VPC) on AWS, Azure, or Google Cloud, or into an on-prem environment, for added security and privacy guarantees.

To learn more about enterprise deployments for your business, please contact sales@lastmileai.dev.

Get Started with AutoEval

AutoEval is now available on https://lastmileai.dev. It is free to get started, run evals, generate labels with LLM Judge++, fine-tune custom metrics, and create guardrails.

The platform is accessible via both the UI and the API, including Python and TypeScript client SDKs.

Read our Getting Started guide and start using AutoEval in under 5 minutes.

"Evals are surprisingly often all you need"

Greg Brockman, Co-Founder, OpenAI
