Introducing AutoEval Experiments
Experimentation is the key to building production-ready generative AI applications. With Experiments in AutoEval, you can track and manage changes to your application, compare performance across configurations, and validate improvements.
Building high-performing generative AI applications requires constant iteration. Small changes—whether switching models, modifying retrieval strategies, or refining system prompts—can significantly impact your application’s performance. But without a structured approach to experimentation, teams risk making changes based on intuition rather than measurable improvements.
Experiments provide an intuitive, systematic way to track, compare and optimize your AI applications. You can test different configurations and measure performance with precision, using our evaluation models or fine-tuning your own evaluation metrics.
Why Experimentation Matters
AI applications are growing increasingly sophisticated in their architecture, involving more infrastructure and configurable components that can each have a significant impact on cost, latency and application behavior. Which LLM you use, your retrieval and data processing strategies, and how prompts are structured at each step all affect the speed and intelligence of your application.
Due to the number of configurable parameters, it’s difficult to determine which combination of changes is actually improving the overall performance of the system. Experimentation allows you to go beyond trial and error by testing hypotheses in a controlled, data-driven way.
With AutoEval Experiments, you can:
Compare Model Versions: Evaluate how swapping model versions or model providers impacts performance. For example, GPT-4o vs. GPT-4.5, or o1 vs. Claude 3.7 (see the sketch after this list).
Optimize Retrieval Strategies for RAG: Test different chunking and retrieval methods in retrieval-augmented generation (RAG) pipelines.
Refine Agent Configuration: Assess how prompt engineering adjustments affect faithfulness, completeness and response quality.
360° View with AutoEval: Use AutoEval’s suite of out-of-the-box metrics, such as faithfulness and relevance, to get a holistic view of your application’s performance with each experiment.
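For example, the model-version comparison above might look roughly like the sketch below. It uses only the create_experiment and evaluate_dataset calls introduced in the Getting Started section later in this post; the experiment name, metric choices and dataset ID placeholders are illustrative, not prescribed by the SDK.

from lastmile.lib.auto_eval import AutoEval, BuiltinMetrics

client = AutoEval(api_token="YOUR_API_TOKEN")

# One experiment groups all the runs you want to compare.
experiment = client.create_experiment(
    name="Model comparison",
    description="GPT-4o vs. GPT-4.5 on the same test set",
    metadata={"test_suite": "customer_support_v1"},  # hypothetical label
)

# Hypothetical placeholders: one uploaded dataset of application outputs
# per model under test.
runs = {
    "gpt-4o": GPT4O_DATASET_ID,
    "gpt-4.5": GPT45_DATASET_ID,
}

for model, dataset_id in runs.items():
    # Tag each evaluation run with the configuration under test so runs can
    # be compared side by side within the experiment.
    client.evaluate_dataset(
        dataset_id=dataset_id,
        metrics=[BuiltinMetrics.FAITHFULNESS, BuiltinMetrics.RELEVANCE],
        experiment_id=experiment.id,
        metadata={"model": model},
    )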
By structuring your experiments with AutoEval, you can ensure that every change you make leads to measurable, data-backed improvements.
Getting Started with Experiments
Follow the Experiments cookbook to get started quickly.
If this is your first time using AutoEval, follow our Quickstart guide to set up your account and get a free API key. Then, you can log experiments via the lastmile SDK:
from lastmile.lib.auto_eval import AutoEval, BuiltinMetrics

client = AutoEval(api_token="YOUR_API_TOKEN")

experiment = client.create_experiment(
    name="Experiment #1",
    description="My first experiment",
    metadata={
        # Record metadata about the experiment here
        "dataset_id": dataset_id,  # ID of a dataset you've already uploaded to AutoEval
        "dataset_version": "0.1.0",
    }
)

metrics = [
    BuiltinMetrics.FAITHFULNESS,
    BuiltinMetrics.RELEVANCE,
]

print("Evaluation job kicked off")
evaluation_results = client.evaluate_dataset(
    dataset_id=dataset_id,
    metrics=metrics,
    experiment_id=experiment.id,
    metadata={
        # Set your model, temperature and any other app metadata for the eval run
        "model": MODEL_ID,
        "temperature": TEMPERATURE,
        "retrieval_alg": "cosine_similarity",
    }
)

print("Evaluation Results:")
evaluation_results.head(10)
Compare experiment runs to identify the best-performing configuration of your AI application.
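If you prefer to compare runs in code rather than in the Experiments Console, here is a rough sketch. It assumes the results returned by evaluate_dataset behave like pandas DataFrames (the .head(10) call above suggests they do) and uses illustrative metric column names ("Faithfulness", "Relevance") that you should replace with the actual columns in your results.

import pandas as pd

# results_gpt4o and results_gpt45 are the DataFrames returned by two
# evaluate_dataset() calls, one per configuration, under the same experiment.
results_gpt4o["run"] = "gpt-4o"
results_gpt45["run"] = "gpt-4.5"

combined = pd.concat([results_gpt4o, results_gpt45])

# Average each metric per run to see which configuration scores higher.
# The column names are assumptions; inspect combined.columns for the real ones.
print(combined.groupby("run")[["Faithfulness", "Relevance"]].mean())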
Drill down into individual experiment runs to identify performance bottlenecks with specific examples.
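A similar sketch works for drilling into a single run from code, under the same DataFrame and column-name assumptions: surface the lowest-scoring rows so you can inspect the specific inputs and outputs behind a weak aggregate score.

# "Faithfulness" is an assumed column name; check evaluation_results.columns.
worst_examples = evaluation_results.sort_values("Faithfulness").head(5)
print(worst_examples)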
🚀 If you’re ready to dive into experimentation, check out the experimentation documentation and full cookbook. Experiments unlock the full potential of the AutoEval platform:
Real-time Insights: Instantly see how each experiment affects model performance.
Comprehensive Tracking: Keep a detailed log of model versions, retrieval strategies, chunk sizes, dataset versions and more.
Seamless Integration: Define, execute and compare experiments directly via the AutoEval SDK or Experiments Console.
If you have questions or feedback, reach out to us at team@lastmileai.dev. We’re here to help and always excited to talk with users!
Happy Experimenting!