Trustworthy AI Search in the Generative AI Era: Building and Evaluating a Multi-Agent RAG System at Scale
A Joint Case Study by the Bertelsmann AI Hub & LastMile AI
Powerful multi-agent systems are no longer hard to prototype, but they are still hard to trust. Thanks to frameworks like LangChain, building a first draft or demo of a system that pulls proprietary information from your own databases or takes actions on your behalf has become straightforward and fast. Making these systems bulletproof and consumer-facing is much tougher: LLMs are still stochastic and sometimes hard to control. How can we know that our agentic systems actually do the job we want them to do? How can we monitor and test them effectively? How can we build trust, both for developers and eventually for consumers? And how do we do all of this efficiently and at scale?
Here, we describe how we addressed many of these challenges in a state-of-the-art multi-agent system developed by Bertelsmann, a global leader in the media industry. Bertelsmann’s use case, an agentic content search platform for its creatives, presented the perfect proving ground for LastMile’s agentic and evaluation infrastructure.
Together, we focused on enabling real-time, cost-effective, and context-aware evaluation—not just to measure performance, but to accelerate it. LastMile contributed compact, self-hosted models tailored to Bertelsmann’s use case, powering automated metrics like relevance, faithfulness, and overall quality. This helped Bertelsmann identify and fix routing issues, surface hallucinations, improve agent selection logic, and unlock new capabilities like real-time evals and targeted active learning.
This allowed us to make progress on crucial questions that arise when developing agentic and retrieval-augmented generation (RAG) systems, including:
How can we confidently measure the quality of answers for diverse domains?
How can we reduce cost and latency of these evaluations and agentic systems without sacrificing accuracy?
How do we integrate real-world feedback and continuously improve?
The following post outlines in detail how we arrived at solutions to these questions. In doing so, we want to help AI practitioners build better agentic systems themselves.
Bertelsmann is one of the world’s largest media companies, with subsidiaries spanning the world’s biggest book publisher Penguin Random House, European television powerhouse RTL, music label BMG, education, and digital content. In the last year alone, Bertelsmann’s products have won Pulitzers, Grammys, Emmys, and Oscars. But with that scale comes fragmentation: data lives in silos, with each division running its own systems, formats, and platforms. For example, Penguin Random House manages its book catalog through one stack, RTL has an entirely different setup for streaming metadata, and news content might sit in yet another system. There is no single “Bertelsmann database.”
So what happens when a creative or researcher at Bertelsmann wants to answer a simple question like:
“What kind of content do we have on Barack Obama?”
The answer might live in dozens of places: the Obama biographies we published via Penguin, documentaries available via our news channels, podcasts available on our streaming platforms, and even third-party commentary from the open web. Finding this content used to mean knowing where to look—and having access to each system.
The Bertelsmann Content Search changes that. Built as a multi-agent system by the Bertelsmann AI Hub, the platform allows Bertelsmann’s creatives to ask natural language questions and receive unified, trustworthy answers—without needing to know which system holds what. Behind the scenes, a router directs queries to specialized agents, each responsible for searching a specific domain. One agent might dig into the RTL archives, another into PRH’s book catalog, while a third checks external sources for real-time web trends.
Each agent returns its own answer; these are then distilled into a single, coherent response. The user sees just one clean answer, but behind it lies a distributed orchestration of knowledge retrieval across the entire Bertelsmann ecosystem.
This agentic design makes it possible to surface the right information—across diverse formats, brands, and platforms—without centralizing all data. It empowers creatives, marketers, and producers to work faster, stay informed, and make better content decisions.
But with this flexibility comes complexity. As we’ll explore, ensuring consistent answer quality, surfacing relevant results, and maintaining trust across such a decentralized system required robust evaluation infrastructure—where LastMile’s tools made the difference.
Tackling the Core Challenges
Before launching the Bertelsmann Content Search, it was clear that building a multi-agent system wasn’t just a technical challenge—it was a trust challenge. With so many components working together across different data domains, the risks were real: poorly routed queries, hallucinated answers, inconsistent agent behavior, and a lack of reliable feedback signals for improvement. We built a comprehensive evaluation process that could monitor and improve performance across the entire pipeline. Together, we defined key evaluation metrics, developed methods to compute them at a trace level, and implemented fast, cost-effective models that made real-time evaluation and system improvement possible. Here are some of the highlights:
Improved Tool Call Accuracy Through Better Agent Routing
Implemented an enhanced agent selection router model that intelligently routes queries to the most relevant agents.
Cost-Effective Evaluation at Scale with Self-Hosted Models
Trained and deployed compact 400M parameter alBERTa models optimized for CPU-based inference.
Enables cost-effective, real-time evaluation of system outputs in scenarios where evals are operated at scale.
Specialized models for key eval metrics, including Long-Document Faithfulness
Developed custom evaluation metrics and novel techniques to compute them over 128k tokens of context.
Lifted faithfulness AUC score from 0.71 to 0.84+, enabling the system to reliably detect hallucinations or inconsistencies.
Accelerated Model Improvement via Targeted Sampling and Active Learning
Established a data-driven sampling process to surface the most impactful areas for manual inspection.
Identified key product aspects requiring further development to meet performance goals.
Generated 5,000+ high-quality labeled datapoints in just days by combining LLM-based weak labeling with human-in-the-loop active learning.
Continuous active learning of evaluation metrics to power a virtuous cycle of targeted sampling and improvement.
Unlocked New Capabilities Through Real-Time Evaluation
Deployed evaluation models for online inference.
While we have not built these out yet, online evaluation opens the door to possibilities such as the following (a minimal guardrail sketch follows this list):
Automated guardrails to catch low-quality or sensitive outputs before they reach end users.
Improved agent selection by dynamically routing based on predicted faithfulness scores.
Enhanced response generation by filtering or reweighting agent outputs based on evaluation metrics.
Evaluation-driven fine-tuning of the underlying retrieval and generation models.
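For illustration, here is a minimal sketch of what such a guardrail could look like, assuming a faithfulness scorer served by a hosted evaluator model; the lexical-overlap scorer and threshold below are placeholders, not the production implementation.

```python
# Minimal guardrail sketch: block or reroute low-faithfulness answers before
# they reach the user. `score_faithfulness` is a placeholder for real-time
# inference against a hosted evaluator model; the threshold is illustrative.
from dataclasses import dataclass

FAITHFULNESS_THRESHOLD = 0.7  # placeholder; tune against labeled traces


@dataclass
class AgentAnswer:
    agent: str
    answer: str
    context: str  # retrieved documents the answer should be grounded in


def score_faithfulness(answer: str, context: str) -> float:
    """Placeholder scorer (lexical overlap) standing in for the evaluator model."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def apply_guardrail(candidates: list[AgentAnswer]) -> AgentAnswer | None:
    """Keep only answers the evaluator considers grounded and return the best one."""
    scored = [(score_faithfulness(c.answer, c.context), c) for c in candidates]
    grounded = [(s, c) for s, c in scored if s >= FAITHFULNESS_THRESHOLD]
    if not grounded:
        return None  # fall back to a safe "no reliable answer found" response
    return max(grounded, key=lambda sc: sc[0])[1]
```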
We designed Content Search as a multi-agent system. An incoming query passes through several steps (a simplified sketch follows this list):
Router: A specialized orchestration router (the “mailman”) dispatches incoming queries to relevant domain-specific agents.
Domain Agents: Each agent is focused on a specific data source, usually corresponding to a content vertical of a Bertelsmann subsidiary with its own data warehouses and APIs. Example agents include:
Penguin Random House (PRH) agents for books,
RTL agents for information on RTL TV shows and media,
Web search agent with internet access to cover real-time information such as trends and outside views.
Each agent receives the incoming query and retrieves relevant information from its data warehouse to help answer the question. The agent loops over multiple tool calls and eventually coalesces a response.
Summarizer: Collects responses from all agents, combines them, and delivers a single final answer to the user.
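To make the control flow concrete, here is a heavily simplified sketch of the router, domain agents, and summarizer. The keyword router, agent stubs, and string-joining summarizer are placeholders; in production the router is a trained classifier, each agent loops over real tool calls against its data warehouse, and the summarizer is an LLM.

```python
# Simplified sketch of the router -> domain agents -> summarizer control flow.
# All agents, routing rules, and the summarizer are illustrative stand-ins.
from typing import Callable

DomainAgent = Callable[[str], str]  # query -> agent answer


def prh_agent(query: str) -> str:
    return f"[PRH] book catalog results for: {query}"


def rtl_agent(query: str) -> str:
    return f"[RTL] show and media results for: {query}"


def web_agent(query: str) -> str:
    return f"[Web] real-time web results for: {query}"


AGENTS: dict[str, DomainAgent] = {"prh": prh_agent, "rtl": rtl_agent, "web": web_agent}


def route(query: str) -> list[str]:
    """The 'mailman': pick the domain agents that should see the query.
    In production this is a trained classifier; here it is a keyword stub."""
    keywords = {"prh": "book", "rtl": "show"}
    selected = [name for name, kw in keywords.items() if kw in query.lower()]
    return selected or list(AGENTS)  # if unsure, fan out to every agent


def summarize(answers: dict[str, str]) -> str:
    """Stand-in for the LLM summarizer that distills agent answers into one response."""
    return "\n".join(f"{name}: {text}" for name, text in answers.items())


def answer_query(query: str) -> str:
    answers = {name: AGENTS[name](query) for name in route(query)}
    return summarize(answers)


print(answer_query("What kind of content do we have on Barack Obama?"))
```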
This approach excels at handling heterogeneous data because each agent can be fine-tuned or configured for a unique dataset or task. However, it also raises complexity around ensuring consistent quality: if one agent underperforms or provides incomplete context, the entire user experience can suffer.
When developing such a complex system, we faced several challenges:
Minimal User Feedback: As a new product, particularly before launch, the platform lacked large-scale user logs to inform or validate system performance.
Diverse Content & Metrics: The data distribution was vast. Queries ranged from licensing and author rights to TV broadcast schedules, requiring a flexible but robust evaluation framework. This is particularly severe for decentralized data sources, where an agent interacts with an API rather than a central vector database.
Cost & Latency Concerns: Using large language models (LLMs) as “judges” for every single output would have been prohibitively expensive and slow for real-time applications.
Enforcing Tool Calls: Some questions would be answered by the LLM directly, without invoking the RAG tool calls. This leads to sub-optimal results: when asked “Who is the CEO of a given Bertelsmann Division?”, the LLM may respond with knowledge acquired during training, which might no longer be up to date. One mitigation is to require a tool call at the API level, as sketched below.
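A minimal sketch of that mitigation, assuming an OpenAI-style tool-calling API; the search_division_data tool is a hypothetical stand-in for an agent’s real retrieval tool.

```python
# Force the model to retrieve instead of answering from (possibly stale)
# parametric knowledge. Assumes an OpenAI-style chat completions client;
# the tool definition below is a hypothetical stand-in.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_division_data",  # hypothetical RAG tool
        "description": "Search the division's data warehouse for up-to-date facts.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Who is the CEO of a given Bertelsmann division?"}],
    tools=tools,
    tool_choice="required",  # the model must call a tool before answering
)
```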
AI applications are getting increasingly complex and distributed. These compound AI systems make evaluation more important than ever and central to the developer workflow. Eval-driven development is to AI engineering what test-driven development is to software engineering.
Crucially, it is often not enough to evaluate end-to-end system performance alone. Measuring the performance of the individual steps is essential in order to identify bottlenecks that require closer inspection.
In addition, we had these challenges:
Uncertainty Around Performance: No established benchmark or historical user data existed to measure “correctness” for highly specialized queries.
Lack of Domain-Labeled Data: Each Bertelsmann subsidiary has distinct data formats and terminologies, complicating the creation of a unified ground truth.
LLM Judges: Even though LLM judges were a good starting point, they did not capture the nuances of Bertelsmann entities, and they remain expensive to operate at scale.
Real-Time and Offline Needs: We required both batch evaluation (to assess performance on large datasets) and real-time scoring (to power guardrails and adaptive routing).
LastMile AI Collaboration
LastMile AI specializes in helping enterprises deploy AI applications to production with confidence. Their AutoEval platform supports AI governance and evaluation by providing:
Metric Fine-tuning service for developing custom evaluation metrics.
Compact Model Architectures optimized for CPU-based inference.
Real-Time Hosting & Monitoring for both offline and live use.
Active Learning pipeline to refine labels on the “edge cases” most critical to model performance.
Our collaboration with the Bertelsmann AI Hub resulted in developing a cost-effective, high-accuracy evaluation system that could scale across multiple agents and data silos.
To solve the challenge of evaluating this complex system, we worked on innovative techniques to evaluate individual agent performance, as well as end-to-end performance.
All of this requires data, and so we broke the work down into the following steps:
Metric Definition: As a first step, discuss which metrics hold the highest business value and how to observe them.
Weak Labeling: Use GPT-4-based “judges” to label query–response pairs, creating a large (but noisy) dataset quickly (sketched after this list).
Fine-Tuning: Train 400M-parameter BERT models on these labels for relevance, faithfulness, and product-specific quality criteria.
Active Learning: Identify samples where the model disagrees with the LLM or is highly uncertain, then forward them to domain experts for manual review.
Retraining & Deployment: Continuously refine the models and deploy them via Docker containers within the Bertelsmann Azure environment for batch and real-time scoring.
Back to 1, rinse and repeat.
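To illustrate the weak-labeling step, here is a minimal sketch of an LLM judge scoring query, context, and answer triples against a simple rubric; the prompt, JSON schema, and model name are placeholders, and the production rubric was codified together with domain experts.

```python
# Weak labeling sketch: an LLM judge produces noisy relevance/faithfulness
# labels to bootstrap training data for the compact evaluator models.
# Prompt, schema, and model name are illustrative placeholders.
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the answer on two criteria, each as 0 or 1:
- relevance: does the answer address the query?
- faithfulness: is every claim in the answer supported by the retrieved context?
Return JSON of the form {{"relevance": 0, "faithfulness": 0}}.

Query: {query}
Context: {context}
Answer: {answer}"""


def weak_label(query: str, context: str, answer: str) -> dict:
    """Return noisy labels for one trace; collect these in bulk, then refine."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # any GPT-4-class judge
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            query=query, context=context, answer=answer)}],
    )
    return json.loads(completion.choices[0].message.content)
```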
This workflow gave us continuously improving metric quality and a continuously growing amount of high-quality data.
While defining metrics sounds straightforward, it is not. Several factors usually contribute to a single rating, giving rise to plenty of edge cases, and interactions between multiple metrics can be tricky. How individual metrics are defined has a huge impact on the success of the machine learning step: as with any machine learning project, a well-posed learning task and well-defined labels are crucial.
Our recommended approach is to get domain experts on board as early as possible and manually annotate some data, even before training the first model. Think about why a trace is assigned a certain label and try to generalize. Also make sure that each metric has a single purpose, and avoid conflicts between multiple metrics.
The key breakthrough was codifying the criteria by which we judged quality. This allowed humans to label data and assess the quality of the AI evaluators consistently, and it helped define quality across the board.
Fun Fact: 5,000+ Labels in Days
By employing GPT-4 for initial labeling, the joint team generated over 5,000 labeled examples in just a few days—an effort that would otherwise have taken weeks or months with purely human-driven labeling. Although these labels contained some noise, they served as a jumpstart for training compact evaluator models. Unrealistic traces were excluded during the human-in-the-loop refinement.
Human-In-The-Loop Refinement
A key insight was to use active learning to ensure domain experts only spent time where it mattered most. The system flags high-uncertainty samples or disagreements between the BERT model and the LLM-based labels, dramatically increasing the efficiency of human labeling. Each labeling pass led to noticeable AUC gains—in early stages often 15–20 percentage points per iteration—until the models plateaued at strong performance levels.
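A minimal sketch of this targeted-sampling step, assuming per-trace probabilities from the compact model and binary weak labels from the LLM judge; the scoring heuristics and review budget are illustrative.

```python
# Targeted sampling sketch: prioritize traces where the compact evaluator is
# uncertain or disagrees with the LLM-based weak label, and send only those
# to domain experts. Heuristics and budget are illustrative.
import numpy as np


def select_for_review(bert_probs: np.ndarray, llm_labels: np.ndarray,
                      budget: int = 200) -> np.ndarray:
    """Return indices of the `budget` most informative traces."""
    uncertainty = 1.0 - np.abs(bert_probs - 0.5) * 2.0  # 1.0 at p=0.5, 0.0 at p in {0, 1}
    disagreement = np.abs(bert_probs - llm_labels)      # compact model vs. weak label
    priority = np.maximum(uncertainty, disagreement)    # either signal flags a trace
    return np.argsort(-priority)[:budget]
```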
Beyond this, we opted for “labeling committees” instead of splitting the work between individuals. While this might sound inefficient, it ensures a common understanding of the metrics, avoids inconsistent labels, and helps validate the metric definitions.
Improved Accuracy and Faithfulness
Relevance: Increased AUC from 0.71 to 0.88, equivalent to a ~40% reduction in incorrect judgments.
Faithfulness: Custom evaluator models handling contexts up to 128k tokens improved AUC from 0.71 to over 0.84, significantly enhancing reliability in detecting inconsistencies and hallucinations.
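For intuition, here is a generic chunk-and-aggregate sketch of how faithfulness can be scored against contexts far longer than a compact evaluator’s input window; the whitespace tokenizer, window sizes, and max-aggregation are illustrative stand-ins rather than the production technique.

```python
# Generic chunk-and-aggregate stand-in for long-context faithfulness scoring:
# slide a window over the context, score the answer against each window, and
# keep the best support found. Not the production technique; sizes are placeholders.
from typing import Callable


def chunk(text: str, window: int = 512, stride: int = 384) -> list[str]:
    tokens = text.split()  # placeholder tokenizer; use the evaluator's tokenizer in practice
    return [" ".join(tokens[i:i + window]) for i in range(0, max(len(tokens), 1), stride)]


def long_context_faithfulness(answer: str, context: str,
                              score: Callable[[str, str], float]) -> float:
    """`score(answer, window) -> float` is the short-context evaluator model."""
    return max(score(answer, window) for window in chunk(context))
```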
Significant Cost and Latency Improvements
Achieved an 80% reduction in evaluation costs by transitioning from frequent LLM queries to a single, CPU-optimized 400M parameter alBERTa model.
Real-time inference latency is now sufficiently low for implementing guardrails and dynamic routing decisions without disrupting user experience.
Enhanced Agent Routing and Reduced Tool Call Errors
Introduced a "tool call quality" metric integrated into mailman routing, significantly reducing tool call failures (previously at 18%) due to issues like invalid filters.
Built a refined mailman classifier leveraging the same alBERTa model used for evaluations. Initial beta results show a 25% improvement in routing accuracy (AUROC increased to ~0.84), ensuring queries are directed to the most relevant agents, thus reducing hallucinations and improving user satisfaction.
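For readers who want to reproduce the general pattern, here is a sketch of fine-tuning a compact encoder as such a classifier. It uses a generic Hugging Face sequence-classification setup as a stand-in for the proprietary alBERTa pipeline; the base model, data format, and hyperparameters are placeholders.

```python
# Sketch: fine-tune a compact encoder as a routing/evaluation classifier.
# Generic Hugging Face stand-in for the proprietary alBERTa pipeline;
# base model, data format, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"  # placeholder compact encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# `records` would come from the weak-labeling / active-learning pipeline: each item
# pairs a query plus a candidate agent (or retrieved context) with a 0/1 label.
records = [
    {"text": "Which titles do we have on Barack Obama? [AGENT] prh", "label": 1},
    {"text": "Which titles do we have on Barack Obama? [AGENT] rtl", "label": 0},
]
dataset = Dataset.from_list(records).map(
    lambda r: tokenizer(r["text"], truncation=True, padding="max_length", max_length=256)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="router-classifier", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```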
Accelerated Improvement via Active Learning
Implemented targeted active learning, substantially reducing labeling efforts by domain experts while maximizing model quality improvements.
What started as a way to "measure system quality" has evolved into a comprehensive framework for developing trustworthy AI systems. By embedding evaluation deeply into the development process, we can continuously monitor, debug, and refine the system's behavior. This approach enables us to systematically collect the signals needed to assess and improve the system's performance and its alignment with business objectives.
The foundation laid by this workstream enables several key capabilities:
Data-Driven Optimization: Systematically identify improvement areas and validate system changes against quantitative metrics.
Proactive Quality Assurance: Surface potential issues before they impact end-users through real-time inference on model outputs.
Responsible Deployment: Implement safeguards and oversight to mitigate risks and ensure adherence to ethical principles and regulatory requirements.
Transparent Reporting: Provide stakeholders with clear, evidence-based insights into system performance and decision-making.
A key next step is to fully close the loop between evaluation and application development. By implementing active learning to automatically surface "uncertain" or high-impact examples for expert review, we can ensure the system continuously adapts to new data and requirements. This flywheel of measurement, adjustment, and retraining is crucial for maintaining performance as the underlying data evolves.
A flexible and robust multi-agent architecture lends itself to numerous high-value use cases across the enterprise, such as:
Knowledge Management: Enhance information discoverability and reuse by intelligently indexing and surfacing relevant content.
Process Efficiency: Streamline common workflows through AI-assisted triage, routing, and response generation.
Editorial Quality: Assist content creators by providing real-time feedback and suggestions grounded in established brand guidelines and performance metrics.
The Bertelsmann AI Hub-LastMile AI collaboration demonstrates how thoughtful, metrics-driven evaluation practices can be embedded into the core of an enterprise AI system. By making evaluation an integral part of the development process—not just a one-off audit—we can enable a more agile, transparent, and accountable approach to building and deploying AI.
This initiative provides a model for how large organizations can leverage AI to drive innovation while still ensuring responsible development and governance. Evaluation serves as the linchpin, providing the measurable, actionable insights needed to align model performance with organizational values and objectives.
The Bertelsmann Content Search is a pioneering multi-agent RAG system that unites data from across one of the world’s most diverse media ecosystems. Its evaluation framework, co-developed with LastMile AI, elevates the system into a self-optimizing and cost-effective enterprise solution.
This synergy of measurement, improvement, and governance forms a blueprint for large organizations aiming to harness AI responsibly.
By weaving evaluation deeply into the pipeline, Bertelsmann has taken a bold step toward trustworthy AI—and with LastMile AI’s AutoEval platform, the foundation is set for future expansions (like process automation, editorial verification, and advanced compliance workflows).
Bertelsmann AI Hub Team: Visionaries behind the multi-agent architecture and the rigorous approach to metrics.
LastMile AI: Builders of the AutoEval platform, co-creators of compact evaluator models, and trusted partners in AI governance.
This case study reflects ongoing collaboration between Bertelsmann and LastMile AI, demonstrating how carefully crafted evaluation strategies can fuel both performance and responsible AI governance.