The Guide to Evaluating Retrieval-Augmented Generation (RAG) systems
Over the past several months, I’ve worked with startups and companies stuck in the proof-of-concept stage. These companies were caught in a loop of identifying an inaccuracy, applying a one-off fix, and repeating, hoping to hit their accuracy goal. From these conversations and collaborations, it’s become clear that a good testing and evaluation framework is the key to finally transitioning a POC into production. I’m sharing my learnings and takeaways for others who may be encountering similar problems.
Today, RAG application builders can get a functional proof-of-concept working with relative ease. But in order to meet the accuracy goals required to move from proof-of-concept into production, developers run into many challenges. I’ve seen performance regress after a developer adds a “responses should be brief” instruction to the system prompt. I’ve seen chunking nightmares that start with “the chunk size just needs to be a few tokens larger” and end with complex dynamic chunk sizes based on document classification. At times it feels like a hopeless journey, and teams start questioning whether an accuracy goal of 95% is even possible.
Taking a page from traditional ML, setting up a good evaluation suite is essential for deciding how to make incremental progress. I’ll talk about the ideal evaluation (which unfortunately has limitations) and a practical evaluation setup.
Yes, there is a way to definitively evaluate these RAG systems. The perfect evaluation framework is people. People are the only ones who can definitively judge whether a RAG system is responding appropriately, and there are two approaches to incorporating people into the system:
Human-in-the-loop — introducing people into the system to review and/or validate the results from the RAG or LLM application.
Online experimentation — typically an A/B test that drives a portion of production traffic from real users to the RAG or LLM application, measuring the success of the experiment with key metrics.
If both human-in-the-loop and online experimentation are solutions to evaluating RAG applications, why is evaluation still an unsolved problem?
Well, the challenge most companies encounter is that, in most cases, it’s not feasible to instrument these solutions into their RAG applications. Here’s why:
Human-in-the-loop typically requires subject matter experts (SMEs) to review and validate responses. SMEs can precisely determine whether an LLM response is correct or incorrect, especially in more nuanced cases. However, SMEs’ time is valuable, and it’s not always feasible to have them continuously review responses. You can reduce the time required by randomly sampling a small portion of LLM responses for review, but this increases the likelihood of edge cases slipping into production.
Online experimentation is a powerful way to confidently determine whether a RAG application is moving success metrics in the right direction. However, companies are wary of pushing LLMs to production without stronger guarantees that misinformation won’t be generated (e.g. “What Air Canada Lost in ‘Remarkable’ Lying AI Chatbot Case”).
In the absence of the ideal solution, I’ve found that a combination of evaluation techniques serves as a good approximation for evaluating the end-to-end RAG application. The best approach consists of three categories: 1. vibe-checks; 2. evaluators; and 3. collecting user feedback.
Vibe-checks — although this isn’t the most scientific approach, I have found that spot checking by SMEs (engineers, PMs, biz ops, etc.) is a viable and necessary step in improving a RAG application’s performance. The difference between those who have already productionized a RAG application and those who have not is that the experienced teams are organized with their spot checking. They neatly label each input and output with a ground truth label, and they persist these data points to be used in evaluation runs, integration tests, or regression tests. My recommendation is to invest time in a good tool to version control and keep track of your ground truth data. You don’t want to reach the point where you realize you have no ground truth data and have to start from scratch.
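To show what organized spot checking can look like in practice, here’s a minimal sketch that appends each reviewed example to a JSONL file. The field names and file layout are my own illustrative assumptions rather than a prescribed schema; any dataset or version-control tool works as long as the labeled examples are preserved and replayable.

```python
# Minimal sketch: append each SME spot check as a labeled, replayable example.
# Field names and the JSONL layout are illustrative assumptions, not a fixed schema.
import json
from datetime import datetime, timezone
from pathlib import Path

def record_ground_truth(path: Path, query: str, retrieved_context: list[str],
                        response: str, label: str, reviewer: str) -> None:
    """Append one labeled example so it can be replayed in evaluation or regression runs."""
    example = {
        "query": query,
        "retrieved_context": retrieved_context,
        "response": response,
        "label": label,        # e.g. "correct", "incorrect", "partially_correct"
        "reviewer": reviewer,  # which SME did the spot check
        "reviewed_at": datetime.now(timezone.utc).isoformat(),
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(example) + "\n")

# Keep the file under version control so later evaluation runs and regression
# tests replay the exact same labeled examples.
record_ground_truth(
    Path("ground_truth.jsonl"),
    query="What is the refund window?",
    retrieved_context=["Refunds are accepted within 30 days of purchase."],
    response="You can request a refund within 30 days of purchase.",
    label="correct",
    reviewer="pm-alice",
)
```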
Evaluators — functions that measure the performance of your application. Evaluators take inputs, outputs, and optionally ground truth data, and produce meaningful metrics on whether your RAG application is improving on the dimensions you care about. Across the applications I’ve worked with, it’s been difficult to derive a single composite metric that accurately measures whether an LLM response with context performs better or worse; still, I have found that evaluators are great indicators of whether the performance you care about is improving. There are many ways to segment the different types of evaluators (e.g. with/without labels, NLP/CV, etc.), but for this post I’ll break them down by implementation into three types: heuristic-based, evaluator models, and LLM prompting (minimal sketches of each appear below).
- Heuristic-based — traditional NLP functions (BLEU, ROUGE, etc.) that measure the accuracy of output text against ground truth. I have found these metrics less useful lately for measuring the performance of a RAG application’s output.
- Evaluator models — smaller, specialized classification models that validate performance along dimensions like faithfulness, answer relevancy, context relevancy, and semantic similarity. I have found that this approach doesn’t sacrifice performance compared to LLM prompting, but is significantly more cost- and latency-efficient. Given these benefits, we at LastMile AI released our own evaluator model (p-faithful).
- LLM prompting — prompting a state-of-the-art LLM to judge the performance of the RAG application. This approach is often too expensive, too slow, and imprecise (i.e. a single judged score carries little statistical significance).

User feedback — a far cry from online experimentation with external users, but almost all RAG applications go through a dogfooding or internal testing phase. Instrumenting your RAG application to collect well-organized labels from internal users or dogfooders is an important way to acquire high-quality labels that further improve your application (a minimal logging sketch appears below).
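To make the evaluator shape concrete, here’s a minimal sketch of a heuristic evaluator: a ROUGE-1-style unigram recall comparing an output against ground truth. The function signature and the 0 to 1 score scale are illustrative assumptions.

```python
# Minimal sketch of a heuristic evaluator: output and ground truth in, a score out.
# The signature and the 0-1 score scale are illustrative assumptions.
import re
from collections import Counter

def unigram_recall(output: str, ground_truth: str) -> float:
    """ROUGE-1-style recall: the fraction of ground-truth tokens that also appear in the output."""
    ref = Counter(re.findall(r"\w+", ground_truth.lower()))
    out = Counter(re.findall(r"\w+", output.lower()))
    if not ref:
        return 0.0
    overlap = sum(min(count, out[token]) for token, count in ref.items())
    return overlap / sum(ref.values())

score = unigram_recall(
    output="You can request a refund within 30 days of purchase.",
    ground_truth="Refunds are accepted within 30 days of purchase.",
)
print(f"unigram recall: {score:.2f}")  # higher means more ground-truth tokens were covered
```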
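As a stand-in for the evaluator-model approach, here’s a sketch that scores semantic similarity between a response and ground truth using an off-the-shelf sentence-transformers model. This is not the p-faithful model mentioned above and it doesn’t measure faithfulness; the model name is just one common choice.

```python
# Sketch of an evaluator model for semantic similarity, using an off-the-shelf
# sentence-transformers model. This stands in for a specialized evaluator model;
# it is not p-faithful and does not measure faithfulness.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any small embedding model works

def semantic_similarity(output: str, ground_truth: str) -> float:
    """Cosine similarity between embeddings of the output and the ground truth."""
    embeddings = model.encode([output, ground_truth], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

print(semantic_similarity(
    "You can request a refund within 30 days of purchase.",
    "Refunds are accepted within 30 days of purchase.",
))
```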
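And for LLM prompting, here’s a minimal LLM-as-judge sketch using the OpenAI Python client. The prompt wording, the 1-to-5 scale, and the model name are illustrative assumptions, and as noted above, a single judged score is noisy, so aggregate it over many labeled examples rather than treating it as a verdict.

```python
# Minimal LLM-as-judge sketch using the OpenAI Python client (requires OPENAI_API_KEY).
# The prompt wording, the 1-5 scale, and the model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG system's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

On a scale of 1 to 5, how faithful is the answer to the retrieved context?
Reply with a single integer and nothing else."""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model could be swapped in
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())
```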
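Finally, for user feedback during dogfooding, here’s a minimal sketch of logging thumbs-up/thumbs-down tied to the exact query, context, and response that produced it, plus a simple pass-rate rollup. The fields and the rollup metric are illustrative assumptions.

```python
# Minimal sketch of collecting dogfooder feedback as labels and rolling it up.
# Field names and the pass-rate rollup are illustrative assumptions.
import json
from pathlib import Path

FEEDBACK_LOG = Path("feedback.jsonl")

def log_feedback(query: str, retrieved_context: list[str], response: str,
                 thumbs_up: bool, comment: str = "") -> None:
    """Tie each piece of feedback to the exact query, context, and response it refers to."""
    record = {
        "query": query,
        "retrieved_context": retrieved_context,
        "response": response,
        "thumbs_up": thumbs_up,
        "comment": comment,
    }
    with FEEDBACK_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def pass_rate() -> float:
    """Fraction of feedback records marked thumbs-up."""
    if not FEEDBACK_LOG.exists():
        return 0.0
    records = [json.loads(line) for line in FEEDBACK_LOG.read_text(encoding="utf-8").splitlines()]
    if not records:
        return 0.0
    return sum(r["thumbs_up"] for r in records) / len(records)
```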
With a combination of these three techniques, I’ve found that companies have the best framework in place to reach the performance that meets their bar for production.
Evaluation for RAG systems is still an unsolved problem, but I’m betting that with more automation and personalization of these techniques, the state of the art will advance further. If you’re interested in automating or fine-tuning your evaluations, feel free to connect with me at andrew@lastmileai.dev and check us out @ LastMile AI.