

Harder, Better, Faster, Stronger: 🤖 LLM Hallucination Detection for Real-World RAG, Part I

Written By
Jonathan Lessinger

TL;DR

  • We show that a small, specialized model (p-faithful-v0) can classify LLM outputs as faithful or hallucinatory with accuracy similar to or better than much larger baseline models.

  • p-faithful-v0 is a tiny fraction of the size of the LLMs backing the baseline algorithms, unlocking large inference cost savings.

  • We introduce a new hallucination detection benchmark tailored for the development of real-world detector models.

  • We show an additional 23-percentage-point (70% -> 93%) accuracy improvement with modest fine-tuning.

We are extending early access to our hallucination detection model (p-faithful-v0) to test on a diverse set of use cases. If you need help evaluating your production RAG application, we would love to work with you.

Request Early Access.

Introduction

We’ve all figured out by now that our LLMs are lying to us. If not all the time, still with worrying frequency, appalling confidence, and an infuriating lack of shame. LastMile is fighting back: we present a small, specialized language model (“p-faithful-v0”) that achieves very high accuracy for its size on several hallucination detection benchmarks.

In this post, we show how LLM responses generated by a RAG (Retrieval Augmented Generation) application can be accurately and cheaply classified as either faithful or hallucinatory. The model has comparable accuracy to the baselines, with less than 1% of the parameters. We also present an additional benchmark, “Fully-Synthetic V2” (FSV2), a dataset of synthetic inputs and labels which aims to address many of the issues with the other benchmarks. We built this benchmark in order to more confidently make inferences about real industrial applications. We provide additional analysis on this synthetic data, showing that p-faithful-v0 can easily jump from 70% to 93% with modest fine-tuning.

Motivation & Problem Statement

LLMs can hallucinate even when they are directly fed ground truth data as part of the prompt, as is typical in a RAG setup. We want to be able to accurately, quickly, and cheaply detect when an LLM is hallucinating vs. when it is faithful to this ground-truth input data. Such detection can be used in application logic, e.g. to suppress high-risk responses, for batch offline analytics, for A/B testing, etc.
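To make the application-logic use concrete, here is a minimal sketch of a response gate built on such a detector. The `score_faithfulness` callable, threshold, and fallback message are illustrative assumptions rather than any specific API; the scorer is simply assumed to return p(faithful | question, data, response) in [0, 1].

```python
# Minimal sketch (assumed interface, not a specific API): gate a RAG response
# on a faithfulness score before showing it to the user.
from typing import Callable

FAITHFULNESS_THRESHOLD = 0.5  # illustrative; tune to your risk tolerance
FALLBACK_MESSAGE = "Sorry, I couldn't find a well-supported answer in the retrieved documents."


def guarded_response(
    question: str,
    retrieved_chunks: list[str],
    llm_response: str,
    score_faithfulness: Callable[[str, str, str], float],  # your detector client
) -> str:
    # In a RAG setup, the reference is simply the retrieved context.
    data = "\n\n".join(retrieved_chunks)
    score = score_faithfulness(question, data, llm_response)
    if score < FAITHFULNESS_THRESHOLD:
        # Suppress (or regenerate / escalate) high-risk responses.
        return FALLBACK_MESSAGE
    return llm_response
```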

To benchmark detector accuracy, we followed the lead of the baseline detectors (discussed in more detail under “Baselines & Benchmarks” below). Each detector is a probabilistic binary classifier

p(faithful | question, data, response)

scoring input text from 0 to 1. The score estimates the faithfulness of the response to the reference, where 0 means certainly a hallucination. In RAG applications, where annotated reference answers are scarce if they exist at all, one can set the reference to be the data fed to the LLM (composed of the chunks retrieved for a particular query).

The benchmarks consist of rows, each with a question, data, and two answers. By construction, one of the two answers is faithful and the other is not (i.e. it is hallucinatory). The task is to predict which answer is which. This is done by scoring each triplet
(question, data, response1) and (question, data, response2)
from 0 to 1 using the classifier described above. The answer with the higher score is the predicted faithful answer, and the other one is the predicted hallucination.

For a given row, an evaluator is considered correct if it correctly identifies which answer is the faithful one (1 point). In the case of a tie (i.e. it scores the two answers the same) it is considered half correct (0.5 points).
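As a concrete reading of this metric, here is a minimal sketch of the pairwise scoring loop, assuming a generic `scorer` callable implementing p(faithful | question, data, response) and illustrative field names for each benchmark row.

```python
# Minimal sketch of the pairwise benchmark metric described above.
from typing import Callable, Iterable


def pairwise_accuracy(
    rows: Iterable[dict],
    scorer: Callable[[str, str, str], float],  # p(faithful | question, data, response)
) -> float:
    points, total = 0.0, 0
    for row in rows:
        s_faithful = scorer(row["question"], row["data"], row["faithful_answer"])
        s_hallucinated = scorer(row["question"], row["data"], row["hallucinated_answer"])
        if s_faithful > s_hallucinated:
            points += 1.0  # correctly ranked the faithful answer higher
        elif s_faithful == s_hallucinated:
            points += 0.5  # a tie counts as half correct
        total += 1
    return points / total if total else 0.0
```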

Baselines & Benchmarks

We looked around the world 🤖 for some of the best open source LLM hallucination detectors, and used the benchmark datasets they cited. Specifically, we looked at RAGAS (w/ GPT-3.5) on WikiEval and Arize’s Phoenix (w/ GPT-4) on HaluEval. We did our best to reproduce their reported results as baselines (detailed below).

Fully-Synthetic V2 (FSV2): a New Synthetic Benchmark

Both existing benchmarks were interesting, but turned out to have a few central issues:

1. The tasks are too easy. By manual inspection and model analysis, we found that existing benchmarks consist of simple, often short text with easily-distinguishable faithful/hallucinatory pairs. Even if a detector does well on these benchmarks, we are not comfortable inferring that it would do well in the real world, where data is messier and hallucinations might be more varied, less clear, etc. Easy example from HaluEval:

Example pair from HaluEval (relevant words bolded). The contrived data, question, and answers are short and straightforward, and the hallucinated answer is obviously wrong. In contrast, real-world examples are likely to be much longer and messier, and may involve much more difficult comprehension and reasoning.

2. The datasets are small and bounded by manual labeling, especially in the case of WikiEval, limiting measurement confidence. WikiEval contains only 50 pairs. HaluEval is somewhat larger at several thousand examples.

3. LLM training data leak: Benchmarks generally use actual LLMs like GPT-4 to generate both positive and negative answers. If the data was seen during GPT-4’s training, then answers that were expected to be hallucinatory may in fact be in agreement with the data. If GPT-4 does get lucky 🤖 — so to speak — then a negative will be incorrectly labeled as a positive in the benchmark. We found cases like this in WikiEval, and suspect this problem exists in other datasets as well.

To address all 3 problems, we developed a data synthesis pipeline that automatically generates new automatically-labeled faithful / hallucinatory pairs. We used this flexible algorithm to generate FSV2, which is one of our key in-the-lab datasets. “Fully” synthetic means that it does not use any externally-available data, i.e. it does not suffer from the training data leak issue. The algorithm also makes it easier to generate large datasets. Finally, the task is hard: the hallucinations are by design as subtle as possible. They are pretty difficult to label even by hand (we did a lot of spot checking).
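For intuition only, here is a highly simplified sketch of what a fully-synthetic pair generator could look like. This is not our actual pipeline: the `complete` helper and the prompts are illustrative placeholders, and the real algorithm takes considerably more care to keep the hallucinations subtle and the labels reliable.

```python
# Illustrative sketch only; not the FSV2 pipeline. `complete` is any
# prompt-in, text-out LLM helper supplied by the caller.
from typing import Callable


def synthesize_pair(topic_seed: str, complete: Callable[[str], str]) -> dict:
    # 1. Invent a self-contained passage, so no external data (and no
    #    training-data leak) is involved.
    data = complete(f"Write a short, factual-sounding passage about: {topic_seed}")
    # 2. Pose a question answerable only from that passage.
    question = complete(f"Write one question answerable solely from this passage:\n{data}")
    # 3. Produce a faithful answer, then corrupt it as subtly as possible.
    faithful = complete(
        f"Answer using only the passage.\nPassage: {data}\nQuestion: {question}"
    )
    hallucinated = complete(
        "Rewrite this answer so it contradicts the passage in one subtle, "
        f"hard-to-spot detail.\nPassage: {data}\nAnswer: {faithful}"
    )
    return {
        "question": question,
        "data": data,
        "faithful_answer": faithful,
        "hallucinated_answer": hallucinated,
    }
```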

Limitations and Follow-Ups

Given that the data is fully synthetic, it does not directly simulate any real-world domain. However, if raw, unlabeled seed data are available from a production RAG system, the algorithm is flexible enough to simulate pairs that would be generated by that system. Such a “partially-synthetic” dataset would be bounded by the seed dataset, but would serve as a more realistic benchmark. We will move from the lab to the field using this setup in future work.

Next, how do we know the synthesized pairs are correct? There is an apparent circularity problem. We know LLMs hallucinate, so how can we generate correctly-labeled faithful/hallucinated answer pairs using LLMs? Typically, benchmarks rely on human annotation at some step, which breaks the cycle, but takes time, effort, and money.

In our case, we simply accept that we will get some hallucinations and design the algorithm around that. We loosen the requirements and only care that the answer pairs are good enough: i.e. that, with high probability, each pair closely resembles a hallucinated/faithful pair you might see in a gold-labeled dataset.

This pairing property is essential to the dataset design, and we believe analytically (and by a healthy degree of spot checking) that the algorithm works better than alternatives for this purpose. While existing public benchmarks are useful as general-purpose comparisons across many approaches, we’re optimizing ours for specific production-grade RAG system evaluation.

Nevertheless, we will keep improving on FSV2 and empirically scrutinize its correctness in a future post. I’ll give you a minute for your eyebrows to return to your forehead before proceeding.

Results: Out of Domain Testing

For each baseline algorithm (RAGAS and Phoenix), we used their default hallucination detection settings, underlying LLM, and benchmarks (WikiEval and HaluEval, respectively). We attempted to reproduce their results using their code, and then used the same general procedure to test
p-faithful-v0. In the next two tables, all results are completely out-of-domain: no fine-tuning, no in-context learning. (This is similar to, but technically a bit different from, zero-shot classification. We’re not too worried about terminology, but just in case you were wondering.)

Baseline: RAGAS, GPT-3.5 (175B parameters)

Baseline: Phoenix, GPT-4 (175B+ parameters)

Our experimental results (pairwise accuracy). Where the data was sufficient to create a held-out test set, we report results on it; otherwise, we report on the full dataset.

*The authors report 95%, but we were only able to reproduce 85% at best, even using their code off the shelf. This may be because they have not released their ground-truth labels, so we used a proxy task in this case. Of course, we applied the same procedure consistently across the detectors.

Results: Supervised Domain Adaptation via Fine-Tuning

We then did a deeper analysis using FSV2. Specifically, we wanted to see whether performance could be raised to a strong absolute level on our hardest benchmark using a small labeled training set (under 1000 examples in the target distribution). We fine-tuned for a few epochs with mostly default hyperparameters.

Baseline: p-faithful-v0 with no in-distribution fine-tuning

We plan to repeat this on other hard benchmarks, e.g. HaluEval 2.0. Note that for WikiEval, out-of-domain performance is already at 98%, so there is no point in fine-tuning on it.
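For readers who want to try this kind of supervised domain adaptation on their own data, here is a minimal sketch using Hugging Face Transformers with a generic cross-encoder classifier. p-faithful-v0’s architecture and training recipe are not described here, so the model name, row fields, and hyperparameters below are illustrative assumptions only.

```python
# Illustrative fine-tuning sketch (assumptions: a small pretrained encoder and
# rows shaped like {"question", "data", "response", "label"} with label 1 = faithful).
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "microsoft/deberta-v3-small"  # placeholder small encoder, not p-faithful-v0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)


def encode(example: dict) -> dict:
    # Pack question + retrieved data as one segment and the response as the other.
    return tokenizer(
        example["question"] + "\n" + example["data"],
        example["response"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )


def finetune(train_rows: list[dict], eval_rows: list[dict]) -> Trainer:
    train_ds = Dataset.from_list(train_rows).map(encode)
    eval_ds = Dataset.from_list(eval_rows).map(encode)
    args = TrainingArguments(
        output_dir="faithfulness-finetune",
        num_train_epochs=3,                # "a few epochs"
        per_device_train_batch_size=16,    # otherwise mostly default hyperparameters
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer
```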

FSV2 Accuracy vs. Other Benchmarks

We validated the hardness of the FSV2 task by comparing accuracy across benchmarks. Every detector (including p-faithful-v0) does much worse on FSV2 than on its baseline benchmark. Here, we have simply merged and reshaped the previous tables to control for detector and directly compare benchmarks.

Analysis & Takeaways

First, let’s notice that p-faithful-v0 (out-of-domain) does about as well as the baseline — or better — in each configuration. It’s also interesting that the default model for RAGAS is GPT-3.5, whereas Phoenix recommends GPT-4. This may explain why p-faithful-v0 performs similarly to Phoenix but better than RAGAS. In future posts, we will only look at the strongest available baselines.

We can also note the unsurprising but useful result that in-distribution fine-tuning (on only a few hundred examples) dramatically improves the performance of p-faithful-v0. This took only a few epochs on an A100 and involved very minimal hyperparameter tuning.

Finally, we should emphasize that we were able to empirically validate that FSV2 is a hard task. Every out-of-domain detector does much worse on that dataset than on either WikiEval or HaluEval. FSV2 is not, however, complete noise: there is a meaningful distribution to learn (we would expect 50% accuracy for pure noise). Keep in mind that these accuracy numbers are directly comparable across datasets when controlling for detector: each number is just the proportion of rows for which a given detector got the answer pair correct.

In pursuit of reliable, cheap hallucination detection for LLM responses from real-world RAG applications, we find that p-faithful-v0 performs similarly to (or better than) much more expensive alternatives. Furthermore, we present analyses using our new, more realistic benchmark, FSV2, and bring p-faithful-v0’s performance up from 70% to 93% with a little fine-tuning on a modest training set.

Key takeaway: In these experiments, the tiny p-faithful-v0 detector could reliably detect even the subtlest hallucinations we could find.

Get Early Access

We are expanding access to our hallucination detection model to test on diverse real world application data. If you are building RAG applications for production use cases, and want to evaluate them, we’d love to work with you.

What to expect?

  • You will get access to the model end-point for ad hoc testing (pre-trained model is available for out-of-domain inference).

  • We will work closely with you and your team to help you prep your data, and set up your experiments (including adapting the classifier for your domain).

  • We will work with your team if your use case requires the model to run on-prem/in a private environment.

Request Early Access

