Why your RAG system is lying to you (and how to catch it)
Most RAG pipelines hallucinate more than teams realize. Here's the eval framework we use on every engagement to measure faithfulness, relevance, and groundedness — before users find the bugs for you.
Most teams ship their RAG system the moment it returns plausible-looking answers to five hand-picked questions. Then they wonder why support tickets start rolling in about "wrong" answers that the AI delivered with perfect confidence. The problem isn't the LLM — it's the eval gap.
We've built RAG systems for legal, fintech, and healthcare clients. Every single one hallucinated in ways that only showed up under systematic evaluation. Here's the framework we now use on every engagement.
The three failure modes
RAG systems fail in three distinct ways, and most teams only test for one of them:
> "The retriever found the right document, but the LLM ignored it and made something up."
- Retrieval failure — the right chunks never make it into the context window. Your embedding model doesn't understand domain jargon, your chunking strategy splits key paragraphs, or your top-k is too low.
- Synthesis failure — the right chunks are in context, but the model ignores them, paraphrases incorrectly, or merges information from unrelated chunks.
- Grounding failure — the answer looks correct and cites a source, but the citation doesn't actually support the claim. This is the hardest to catch and the most dangerous.
Building an eval harness
An eval harness doesn't need to be complicated. At minimum, you need:
```python
# Minimal RAG eval structure
eval_cases = [
    {
        "query": "What is the late payment penalty?",
        "expected_chunks": ["contract-sec-4.2"],
        "expected_answer_contains": ["2.5%", "monthly"],
        "expected_answer_excludes": ["annual", "waived"],
    },
    # ... 50-200 cases covering your domain
]

def run_eval(pipeline, cases):
    results = []
    for case in cases:
        retrieved = pipeline.retrieve(case["query"])
        answer = pipeline.generate(case["query"], retrieved)
        results.append({
            "retrieval_recall": chunk_recall(retrieved, case["expected_chunks"]),
            "faithfulness": check_faithfulness(answer, retrieved),
            "correctness": check_answer(answer, case),
        })
    return aggregate(results)
```
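The harness leans on scorers like `chunk_recall` and `check_answer` without defining them. A minimal sketch of those two, assuming retrieval returns chunk IDs as plain strings (an assumption; your pipeline may return richer objects):

```python
def chunk_recall(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Fraction of the expected chunk IDs that appear anywhere in the retrieved set."""
    if not expected_ids:
        return 1.0
    found = set(retrieved_ids)
    return sum(cid in found for cid in expected_ids) / len(expected_ids)

def check_answer(answer: str, case: dict) -> bool:
    """String-level correctness: required phrases present, forbidden phrases absent."""
    text = answer.lower()
    has_required = all(s.lower() in text for s in case.get("expected_answer_contains", []))
    has_forbidden = any(s.lower() in text for s in case.get("expected_answer_excludes", []))
    return has_required and not has_forbidden
```

Substring matching is crude, but it is deterministic and cheap, which matters when the suite runs on every pipeline change.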
The key insight is that you need to evaluate retrieval and generation separately. A system can have perfect retrieval and terrible generation, or vice versa. If you only check the final answer, you can't diagnose which component is failing.
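The `aggregate` helper is likewise left open. One simple option, sketched here, is a per-metric mean that keeps retrieval and generation scores side by side, so a retrieval regression is distinguishable from a generation regression:

```python
def aggregate(results: list[dict]) -> dict:
    """Mean of each metric across all eval cases, reported per component."""
    keys = ["retrieval_recall", "faithfulness", "correctness"]
    return {k: sum(r[k] for r in results) / len(results) for k in keys}
```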
Metrics that matter
We track five metrics on every RAG engagement:
- Retrieval recall@k — what percentage of relevant chunks appear in the top-k results?
- Retrieval precision@k — what percentage of retrieved chunks are actually relevant?
- Faithfulness — does the generated answer only contain claims supported by the retrieved context?
- Answer relevance — does the answer actually address the question asked?
- Groundedness — can every claim in the answer be traced to a specific chunk?
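Faithfulness is the hardest of these to score automatically; an LLM judge or NLI model is the usual tool. As a cheap first pass, a lexical proxy (a hypothetical heuristic, not the method from the harness above) can flag blatantly unsupported claims: score each answer sentence by how many of its words appear in the retrieved context.

```python
import re

def faithfulness_score(answer: str, context: str, threshold: float = 0.6) -> float:
    """Crude lexical proxy for faithfulness: fraction of answer sentences whose
    words mostly appear in the retrieved context. Catches only blatant
    fabrications; paraphrases and negations need an LLM judge or NLI model."""
    context_words = set(re.findall(r"[a-z0-9%.]+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 1.0
    supported = 0
    for sentence in sentences:
        words = re.findall(r"[a-z0-9%.]+", sentence.lower())
        if not words or sum(w in context_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)
```

Treat this as a tripwire, not a verdict: a low score warrants a human or LLM-judge review, while a high score proves little on its own.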
We use Langfuse to track these over time. The dashboard shows trends, so you can see if a reindexing job or prompt change degraded quality before users notice.
The chunking trap
In our experience, the large majority of RAG quality issues trace back to chunking. The defaults in LangChain and LlamaIndex (on the order of 1000 tokens with 200 of overlap) are almost never right for domain-specific content. Legal contracts need clause-level splitting. Medical records need section-aware parsing. Financial reports need table-preserving extraction.
```python
# Domain-aware chunking for legal contracts
def chunk_contract(text: str) -> list[Chunk]:
    sections = split_by_clause_headers(text)
    chunks = []
    for section in sections:
        if len(section.tokens) > MAX_CHUNK:
            # Split long clauses at paragraph boundaries
            chunks.extend(split_at_paragraphs(section))
        else:
            chunks.append(section)
    # Preserve cross-references as metadata
    for chunk in chunks:
        chunk.metadata["references"] = extract_cross_refs(chunk.text)
    return chunks
```
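The snippet assumes helpers such as `split_by_clause_headers` and a `Chunk` type that aren't shown. A minimal sketch of both, assuming clauses open with numbered headers like "4.2 Late Payment" (a hypothetical convention; real contracts need a tuned pattern, and this `Chunk` omits the token counting used above):

```python
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

# Hypothetical header convention: a line starting with a section number
# like "4." or "4.2" followed by a capitalized title.
CLAUSE_HEADER = re.compile(r"(?m)^(?=\d+(?:\.\d+)*\s+[A-Z])")

def split_by_clause_headers(text: str) -> list[Chunk]:
    """Split contract text at clause-header boundaries. The zero-width
    lookahead keeps each header attached to its clause body."""
    parts = (p.strip() for p in CLAUSE_HEADER.split(text))
    return [Chunk(text=p) for p in parts if p]
```

Splitting on a zero-width lookahead rather than consuming the header is the important detail: a chunk that loses its clause number also loses the citation anchor the grounding metric depends on.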
What we ship
Every RAG engagement at KODEVS includes:
- A versioned eval dataset with 50–200 test cases
- Automated eval runs on every pipeline change
- Langfuse integration for production monitoring
- A runbook for when metrics drop below thresholds
- Monthly eval reviews for the first quarter
The eval harness typically takes 2–3 days to build. It has saved every single client from shipping a system that would have eroded user trust. That's not a nice-to-have — it's the difference between an AI feature users rely on and one they learn to ignore.