November 8, 2025

Why simple beats clever in RAG

Notes from building a multi-document RAG pipeline that wouldn't stop almost-working.

I shipped a multi-document RAG system over the second half of 2025. The version that actually worked was less interesting than the four versions before it.

This essay is about why.

The trap

RAG is alluring because every layer feels like a place where cleverness pays off. Better chunking. Smarter re-ranking. Multi-hop retrieval. Self-querying. Each new trick promises a step-change in answer quality.

In practice, every layer is also a new place for the system to almost-work, to produce plausible answers that are wrong in non-obvious ways.

The moment this became clear to me was version 3 of the pipeline. It had hierarchical chunking (parent → section → paragraph), a cross-encoder re-ranker, and an agentic retriever that could reason about what to fetch next. On the canonical demo queries (the ones I'd been testing on all along), it was indistinguishable from version 1. On long-tail queries, where a single short paragraph buried in one document held the only relevant answer, version 3 was meaningfully worse.

I didn't catch this with an A/B run or an eval suite. I caught it by reading logs. The retrieved chunks for the long-tail queries were always adjacent to the right paragraph, never the paragraph itself. Each clever layer was pulling the system slightly off-target in the same direction. Three small biases stacked into one consistent failure.

Where simple wins

Three places I kept losing this argument with my past self:

1. Chunking

I started with hierarchical chunking: parent docs, sections, paragraphs, sentences. The retriever could "zoom in." It was clever.

It also lost to a flat 400-token sliding window with 50-token overlap, which retrieved the right context more often because the embedding model produced a much better signal on passages of roughly that size than on either tiny sentences or whole sections.

The specific failure was parent-doc dominance. Section-level chunks consistently out-scored the leaf paragraphs nested inside them, because the embedding of a "section about X" was a denser summary signal than any individual paragraph that unpacked X in detail. Top-k retrieval kept handing me the section header and its short summary blurb, exactly the kind of context the LLM can't actually answer from. Flattening the index killed the hierarchy as a competitor and the paragraph-shaped chunks won.
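
For concreteness, here is a minimal sketch of the flat window described above. It is not the pipeline's actual code: the function name is hypothetical, and it takes a pre-tokenized list, so splitting on whitespace (as in the usage lines) is only a stand-in for the embedding model's real tokenizer.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], window: int = 400, overlap: int = 50
) -> List[List[str]]:
    """Split a token sequence into flat, overlapping windows.

    Hypothetical helper: `window` and `overlap` default to the 400/50
    values from the text; everything else is an illustrative assumption.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    stride = window - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Usage: whitespace tokens are an assumption, not the real tokenizer.
document_text = "some long document text " * 300
chunks = sliding_window_chunks(document_text.split())
```

Every chunk sits at the same level of the index, so there is no parent or section entry left to out-score the paragraph that actually holds the answer.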

2. Retrieval

Agentic retrieval (letting the LLM decide what to search for next) feels like the right shape for hard queries. The narrow class where it actually paid off was synonym-heavy queries: the user's phrasing didn't match the document vocabulary, and the agent's second pass could reformulate the search.

For everything else, the honest answer is that accuracy was roughly the same as single-shot retrieval. The real cost was elsewhere. Each query took about three times as long. Each query cost more, in tokens and in patience. Each query was harder to inspect when it went wrong, because "what did the system actually fetch?" was now a multi-turn trace rather than a single ranked list.

Most queries are not the hard query. You pay the multi-hop cost on every request to make a small minority of them better.
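
A back-of-the-envelope way to see that trade, as a sketch: the 3x latency multiplier comes from the numbers above, but the single-shot latency and the fraction of queries that actually improve are hypothetical inputs you would have to measure on your own traffic.

```python
def extra_seconds_per_improved_query(
    single_shot_seconds: float,
    latency_multiplier: float,
    improved_fraction: float,
) -> float:
    """Extra latency spent across all queries, per query that got better.

    Hypothetical cost model: every request pays the multi-hop overhead,
    but only `improved_fraction` of requests get a better answer for it.
    """
    if not 0 < improved_fraction <= 1:
        raise ValueError("improved_fraction must be in (0, 1]")
    extra_per_query = single_shot_seconds * (latency_multiplier - 1)
    return extra_per_query / improved_fraction


# Illustrative inputs only: 1 s single-shot, 3x multiplier, 1 in 10 improved.
cost = extra_seconds_per_improved_query(1.0, 3.0, 0.1)
```

The smaller the share of queries that benefit, the faster that per-improvement cost grows, which is the whole argument in one division.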

3. Failure documentation > adversarial query handling

The most useful thing I built wasn't a retrieval improvement. It was a logbook of queries the pipeline had gotten wrong, with the actual retrieved chunks and the generated answer next to the expected answer.

Looking at that logbook for ten minutes did more to improve the system than any single technique I added on top of it.
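
A logbook like this needs very little machinery. The sketch below is one possible shape, assuming a JSON Lines file on disk; the `FailureEntry` fields mirror what the text describes (query, retrieved chunks, generated answer, expected answer), while the class name, the `note` field, and the file format are illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class FailureEntry:
    """One wrong answer, with everything needed to inspect it later."""
    query: str
    retrieved_chunks: List[str]
    generated_answer: str
    expected_answer: str
    note: str = ""  # hypothetical free-text field for a diagnosis


def log_failure(path: Path, entry: FailureEntry) -> None:
    """Append one entry as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(entry), ensure_ascii=False) + "\n")


def read_logbook(path: Path) -> List[FailureEntry]:
    """Load every entry back, skipping blank lines."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [FailureEntry(**json.loads(line)) for line in f if line.strip()]
```

Keeping the retrieved chunks next to the generated and expected answers is the point: it lets you tell at a glance whether a failure happened in retrieval or in generation.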

The pattern the logbook surfaced wasn't a retrieval problem at all. The retrieved chunks were correct. The generated answers, on a specific subclass of queries, just ignored them: the LLM had latched onto its parametric memory of an adjacent topic and was producing fluent, plausible, wrong answers despite having the actual source paragraph in context.

You can't fix that with smarter retrieval. The clever-retrieval layer was working fine; the problem was downstream of it. The fix was almost embarrassingly small: a post-hoc validation step that compared the answer back against the retrieved chunks and flagged anything that introduced facts not present in any chunk. A hundred lines of straight Python beating four hundred lines of orchestration I'd written in the layers above it.
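
To make the shape of that check concrete, here is a deliberately crude sketch. It is not the validation step from the pipeline: it uses lexical overlap as a proxy for "facts not present in any chunk", and the function names, stopword list, and 0.6 threshold are all assumptions chosen for illustration.

```python
import re
from typing import List, Set

# Small illustrative stopword list; a real one would be longer.
_STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was", "can"}


def _content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric words, minus stopwords and very short tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in _STOPWORDS}


def _split_sentences(text: str) -> List[str]:
    """Naive sentence split on whitespace after ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def unsupported_sentences(
    answer: str, chunks: List[str], min_support: float = 0.6
) -> List[str]:
    """Return answer sentences whose content words no single chunk covers.

    A sentence is flagged when, for every chunk, fewer than `min_support`
    of its content words appear in that chunk.
    """
    chunk_vocabs = [_content_words(c) for c in chunks]
    flagged: List[str] = []
    for sentence in _split_sentences(answer):
        words = _content_words(sentence)
        if not words:
            continue  # nothing checkable in this sentence
        best = max(
            (len(words & vocab) / len(words) for vocab in chunk_vocabs),
            default=0.0,  # no chunks at all means nothing is supported
        )
        if best < min_support:
            flagged.append(sentence)
    return flagged
```

A word-overlap check like this will miss paraphrases and can be fooled by shared vocabulary, so treat it as a flagging heuristic rather than a verdict. What matters is where it sits: after generation, comparing the answer against what was actually retrieved.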

The principle

Every layer of orchestration adds surface area for failure modes. Simplicity isn't a virtue in itself, but in RAG specifically, each clever layer makes the system harder to debug while making the failure cases harder to even notice.

Build the dumb version first. Document every place it's wrong. Add complexity only where the dumb version's failure is both recurrent and not fixable upstream.

Reframe the three moves and they collapse into one. Flattening the chunk hierarchy was knowing-when-not-to. Killing the agentic retriever was knowing-when-not-to. Writing the post-hoc filter instead of layer five was knowing-when-not-to. Building clever pipelines is a craft skill. Knowing when to stop is a different one: quieter, harder to demo, only legible to people who've watched a system almost-work for long enough to recognize what almost-working costs them. The logbook is what taught me the second skill. The first I already had.