A consultant runs a successful demo. Three months later, the same RAG system in production answers half the questions wrong, leaks documents users shouldn't see, and nobody can explain why. We've inherited enough of these systems to recognize the pattern.
The RAG fantasy
The fantasy of retrieval-augmented generation is that you can take an arbitrary corpus, chunk it, embed it, and turn a language model into an expert on it. This is true in a demo, where the corpus is curated, the questions are softballs, and the system runs against the most expensive model the consultant could justify.
Production is different. The corpus contains 14 versions of the same policy. Some of those versions are confidential. The questions arrive from real users with real intent, and the model, when it doesn't know, makes something up that sounds correct enough to ship.
Four failure modes we see repeatedly
Across enterprises we've audited, four structural failures recur. They are not bugs. They are symptoms of an underlying design assumption: that semantic retrieval alone is sufficient.
1. The retrieval layer ignores the metadata that matters most
Vector similarity over chunks is one signal. Document version, jurisdiction, effective date, superseded-by, classification, owner, and locale are different signals. The first is usually all that's wired into the index. The rest are usually missing or, worse, present but unused.
The consequence: the model retrieves the highest-similarity chunk regardless of whether it's from a document that was superseded six months ago, or written for the wrong jurisdiction, or marked confidential.
2. Permissions are bolted on after retrieval
A typical pattern: retrieve everything, then filter the response. This is wrong in two ways. First, the model has already seen documents the user shouldn't have access to, which means its reasoning can leak privileged content even when the citation is filtered. Second, post-hoc filtering produces inconsistent answers: the same question yields different responses for different users because the model rationalized from different sources.
3. Evaluation is run once, never again
The system is evaluated on a curated test set during build. It ships. The corpus changes, the models change, the prompts change. Nobody runs the evaluation again. Drift goes unobserved until a regulator or an internal auditor catches a wrong answer in the wild.
4. The agent has no boundary
Without a clear specification of what the system will and will not answer, the model attempts everything. It interprets ambiguous queries as best it can. It pattern-matches to the closest document. It produces answers that are confidently wrong because they are confidently retrieved.
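One way to give the system a boundary is an explicit scope check that runs before any retrieval. The sketch below is illustrative only: `ScopePolicy`, `classifyTopic`, and the topic names are hypothetical, and the keyword classifier is a stand-in for whatever classifier a real system would use.

```typescript
// Hypothetical scope gate: decide whether to answer before retrieval runs.
type Decision =
  | { kind: "answer"; topic: string }
  | { kind: "refuse"; reason: string }
  | { kind: "handoff"; reason: string };

interface ScopePolicy {
  inScope: Set<string>;  // topics the system is allowed to answer
  handoff: Set<string>;  // topics routed to a human instead
  minConfidence: number; // below this, the query is treated as ambiguous
}

interface TopicGuess {
  topic: string;
  confidence: number; // 0..1
}

// Stand-in classifier (assumption): a keyword lookup. A real system would
// use a trained classifier or a model call here.
function classifyTopic(query: string): TopicGuess {
  const q = query.toLowerCase();
  if (q.includes("expense")) return { topic: "expense-policy", confidence: 0.9 };
  if (q.includes("lawsuit")) return { topic: "legal-advice", confidence: 0.9 };
  return { topic: "unknown", confidence: 0.2 };
}

function decide(query: string, policy: ScopePolicy): Decision {
  const guess = classifyTopic(query);
  if (guess.confidence < policy.minConfidence) {
    return { kind: "refuse", reason: "query is ambiguous; ask the user to clarify" };
  }
  if (policy.handoff.has(guess.topic)) {
    return { kind: "handoff", reason: `topic "${guess.topic}" is handled by a human` };
  }
  if (policy.inScope.has(guess.topic)) {
    return { kind: "answer", topic: guess.topic };
  }
  return { kind: "refuse", reason: `topic "${guess.topic}" is out of scope` };
}

const policy: ScopePolicy = {
  inScope: new Set(["expense-policy"]),
  handoff: new Set(["legal-advice"]),
  minConfidence: 0.5,
};
```

The point of the design is that "refuse" and "handoff" are first-class outcomes with reasons attached, so an ambiguous query never silently falls through to the closest document.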
Metadata first, embeddings later
The fix begins with the realization that enterprise retrieval is a structured search problem with a semantic component, not a semantic search problem with structured metadata as a side decoration. We design retrieval pipelines that filter on structured attributes first, then rank on semantic similarity within the filtered set.
// Conceptual sketch, production code adds eval hooks, telemetry, etc.
const candidates = await documents.filter({
  jurisdiction: user.context.jurisdiction,
  version_status: 'current', // never superseded
  classification: { lte: user.clearance },
  locale: user.locale,
  effective_date: { lte: today },
});
const ranked = await vector.search({
  query: embed(query),
  scope: candidates.map(c => c.id), // ranks ONLY within filtered set
  k: 12,
});
const reranked = await reranker.rank(query, ranked);

The structural filter is non-negotiable. The vector store is a ranker over an already-correct candidate set. It does not get to surface documents the user cannot read or that are not current.
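To make the ordering concrete, here is a self-contained in-memory version of the same idea. The `Doc` and `User` shapes, the numeric clearance levels, and the toy two-dimensional embeddings are assumptions for illustration; a real deployment would push the filter into the document store and the ranking into the vector index.

```typescript
interface Doc {
  id: string;
  jurisdiction: string;
  versionStatus: "current" | "superseded";
  classification: number; // higher means more restricted (assumed encoding)
  locale: string;
  effectiveDate: string;  // ISO date, e.g. "2025-01-31"
  embedding: number[];
}

interface User {
  jurisdiction: string;
  clearance: number;
  locale: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Step 1: structural filter. Step 2: semantic ranking inside the filtered set.
function retrieve(docs: Doc[], user: User, queryEmbedding: number[], today: string, k: number): Doc[] {
  const candidates = docs.filter(d =>
    d.jurisdiction === user.jurisdiction &&
    d.versionStatus === "current" &&
    d.classification <= user.clearance &&
    d.locale === user.locale &&
    d.effectiveDate <= today // ISO dates compare correctly as strings
  );
  return candidates
    .map(d => ({ d, score: cosine(d.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(x => x.d);
}

const docs: Doc[] = [
  { id: "policy-v1", jurisdiction: "DE", versionStatus: "superseded", classification: 1, locale: "de", effectiveDate: "2023-01-01", embedding: [1, 0] },
  { id: "policy-v2", jurisdiction: "DE", versionStatus: "current", classification: 1, locale: "de", effectiveDate: "2024-06-01", embedding: [0.8, 0.6] },
  { id: "board-memo", jurisdiction: "DE", versionStatus: "current", classification: 3, locale: "de", effectiveDate: "2024-06-01", embedding: [1, 0] },
];
const user: User = { jurisdiction: "DE", clearance: 1, locale: "de" };
const results = retrieve(docs, user, [1, 0], "2025-01-01", 12);
```

In this toy data the superseded version and the restricted memo are both closer to the query than `policy-v2`, yet neither reaches the ranker, so `results` contains only `policy-v2`.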
Permissions are not a retrieval problem. They are an identity problem.
The right boundary is the user's identity at the moment of the query. Retrieval runs under that identity. Generation runs against an already-permission-filtered context. The model never sees what the user cannot.
This requires that the underlying systems (DMS, SharePoint, S3, registries, whatever holds the documents) expose access-controlled APIs that the retrieval layer can call as the user. In environments where this is not possible, we build a permission projection: a separate index per role, or per clearance band, refreshed as identity attributes change.
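A permission projection can be sketched as a precomputed map from clearance band to the set of document IDs visible at that band. The band names, the `SourceDoc` shape, and the rule that a band sees everything at or below its own level are assumptions for illustration; real access rules are usually richer (groups, ownership, need-to-know).

```typescript
type Band = "public" | "internal" | "confidential";

// Assumed ordering: each band can read its own level and everything below it.
const BAND_LEVEL: Record<Band, number> = { public: 0, internal: 1, confidential: 2 };

interface SourceDoc {
  id: string;
  band: Band;
}

// Build one index (here simply a set of document IDs) per clearance band.
function buildProjection(docs: SourceDoc[]): Map<Band, Set<string>> {
  const projection = new Map<Band, Set<string>>();
  for (const band of Object.keys(BAND_LEVEL) as Band[]) {
    const visible = docs
      .filter(d => BAND_LEVEL[d.band] <= BAND_LEVEL[band])
      .map(d => d.id);
    projection.set(band, new Set(visible));
  }
  return projection;
}

// Retrieval for a user only ever touches the index for that user's band.
function searchableIds(projection: Map<Band, Set<string>>, userBand: Band): Set<string> {
  return projection.get(userBand) ?? new Set<string>();
}

const projection = buildProjection([
  { id: "handbook", band: "public" },
  { id: "salary-bands", band: "internal" },
  { id: "m-and-a-memo", band: "confidential" },
]);
```

The projection has to be rebuilt whenever a document's classification or a user's band changes; how often that happens determines whether a per-band index is practical.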
Evaluation is the system. Everything else is decoration.
We treat evaluation infrastructure as the load-bearing structure of a production RAG system. The retrieval logic, the prompts, and the model will all change. The evaluation suite is what makes change safe.
Concretely, we build:
- Golden datasets: 100 to 1,000 curated question/answer/citation triples per use case, owned by domain experts and versioned alongside the codebase.
- Grounding evaluations: every assertion the model makes must trace to a retrieved chunk. We measure this automatically and gate deploys on it.
- Adversarial datasets: questions designed to provoke hallucination, jailbreak, or out-of-scope responses. The system must refuse these reliably.
- Drift detection: periodic re-runs against production traffic samples. Regressions surface within hours, not quarters.
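As one illustration of a grounding gate, the sketch below scores each evaluation case by the share of its assertions that cite a chunk actually retrieved for that question, then blocks a deploy if the mean falls below a threshold. The `Assertion` and `EvalCase` shapes, the 0.95 threshold, and the assumption that assertions arrive already split and tagged with a cited chunk ID are all hypothetical. The check only confirms that a citation points at a retrieved chunk, not that the chunk supports the claim; that verification is the harder part and is not shown.

```typescript
interface Assertion {
  text: string;
  citedChunkId: string | null; // null when the model cited nothing
}

interface EvalCase {
  question: string;
  retrievedChunkIds: string[];
  assertions: Assertion[];
}

// Fraction of assertions that cite a chunk present in the retrieved set.
function groundingScore(c: EvalCase): number {
  if (c.assertions.length === 0) return 1; // nothing asserted, nothing ungrounded
  const retrieved = new Set(c.retrievedChunkIds);
  const grounded = c.assertions.filter(
    a => a.citedChunkId !== null && retrieved.has(a.citedChunkId)
  ).length;
  return grounded / c.assertions.length;
}

// Deploy gate: mean grounding over the golden set must meet the threshold.
function passesGate(cases: EvalCase[], threshold = 0.95): boolean {
  if (cases.length === 0) return false; // no evaluation data, no deploy
  const mean = cases.reduce((sum, c) => sum + groundingScore(c), 0) / cases.length;
  return mean >= threshold;
}

const goldenRun: EvalCase[] = [
  {
    question: "What is the travel expense limit?",
    retrievedChunkIds: ["c1", "c2"],
    assertions: [
      { text: "The limit is set per trip.", citedChunkId: "c1" },
      { text: "Receipts are required.", citedChunkId: "c2" },
    ],
  },
  {
    question: "Who approves exceptions?",
    retrievedChunkIds: ["c3"],
    assertions: [
      { text: "The line manager approves.", citedChunkId: "c3" },
      { text: "Finance is notified.", citedChunkId: "c9" }, // cites a chunk that was never retrieved
    ],
  },
];
```

Here the second case scores 0.5, the mean is 0.75, and `passesGate(goldenRun)` returns `false`, which is the behavior a CI step would act on.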
“If you do not have a published golden dataset for your retrieval system, you do not have a retrieval system. You have a prototype that hasn't failed yet in the way you'll find embarrassing.”
From a post-mortem we wrote in early 2025
What to design instead
The systems that work in production share a small set of properties. None of them are about which model you use.
- Structured candidate selection. Filter on jurisdiction, version, classification, locale, and effective date before semantic ranking. The vector store is a ranker, not a router.
- Identity-bound retrieval. The user is a parameter of retrieval. The model sees only what the user could see if they did the search themselves.
- Explicit scope. Specify what the system will not answer. Refuse cleanly. Hand off to humans. Refusal is a feature, not a bug.
- Citations as a contract. Every assertion is linkable to its source paragraph. Users and auditors verify by clicking.
- Continuous evaluation. Golden datasets, grounding evaluations, adversarial sets, and drift detection run in CI. When the system regresses, you know within hours.
- Audit by default. Every retrieval, every generation, every citation is logged with reproducibility guarantees. The auditor's question, “why did the system say this?”, has a reproducible answer for every past response.
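A minimal sketch of what an audit record might pin down so that a past response can be explained later. The field names, the version identifiers, and the in-memory log are assumptions for illustration; a production log would be durable and append-only at the storage layer.

```typescript
interface AuditRecord {
  responseId: string;
  timestamp: string;           // ISO 8601
  userId: string;
  query: string;
  indexSnapshot: string;       // which version of the corpus index was searched
  promptVersion: string;
  modelVersion: string;
  retrievedChunkIds: string[]; // in ranked order
  citedChunkIds: string[];
  answer: string;
}

class AuditLog {
  private readonly records = new Map<string, AuditRecord>();

  // Append-only: an existing record is never overwritten.
  append(record: AuditRecord): void {
    if (this.records.has(record.responseId)) {
      throw new Error(`duplicate responseId: ${record.responseId}`);
    }
    // Shallow copy and freeze so later mutation of the caller's object
    // does not change what was logged.
    this.records.set(record.responseId, Object.freeze({ ...record }));
  }

  // Answers "why did the system say this?" for a past response.
  explain(responseId: string): AuditRecord | undefined {
    return this.records.get(responseId);
  }
}

const log = new AuditLog();
log.append({
  responseId: "r-1",
  timestamp: "2025-01-15T09:30:00Z",
  userId: "u-42",
  query: "What is the travel expense limit?",
  indexSnapshot: "index-snapshot-17", // hypothetical identifiers
  promptVersion: "prompt-v4",
  modelVersion: "model-a-v3",
  retrievedChunkIds: ["c1", "c2", "c7"],
  citedChunkIds: ["c1"],
  answer: "The limit is set per trip.",
});
```

Recording the index snapshot, prompt version, and model version alongside the retrieved chunk IDs is what lets someone rerun the same inputs later, provided those versions are themselves retained.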
None of this is exotic. None of it is research. It is what production engineering looks like when applied to a retrieval system. The reason most enterprise RAG fails is not technical. It is that the systems were built to demo, not to operate. Demo systems show that something is possible. Production systems hold up under conditions that no demo can simulate.
We've rebuilt enough of these systems to write this with confidence. If you've inherited a RAG system that's quietly failing in production, or you're about to build one, we can help. Most of our compliance and document-intelligence engagements begin with rebuilding retrieval on these principles.