Retrieval-Augmented Generation has become the default answer to "how do we give an LLM access to our data." The pattern is sound. The implementations are often not. After building RAG systems across industries — legal, healthcare, finance, SaaS — we keep seeing the same failure modes. Here's what they are and how to avoid them.
Failure Mode 1: Bad Chunking
Most teams chunk documents by character count and call it done. The problem is that semantic meaning does not align with character boundaries. A 500-character chunk that cuts a sentence in half, or splits a table from its header, produces retrievals that confuse the model rather than inform it.
Effective chunking requires understanding the document structure. For technical documentation, chunk by section. For contracts, chunk by clause. For conversational data, chunk by turn or topic boundary. There is no universal chunk size.
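As a concrete illustration of structure-aware chunking, here is a minimal sketch that splits a Markdown document at heading boundaries instead of at a fixed character count, so each chunk carries a complete section. The document text and function name are illustrative; real pipelines would need format-specific parsing per document type.

```python
import re

def chunk_by_section(markdown_text: str) -> list[str]:
    """Split a Markdown document into chunks at heading boundaries,
    so each chunk is a complete section rather than an arbitrary
    character window. Illustrative sketch, not a production parser."""
    # Split immediately before every heading line (e.g. "## Usage").
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    # Drop empty fragments and trim surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]

doc = """# Setup
Install the package.

## Configuration
Set the API key in config.yaml.

## Usage
Call the query endpoint with your question.
"""

chunks = chunk_by_section(doc)
# Each chunk keeps its heading attached, so a retrieval hit
# carries its own context instead of an orphaned half-sentence.
```

The same idea carries over to contracts (split on clause numbering) and transcripts (split on speaker or topic turns); only the boundary pattern changes.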
Failure Mode 2: Retrieval Without Re-ranking
Vector similarity retrieval is fast and scalable, but the top-k results by cosine similarity are not always the top-k results by relevance. A re-ranking step — even a lightweight cross-encoder — consistently improves response quality. Teams that skip this step are leaving significant accuracy on the table.
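The re-ranking step itself is a small piece of code: take the top-k candidates from vector search and re-order them by a stronger relevance score. In the sketch below, a toy token-overlap scorer stands in for a real cross-encoder's predict call, purely so the example is self-contained and runnable; in practice you would plug in an actual cross-encoder model.

```python
def rerank(query: str, candidates: list[str], score_fn) -> list[str]:
    """Re-order retrieved candidates by relevance to the query.
    score_fn stands in for a cross-encoder's scoring call."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)

def overlap_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder score: fraction of query
    # tokens that also appear in the candidate document.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

candidates = [
    "Pricing tiers for the enterprise plan",
    "How to reset your password in the dashboard",
    "Password requirements and reset policy",
]
top = rerank("how do I reset my password", candidates, overlap_score)
# The password-reset how-to now outranks the merely related policy doc.
```

The design point is that retrieval and re-ranking have different cost profiles: the vector index scans millions of chunks cheaply, while the cross-encoder scores only the handful of survivors, so the extra latency is bounded.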
Failure Mode 3: No Evaluation Loop
RAG systems degrade over time as the underlying corpus changes and user query patterns shift. Without a systematic evaluation loop — ground truth queries, precision and recall tracking, regular re-indexing — teams fly blind. We build evaluation pipelines as a first-class deliverable, not an afterthought.
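The core of such an evaluation loop is straightforward: for each ground-truth query, compare the set of chunks the system retrieved against the chunks a human marked as relevant. A minimal sketch, with hypothetical document IDs:

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Precision and recall for one query's retrieved chunk IDs
    against a hand-labeled ground-truth set."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }

# Hypothetical ground truth for one query in the evaluation set.
ground_truth = {"doc_12", "doc_34"}
metrics = retrieval_metrics(["doc_12", "doc_99", "doc_34", "doc_7"], ground_truth)
# 2 of 4 retrieved chunks are relevant, and both relevant chunks
# were found: precision 0.5, recall 1.0.
```

Run this over a fixed query set on every index rebuild, and a regression in either number becomes a visible event rather than a slow, silent drift.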
What Actually Works
Structure-aware chunking. Hybrid retrieval (dense + sparse). Re-ranking. Metadata filtering to narrow the retrieval space before semantic search. And a feedback loop that measures what matters: did the system answer the user's actual question correctly?
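One common way to combine dense and sparse result lists is reciprocal rank fusion, which needs only each retriever's ranking, not its raw scores. A minimal sketch, with hypothetical document IDs; the constant k = 60 is the conventional default from the RRF literature:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists: each document scores
    sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # e.g. from the vector index
sparse = ["doc_b", "doc_d", "doc_a"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense, sparse])
# doc_b ranks first: it places highly in both lists.
```

Because fusion operates on ranks, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.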
RAG done right is genuinely powerful. It just requires treating retrieval as an engineering problem, not a configuration exercise.