RAG Is Easy to Demo and Hard to Get Right
The engineering challenges hiding behind every 'semantic search' demo
Building a RAG demo takes an afternoon. Spin up a vector database, chunk some documents, embed them, retrieve on query, pass to GPT. Done. It feels like magic—until you try to use it for real. Then you discover that RAG is a deceptively deep engineering problem hiding behind a simple interface.
Why RAG Demos Are Misleading
Demos use clean, well-structured documents. Production data is messy: scanned PDFs, tables that span pages, code blocks, diagrams, inconsistent formatting. Your chunking strategy that worked beautifully on blog posts falls apart on technical manuals.
Demos cherry-pick queries where retrieval works. Production users ask ambiguous questions, use unexpected terminology, and want answers your documents don't contain. Your system hallucinates confidently instead of admitting uncertainty.
The Hard Problems in RAG
- Chunking strategy: How big should chunks be? Fixed size or semantic boundaries? How do you handle tables, code, or multi-page sections? There's no universal answer; it depends on your data and queries.
- Retrieval precision: Your system returns 5 chunks. Are they the right 5? How do you measure this? Golden test sets are hard to build and maintain, but without them, you're guessing.
- Hallucination control: The model invents plausible-sounding answers using bits of retrieved context but adding details that aren't there. How do you detect and prevent this?
- Query ambiguity: User asks 'How do I reset it?' Reset what? Without clarification, your system either guesses wrong or returns irrelevant chunks.
- Missing information: User asks a question your documents don't answer. Does your system admit 'I don't know' or hallucinate an answer? Most systems hallucinate.
- Metadata and filtering: You need to filter by document type, date, or access permissions. Now retrieval is multi-dimensional. How do you balance semantic similarity with metadata constraints?
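One common way to approach the metadata question is to treat metadata as a hard filter and similarity as the ranking signal. Here is a minimal sketch, assuming in-memory chunks with precomputed embeddings. `Chunk`, `filtered_search`, and the `doc_type` key are hypothetical names for illustration, and a real vector database would usually apply the filter inside the index rather than scan a list.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec: list[float], chunks: list[Chunk],
                    required: dict[str, str], k: int = 5) -> list[str]:
    # Hard constraints first: drop chunks whose metadata does not match.
    allowed = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in required.items())
    ]
    # Then rank only the survivors by semantic similarity.
    ranked = sorted(allowed, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return [c.chunk_id for c in ranked[:k]]


# Toy data (made up for illustration).
chunks = [
    Chunk("a", [1.0, 0.0], {"doc_type": "manual"}),
    Chunk("b", [0.9, 0.1], {"doc_type": "blog"}),
    Chunk("c", [0.0, 1.0], {"doc_type": "manual"}),
]
results = filtered_search([1.0, 0.0], chunks, {"doc_type": "manual"}, k=2)
print(results)  # ['a', 'c']
```

Note the trade-off this makes visible: chunk "b" is semantically close to the query but is excluded outright, and chunk "c" fills the slot despite a similarity of zero. Filtering first guarantees the constraint but can leave you with weak matches.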
What Production RAG Actually Requires
Hybrid retrieval: Dense embeddings alone aren't enough. Combine with sparse retrieval (BM25, keyword search) for better precision on specific terms and acronyms. Add reranking to improve top-k quality.
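To make the "combine" step concrete, here is a sketch of reciprocal rank fusion (RRF), one widely used way to merge a dense ranking and a sparse ranking without having to normalize their scores. This is one option among several, not the only way to do hybrid retrieval. The chunk IDs are made up, and `k=60` is the conventional default constant, which you may want to tune.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of IDs; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["c1", "c2", "c3"]   # e.g. from an embedding index
sparse_hits = ["c3", "c1", "c4"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that appear in both lists rise to the top, which is the behavior you want when a query mixes natural language with an exact term or acronym. A reranker would then reorder the fused top-k.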
Structured evaluation: Build golden test sets—queries with known correct answers. Measure retrieval precision (are the right chunks in the top-k?) and answer quality (is the final response accurate?). Track these metrics on every change.
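A golden test set for retrieval can start very small. The sketch below assumes each test case lists the chunk IDs a human judged relevant, and that your retriever is a function from query string to ranked chunk IDs. `GoldenCase`, `evaluate`, and the fake retriever are hypothetical names for illustration. Answer quality needs separate measurement and is not covered here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class GoldenCase:
    query: str
    relevant_ids: set[str]  # chunk IDs a human marked as correct for this query


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def evaluate(cases: list[GoldenCase], retrieve: Callable[[str], list[str]],
             k: int = 5) -> dict[str, float]:
    if not cases:
        raise ValueError("golden set is empty")
    runs = [(retrieve(case.query), case.relevant_ids) for case in cases]
    return {
        "recall@k": mean(recall_at_k(r, rel, k) for r, rel in runs),
        "precision@k": mean(precision_at_k(r, rel, k) for r, rel in runs),
    }


# A fake retriever standing in for the real system.
canned = {
    "reset router": ["c1", "c5", "c2"],
    "warranty length": ["c3", "c4", "c5"],
}
golden = [
    GoldenCase("reset router", {"c1", "c2"}),
    GoldenCase("warranty length", {"c9"}),
]
metrics = evaluate(golden, lambda q: canned[q], k=3)
print(metrics)
```

Run this on every change to chunking, embeddings, or retrieval settings and compare against the previous numbers.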
Citation requirements: Every claim in the generated response must link to a source chunk. No unsourced assertions. This makes hallucinations easier to catch and gives users a way to verify answers, which builds trust.
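A citation rule is only useful if something enforces it. The sketch below assumes the prompt tells the model to end each sentence with a marker like `[c1]` placed before the final punctuation. That marker format, the regexes, and the function name are assumptions for illustration. The check only confirms that a citation exists and points at a retrieved chunk; it does not verify that the chunk actually supports the claim.

```python
import re

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def unsupported_sentences(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return sentences with no citation, or with a citation to an unknown chunk."""
    flagged = []
    for sentence in SENTENCE_BREAK.split(answer):
        sentence = sentence.strip()
        if not sentence:
            continue
        cited = set(CITATION.findall(sentence))
        # Flag if nothing is cited, or if any cited ID was never retrieved.
        if not cited or not cited <= retrieved_ids:
            flagged.append(sentence)
    return flagged


answer = (
    "Hold the reset button for ten seconds [c1]. "
    "The light turns green. "
    "Firmware updates are automatic [c7]."
)
problems = unsupported_sentences(answer, retrieved_ids={"c1", "c2"})
print(problems)
# ['The light turns green.', 'Firmware updates are automatic [c7].']
```

Flagged sentences can be stripped, regenerated, or surfaced to the user as unverified, depending on how strict the product needs to be.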
Confidence scoring: When retrieval quality is low (e.g., no chunks score above a similarity threshold), the system should say 'I don't have enough information' instead of guessing. This requires tuning thresholds and fallback logic.
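The fallback logic itself is short; the hard part is picking the threshold. A minimal sketch, assuming the retriever returns `(chunk_id, similarity)` pairs and that `generate` is whatever function calls your LLM. Both are hypothetical here, and `0.75` is a placeholder to be tuned against a golden set, not a recommended value.

```python
from typing import Callable

FALLBACK = "I don't have enough information to answer that."


def answer_or_abstain(scored_chunks: list[tuple[str, float]],
                      generate: Callable[[list[str]], str],
                      min_score: float = 0.75,
                      min_chunks: int = 1) -> str:
    # Keep only chunks that clear the similarity threshold.
    confident = [chunk_id for chunk_id, score in scored_chunks if score >= min_score]
    if len(confident) < min_chunks:
        # Too little evidence: refuse instead of letting the model guess.
        return FALLBACK
    return generate(confident)


def fake_generate(chunk_ids: list[str]) -> str:
    """Stand-in for the real LLM call."""
    return "answer grounded in " + ", ".join(chunk_ids)


weak = answer_or_abstain([("c1", 0.42), ("c2", 0.38)], fake_generate)
strong = answer_or_abstain([("c1", 0.91), ("c2", 0.50)], fake_generate)
print(weak)    # I don't have enough information to answer that.
print(strong)  # answer grounded in c1
```

Raw similarity scores are not comparable across embedding models, so a threshold tuned for one model will likely need retuning after a model swap.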
Query preprocessing: Rewrite ambiguous queries for clarity. Expand abbreviations. Add context from conversation history. Small improvements in query quality have outsized effects on retrieval.
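Two of these steps can be sketched without any model at all. The example below assumes a hand-maintained abbreviation table and a list of prior user turns. The table entries, function names, and the output format of the history wrapper are all made up for illustration. Many systems use an LLM for the rewrite step instead; this deterministic version just shows the shape of the pipeline.

```python
import re

# Hypothetical domain glossary; in practice this comes from your own corpus.
ABBREVIATIONS = {"pw": "password", "2fa": "two-factor authentication"}


def expand_abbreviations(query: str, table: dict[str, str]) -> str:
    """Replace whole-word abbreviations, matching case-insensitively."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return table.get(word.lower(), word)
    return re.sub(r"\b\w+\b", swap, query)


def add_history_context(query: str, history: list[str], max_turns: int = 2) -> str:
    """Prefix the query with the most recent turns so references like 'it' can resolve."""
    recent = history[-max_turns:]
    if not recent:
        return query
    return "Previous turns: " + " | ".join(recent) + "\nCurrent question: " + query


expanded = expand_abbreviations("How do I reset my pw with 2FA?", ABBREVIATIONS)
print(expanded)
# How do I reset my password with two-factor authentication?

contextual = add_history_context("How do I reset it?", ["My router keeps dropping Wi-Fi."])
print(contextual)
```

With the history attached, 'How do I reset it?' carries the word "router" into retrieval, which is often enough to pull the right chunks.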
Chunking: The Underrated Bottleneck
Most teams chunk by token count (e.g., 512 tokens per chunk with a 50-token overlap). This is simple but often wrong. It splits tables, code blocks, and multi-paragraph explanations in arbitrary places.
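For reference, the fixed-size approach looks roughly like this. The sketch works on a plain list of tokens and uses numbered placeholder strings as a stand-in; a real pipeline would use the embedding model's own tokenizer. The function name is hypothetical.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The window boundaries land at token 462 and 924 no matter what the text says there, which is exactly how a table or code block ends up cut in half.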
Better: semantic chunking that respects document structure. Chunk by section headers, paragraphs, or logical units. Keep tables and code blocks intact. Extract metadata (section title, document type, page number) for filtering.
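A structure-aware chunker does not have to be sophisticated to beat arbitrary splits. The sketch below assumes Markdown-style input where sections start with `#` headers and code sits inside triple-backtick fences. It splits only at headers, ignores `#` lines inside a fence, and keeps the section title as metadata. `SectionChunk` and `chunk_by_headers` are hypothetical names, and the sketch does not handle sections that are too long for the embedding model.

```python
from dataclasses import dataclass

FENCE = "`" * 3  # the triple-backtick code-fence marker


@dataclass
class SectionChunk:
    title: str  # section header, kept as metadata for filtering
    text: str


def chunk_by_headers(markdown: str) -> list[SectionChunk]:
    sections: list[tuple[str, list[str]]] = [("(preamble)", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        if line.startswith("#") and not in_fence:
            # A real header: start a new section.
            sections.append((line.lstrip("#").strip(), []))
        else:
            sections[-1][1].append(line)
    chunks = []
    for title, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # skip empty sections, including an empty preamble
            chunks.append(SectionChunk(title=title, text=body))
    return chunks


doc = "\n".join([
    "# Setup",
    "Install the tool.",
    FENCE + "bash",
    "# this comment is not a header",
    "make install",
    FENCE,
    "# Usage",
    "Run it.",
])
sections = chunk_by_headers(doc)
print([s.title for s in sections])  # ['Setup', 'Usage']
```

Because splits happen only at headers, tables and code blocks inside a section stay in one piece. A reasonable next step is to fall back to paragraph-level splits when a single section exceeds the token budget.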
There's no perfect chunking strategy. What works depends on your documents and queries. You need eval sets to test and iterate.
The Bottom Line
RAG is not a solved problem. It's a systems engineering challenge disguised as an API call. The difference between a RAG demo and a production RAG system is the difference between a toy and a tool.
If you're building RAG, invest in evaluation infrastructure first. Build golden test sets. Measure retrieval and answer quality systematically. Iterate on chunking, retrieval, and prompt tuning based on data, not vibes. That's how you go from demo to production.