RAG in Production: The Retrieval Failures Nobody Writes About

Source: LLM on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Intermediate, medium

Summary

Production RAG systems often suffer subtle retrieval failures that produce incorrect or inconsistent answers even though the pipeline appears to work. This analysis identifies five common issues and a fix for each. First, semantic search returns chunks that are topically related but do not answer the query; a lightweight re-ranking step, in which a cross-encoder scores query-document relevance, mitigates this. Second, naive chunking splits critical context; semantic chunking with overlap that respects document structure addresses it. Third, large language models ignore the provided context; explicit, redundant instructions, a defined "I don't know" path, and required citations counter this. Fourth, answers degrade silently as source documents go out of date; content hashing, with a nightly Celery beat task that re-indexes only modified documents, keeps the index fresh. Fifth, there is a general lack of observability; logging full query traces and running LLM-as-judge evaluations address it. Together these failures show that retrieval must be treated as a core engineering problem that goes beyond basic vector similarity.
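The content-hashing idea can be illustrated with a short sketch. This is not the original article's code: the function names (content_hash, changed_doc_ids) and the in-memory dictionaries are assumptions made for illustration, and SHA-256 is one reasonable choice of hash rather than a confirmed detail. A scheduled job, such as the nightly Celery beat task the summary mentions, could call something like this and re-index only the ids it returns.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Return a stable SHA-256 fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_doc_ids(docs: Dict[str, str], stored_hashes: Dict[str, str]) -> List[str]:
    """Return ids of documents that are new or whose hash differs from the stored one.

    `docs` maps doc id -> current text; `stored_hashes` maps doc id -> hash
    recorded at the last indexing run (both are hypothetical structures).
    """
    return [
        doc_id
        for doc_id, text in docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]


# Example: only "pricing" changed since the last run, and "faq" is new.
stored = {"pricing": content_hash("Plan A costs 10"), "terms": content_hash("Terms v1")}
current = {"pricing": "Plan A costs 12", "terms": "Terms v1", "faq": "New page"}
to_reindex = changed_doc_ids(current, stored)
```

Unchanged documents are skipped, so the cost of a nightly run scales with the number of edits rather than the size of the corpus.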

Key takeaway

MLOps engineers deploying RAG systems should recognize that demo-level pipelines are not reliable enough for production. Engineer fixes for the common retrieval failures up front: implement re-ranking to ensure relevance, adopt semantic chunking to preserve context, enforce LLM adherence to context with explicit prompts, and use content hashing to keep data fresh. Log full query traces so that issues can be diagnosed efficiently, and treat retrieval as an engineering problem in its own right rather than a basic pipeline step.
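As a rough illustration of what logging a full query trace might involve, the sketch below records the query, the retrieved chunks with their scores, the final prompt, and the answer as one JSON line. The field names and the QueryTrace and RetrievedChunk classes are assumptions for illustration; the original article may log different fields.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RetrievedChunk:
    doc_id: str   # hypothetical identifier of the source document
    score: float  # similarity or re-ranker score
    text: str


@dataclass
class QueryTrace:
    query: str
    chunks: List[RetrievedChunk]
    prompt: str
    answer: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def trace_to_json(trace: QueryTrace) -> str:
    """Serialize one trace as a single JSON line, ready for a log sink."""
    return json.dumps(asdict(trace), ensure_ascii=False)


trace = QueryTrace(
    query="What is the refund window?",
    chunks=[RetrievedChunk(doc_id="policy-7", score=0.82, text="Refunds within 30 days.")],
    prompt="Answer using only the context...",
    answer="30 days.",
)
line = trace_to_json(trace)
```

With every stage captured in one record, a wrong answer can be traced back to whether the right chunk was retrieved, ranked highly, and actually included in the prompt.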

Key insights

Production RAG systems require robust engineering to overcome subtle retrieval and context-use failures.

Method

Add a re-ranking step after vector search using a cross-encoder. Use semantic chunking with overlap that respects document structure. Enforce context use through redundant instructions and explicit "I don't know" paths. Maintain data freshness with content hashing.
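The first two steps can be sketched in a few lines, under stated assumptions. Here chunk_paragraphs treats blank-line-separated paragraphs as the document structure and carries the last paragraph into the next chunk as overlap; real semantic chunking may use headings or embeddings instead. The rerank function accepts any scoring callable, and token_overlap_score is a deliberately simple stand-in for a cross-encoder so that the example runs without a model. All names are hypothetical.

```python
from typing import Callable, List, Tuple


def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1) -> List[str]:
    """Group whole paragraphs into chunks; repeat the last `overlap`
    paragraphs at the start of the next chunk. Paragraphs are never split."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        if current and len("\n\n".join(current + [para])) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score each (query, chunk) pair and keep the top_k highest-scoring chunks."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def token_overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the chunk."""
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0


chunks = chunk_paragraphs("A\n\nB\n\nC", max_chars=4, overlap=1)
candidates = [
    "refund policy for annual plans",
    "how to reset your password",
    "password rules and policy",
]
top = rerank("reset password", candidates, token_overlap_score, top_k=2)
```

In a real pipeline, score_fn would wrap a cross-encoder model that reads the query and chunk together, which is what lets it separate relevant chunks from merely on-topic ones.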

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.