RAG Is Not Machine Learning, and the ML Toolkit Solves the Wrong Problem
Summary
The article argues that RAG is not machine learning and that applying the traditional ML toolkit to RAG problems is a costly misconception. ML predicts answers; RAG finds answers that already exist in documents. The author details how common ML practices are misapplied in RAG: hyperparameter optimization (e.g., of chunk size and top-k), aggregate evaluation datasets, and feature-attribution explainability. Instead, RAG systems improve through engineering work such as better parsing, more precise retrieval, and clearer prompting. The piece frames RAG as a search engine combined with an LLM for answer generation, where the system's intelligence resides in the development team's domain expertise, not in the model itself. A case study illustrates how six months of ML-focused work failed to address a fundamental parsing issue, underscoring the case for a structural, engineering-centric approach.
Key takeaway
AI engineers and MLOps teams building RAG systems should recognize that RAG is an engineering assembly problem, not a model-training one. Stop tuning "hyperparameters" such as chunk size with ML tools; instead, design retrieval strategies structurally, based on document and question types. Focus evaluation on specific failure modes, such as parsing errors or low retrieval recall, rather than on aggregate accuracy, so that issues can be diagnosed and fixed efficiently. This approach prevents wasted effort and produces more robust systems.
Key insights
RAG is an engineering problem, not a machine learning problem: it requires assembling a search system and applying domain expertise.
Principles
- RAG failures are fixable bugs, not statistical noise.
- RAG explainability is documentary, not statistical.
- Intelligence in RAG systems resides in the team's domain expertise.
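The first principle above treats each failure as a bug with a location. A minimal sketch of that idea follows, assuming each test case records the expected answer string, the parsed document text, the retrieved chunks, and the generated answer. The function name `diagnose`, the stage labels, and the sample cases are hypothetical illustrations, not taken from the original article, and plain substring matching stands in for whatever matching a real system would need.

```python
from collections import Counter


def diagnose(expected: str, parsed_text: str,
             retrieved_chunks: list[str], answer: str) -> str:
    """Return the first pipeline stage where the expected answer goes missing."""
    needle = expected.lower()
    # If the parser never produced the answer text, nothing downstream can recover it.
    if needle not in parsed_text.lower():
        return "parsing"
    # The text was parsed, but no retrieved chunk contains it.
    if not any(needle in chunk.lower() for chunk in retrieved_chunks):
        return "retrieval"
    # The right chunk was retrieved, but the model's answer omits the fact.
    if needle not in answer.lower():
        return "generation"
    return "ok"


# Hypothetical test cases, one per outcome.
cases = [
    {"expected": "30 days",
     "parsed_text": "Refunds are accepted within 30 days.",
     "retrieved_chunks": ["Refunds are accepted within 30 days."],
     "answer": "You have 30 days."},
    {"expected": "4.5%",
     "parsed_text": "Rate table: [unreadable]",
     "retrieved_chunks": ["Rate table: [unreadable]"],
     "answer": "I could not find the rate."},
    {"expected": "Berlin",
     "parsed_text": "The head office is in Berlin.",
     "retrieved_chunks": ["Our offices are open 9 to 5."],
     "answer": "The office location is not stated."},
]

# Count failures per stage instead of reporting one aggregate accuracy number.
failure_counts = Counter(diagnose(**case) for case in cases)
print(failure_counts)
```

Counting by stage points at the component to fix: a cluster of "parsing" failures calls for parser work, not for tuning chunk size or top-k.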
Method
Improve RAG by routing different question types to specific retrieval strategies, favoring structural decisions over numerical optimization. Evaluate with per-failure-mode metrics rather than a single aggregate score.
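The routing idea can be sketched as a small dispatcher. This is an illustration under stated assumptions, not the article's implementation: the keyword pattern `LOOKUP_PATTERN`, the two question types, and the rule that sections are separated by blank lines are all hypothetical and would be replaced by rules drawn from the team's knowledge of its own documents.

```python
import re


def chunk_by_line(text: str) -> list[str]:
    """One chunk per non-empty line; suits lookups of a single value."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def chunk_by_section(text: str) -> list[str]:
    """One chunk per section; sections are assumed to be separated by blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


# Hypothetical rule: questions asking for a single value go to line-level chunks.
LOOKUP_PATTERN = re.compile(
    r"\b(how much|how many|what is the (price|rate|date|number)|when)\b",
    re.IGNORECASE,
)

STRATEGIES = {"line": chunk_by_line, "section": chunk_by_section}


def route(question: str) -> str:
    """Pick a chunking strategy name from the question's wording."""
    return "line" if LOOKUP_PATTERN.search(question) else "section"


def chunks_for(question: str, document: str) -> list[str]:
    """Chunk the document with the strategy chosen for this question."""
    return STRATEGIES[route(question)](document)


document = (
    "Pricing\nBasic plan: $10\nPro plan: $25\n\n"
    "Support\nEmail support is included in every plan."
)
print(route("How much is the Pro plan?"))
print(chunks_for("Explain the support policy.", document))
```

The choice between strategies here is a structural decision written down as a rule, so a misrouted question can be traced to that rule and corrected directly.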
In practice
- Route questions to different chunking strategies (e.g., by line, section).
- Prioritize retrieval evaluation over generation evaluation.
- Provide citations as the primary explanation for RAG answers.
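Two of the practices above, retrieval evaluation and citations, can be illustrated with a short sketch. It assumes chunks carry string IDs and that each source records a document name and page number; the names `recall_at_k` and `cite`, the citation format, and the sample values are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved chunks."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def cite(answer: str, sources: list[dict]) -> str:
    """Append the supporting documents to the answer as its explanation."""
    refs = "; ".join(f"{s['doc']} p.{s['page']}" for s in sources)
    return f"{answer} [Sources: {refs}]"


# Hypothetical retrieval result: ranked chunk IDs and the chunks known to be relevant.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4"}

print(recall_at_k(retrieved, relevant, k=3))
print(cite("Refunds are accepted within 30 days.",
           [{"doc": "policy.pdf", "page": 3}]))
```

Measuring recall before judging generated answers separates "the right chunk never arrived" from "the model misread the chunk", which is the distinction the list above asks teams to prioritize.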
Topics
- Retrieval-Augmented Generation
- Information Retrieval
- RAG System Design
- Evaluation Metrics
- Prompt Engineering
- Document Intelligence
Best for: AI Engineer, MLOps Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.