I Built a RAG Evaluation Framework from Scratch. Here’s What Broke It.
Summary
An evaluation framework for Retrieval-Augmented Generation (RAG) pipelines was built from scratch using 30 ATSB aviation incident investigation reports and 150 evaluation questions across three difficulty levels. The framework used `all-MiniLM-L6-v2` for embeddings, ChromaDB as the vector store, LangChain's `RecursiveCharacterTextSplitter` for chunking, `pymupdf` for PDF parsing, and the Claude API as the evaluation judge. Experiments varied chunk size, top-k retrieval, and chunk overlap, and measured Precision@k for each configuration. Key findings include the interaction between chunk size and query complexity, the silent nature of retrieval failures, and the critical importance of a high-quality evaluation set. A chunk size of 500 tokens performed well in general, but 200-token chunks did better on hard analytical questions. Increasing top-k consistently improved precision, while zero chunk overlap surprisingly outperformed configurations with overlap.
Key takeaway
For AI engineers building RAG systems, the evaluation set is paramount: flawed questions can hide how well a pipeline actually performs. Build a high-quality, document-specific evaluation set before optimizing retrieval parameters. If your system handles diverse query types, use a tiered approach: route factual lookups to a smaller top-k with larger chunks, and route complex analytical queries to smaller chunks with a larger top-k, to balance efficiency and accuracy.
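The tiered approach could be sketched roughly as follows. This is a minimal illustration, not the article's implementation: `RetrievalConfig`, `route`, and the cue-word list are hypothetical names, and `top_k=3` for factual lookups is an assumed placeholder. Only the 500-token and 200-token chunk sizes and k=10 come from the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    chunk_size: int  # tokens per chunk in the index to query
    top_k: int       # number of chunks to retrieve


# Chunk sizes and k=10 follow the summary; top_k=3 is an assumed placeholder.
FACTUAL = RetrievalConfig(chunk_size=500, top_k=3)
ANALYTICAL = RetrievalConfig(chunk_size=200, top_k=10)

# Hypothetical cue words. A production router would more likely use a
# trained classifier or an LLM call than keyword matching.
ANALYTICAL_CUES = {"why", "how", "compare", "explain", "contributing"}


def route(query: str) -> RetrievalConfig:
    """Pick a retrieval configuration from a crude guess at query type."""
    tokens = {word.strip("?.,!") for word in query.lower().split()}
    if tokens & ANALYTICAL_CUES:
        return ANALYTICAL
    return FACTUAL


if __name__ == "__main__":
    print(route("What was the aircraft registration?"))
    print(route("Why did the engine fail during climb?"))
```

Note that routing by chunk size implies maintaining one index per chunk size, since the chunking happens at ingestion time rather than at query time.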
Key insights
Rigorous RAG evaluation shows that the optimal configuration depends on query complexity, and it highlights the critical role of eval set quality.
Principles
- Chunk size and query complexity interact.
- Retrieval failures are often silent.
- Eval set quality is foundational.
Method
A RAG evaluation framework was built using 30 technical documents and 150 questions, measuring Precision@k while varying chunk size, top-k retrieval, and chunk overlap to identify optimal configurations.
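For reference, Precision@k is the fraction of the top-k retrieved chunks that are relevant. The sketch below assumes relevance is already available as a set of chunk IDs per question; in the article relevance was judged via the Claude API, and the exact scoring procedure is not given in this summary. The function names and sample IDs are illustrative.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the first k retrieved chunk IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    # Divide by k (not by the number returned), so short result lists are penalized.
    return hits / k


def mean_precision_at_k(results: Iterable[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Precision@k over (retrieved, relevant) pairs, one pair per question."""
    scores = [precision_at_k(retrieved, relevant, k) for retrieved, relevant in results]
    if not scores:
        raise ValueError("results must contain at least one question")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    retrieved = ["c1", "c7", "c3", "c9", "c2"]  # ranked retrieval output
    relevant = {"c1", "c3", "c4"}               # judged-relevant chunks
    print(precision_at_k(retrieved, relevant, k=3))
    print(precision_at_k(retrieved, relevant, k=5))
```

Running the same function over every combination of chunk size, top-k, and overlap gives a grid of scores to compare across difficulty levels.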
In practice
- Start with 500-token chunks for mixed workloads.
- Use 0-50 tokens for chunk overlap.
- Consider k=10 for analytical queries.
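To make the chunk size and overlap settings concrete, here is a simplified fixed-window chunker. It is a stand-in for illustration only: it is not LangChain's `RecursiveCharacterTextSplitter`, which splits recursively on separators, and it treats whitespace-separated words as tokens, which is an assumption. The function name `chunk_tokens` is hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split tokens into windows of chunk_size, sharing `overlap` tokens between neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks


if __name__ == "__main__":
    words = "the pilot reported a loss of engine power shortly after takeoff".split()
    for overlap in (0, 2):
        chunks = chunk_tokens(words, chunk_size=4, overlap=overlap)
        print(overlap, len(chunks), chunks)
```

With the same chunk size, a larger overlap produces more chunks that repeat each other's text, so near-duplicate chunks can compete for the same top-k slots. That is one possible reason zero overlap scored well, though the summary does not state the cause.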
Topics
- RAG Evaluation Framework
- Chunk Size Optimization
- Retrieval Precision
- Evaluation Dataset Quality
- Query Complexity
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.