I Built a RAG Evaluation Framework from Scratch. Here’s What Broke It.

2026-05-05 · Source: Naturallanguageprocessing on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

An evaluation framework for Retrieval-Augmented Generation (RAG) pipelines was built from scratch using 30 ATSB aviation incident investigation reports and 150 evaluation questions across three difficulty levels. The framework used `all-MiniLM-L6-v2` for embeddings, ChromaDB as the vector store, LangChain's `RecursiveCharacterTextSplitter` for chunking, `pymupdf` for PDF parsing, and the Claude API as the evaluation judge. Experiments varied chunk size, top-k retrieval, and chunk overlap, and measured Precision@k for each configuration. Three findings stand out: chunk size interacts with query complexity, retrieval failures are silent, and the quality of the evaluation set is critical. A chunk size of 500 tokens performed well overall, but 200 tokens worked better for hard analytical questions. Increasing top-k consistently improved precision, and zero chunk overlap outperformed configurations with overlap, a surprising result.
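For readers unfamiliar with the metric, Precision@k can be sketched as below. This is a minimal illustration, not the article's implementation: it assumes relevance is already available as a set of chunk IDs (in the article, relevance is judged via the Claude API), and the function name and sample IDs are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant.

    Divides by k even if fewer than k chunks were retrieved, so a short
    result list is penalised rather than rewarded.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / k


# Hypothetical example: chunk IDs in ranked order, plus the judged-relevant set.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
score = precision_at_k(retrieved, relevant, k=5)  # 2 relevant chunks among 5
```

Averaging this score over all questions in a difficulty tier gives one number per configuration, which makes configurations comparable.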

Key takeaway

For AI engineers building RAG systems, the evaluation set is paramount: flawed questions can make a well-performing pipeline look like it is failing. Build a high-quality, document-specific evaluation set before tuning retrieval parameters. If your system handles diverse query types, use a tiered approach to balance efficiency and accuracy: route factual lookups to larger chunks with a smaller top-k, and complex analytical queries to smaller chunks with a larger top-k.
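The tiered approach could look something like the sketch below. The chunk sizes (500 and 200) and k=10 for analytical queries come from this summary; the `top_k=3` value for factual lookups, the query-type labels, and all names are assumptions made for illustration. How a query gets classified is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalConfig:
    chunk_size: int  # chunk size of the index to query
    top_k: int       # number of chunks to retrieve


# Hypothetical routing table. top_k=3 for factual lookups is an assumed value.
CONFIGS = {
    "factual": RetrievalConfig(chunk_size=500, top_k=3),
    "analytical": RetrievalConfig(chunk_size=200, top_k=10),
}


def route(query_type: str) -> RetrievalConfig:
    """Return the retrieval settings for a query type.

    Unknown types fall back to the factual settings (an assumed default).
    """
    return CONFIGS.get(query_type, CONFIGS["factual"])
```

Note that serving two chunk sizes means maintaining two indexes over the same documents, which is part of the efficiency trade-off.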

Key insights

Rigorous RAG evaluation shows that the optimal configuration depends on query complexity, and that eval set quality is critical.

Principles

Chunk size and query complexity interact.
Retrieval failures are often silent.
Eval set quality is foundational.

Method

A RAG evaluation framework was built using 30 technical documents and 150 questions, measuring Precision@k while varying chunk size, top-k retrieval, and chunk overlap to identify optimal configurations.
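A parameter sweep of this kind can be laid out as a simple grid. The sketch below is illustrative only: the values 200, 500, 0, 50, and 10 appear in this summary, while the remaining grid values and the `build_grid` helper are hypothetical. Each resulting configuration would be indexed, queried with the evaluation questions, and scored.

```python
from itertools import product
from typing import Dict, List, Sequence

# 1000, 3 and 5 are assumed grid values, not taken from the article.
CHUNK_SIZES = [200, 500, 1000]
TOP_KS = [3, 5, 10]
OVERLAPS = [0, 50]


def build_grid(
    chunk_sizes: Sequence[int],
    top_ks: Sequence[int],
    overlaps: Sequence[int],
) -> List[Dict[str, int]]:
    """Enumerate every valid (chunk_size, top_k, chunk_overlap) combination."""
    return [
        {"chunk_size": size, "top_k": k, "chunk_overlap": overlap}
        for size, k, overlap in product(chunk_sizes, top_ks, overlaps)
        if overlap < size  # overlap must be smaller than the chunk itself
    ]


grid = build_grid(CHUNK_SIZES, TOP_KS, OVERLAPS)
```

Changing chunk size or overlap requires re-chunking and re-embedding the documents, whereas changing top-k only changes the query, so it is cheaper to loop over top-k innermost.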

In practice

Start with 500-token chunks for mixed workloads.
Use 0-50 tokens for chunk overlap.
Consider k=10 for analytical queries.

Topics

RAG Evaluation Framework
Chunk Size Optimization
Retrieval Precision
Evaluation Dataset Quality
Query Complexity

Code references

sakshamnagpal/evaluate_rag_pipeline

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.