How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation
Summary
HieraRAG is a hierarchical framework designed to guide practitioners in determining the optimal granularity for retrieval-augmented generation (RAG) system benchmarks. It defines optimal granularity as the level that maximizes discriminative power, measured by the standard deviation of generation quality across categories within a specific RAG configuration. As a case study, the framework generated 5,872 synthetic question-answer pairs from FineWeb-10BT, varying three dimensions—Question Complexity, Answer Type, and Linguistic Variation—across 2, 4, and 8 granularity levels. Using a BM25+Falcon-3-10B pipeline, findings showed optimal granularity is dimension-dependent: complexity benefits from fine-grained distinctions (discriminative power: 0.053), while answer type and linguistic variation perform best at medium granularity. HieraRAG also introduces a Coherence Ratio metric to assess how cleanly fine-grained splits subdivide parent categories, revealing structural differences like Question Complexity at 0.40 versus Answer Type at 1.44. Human evaluation of 110 QA pairs validated the synthetic data quality, confirming HieraRAG's utility as a portable procedure for RAG evaluation.
Key takeaway
If you design evaluation benchmarks for RAG systems, do not apply one granularity to every question dimension. Use HieraRAG's portable procedure to determine empirically which granularity maximizes discriminative power for each dimension under your own RAG configuration. In the paper's BM25+Falcon-3-10B case study, this meant fine-grained distinctions for question complexity and medium granularity for answer type and linguistic variation.
Key insights
The optimal granularity of a RAG benchmark depends on the question dimension; for each dimension, the best level is the one that maximizes discriminative power.
Principles
- Optimal granularity maximizes discriminative power.
- Question dimensions require varied granularity.
- Coherence Ratio assesses category subdivision.
Method
HieraRAG defines optimal granularity by maximizing discriminative power. It generates synthetic QA pairs across hierarchical dimensions and levels, then evaluates performance and category subdivision using a Coherence Ratio metric.
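The selection rule can be illustrated with a minimal sketch. It assumes discriminative power is the population standard deviation of per-category mean quality scores; the summary says only "standard deviation", so that choice, the function names, the category labels, and the scores below are all hypothetical placeholders rather than the paper's implementation or data.

```python
from statistics import mean, pstdev


def discriminative_power(scores_by_category):
    """Standard deviation of the per-category mean quality scores.

    Assumption: population standard deviation (pstdev) over category means.
    """
    category_means = [mean(scores) for scores in scores_by_category.values()]
    return pstdev(category_means)


def best_granularity(scores_by_level):
    """Return the level with the highest discriminative power, plus all powers."""
    powers = {
        level: discriminative_power(categories)
        for level, categories in scores_by_level.items()
    }
    return max(powers, key=powers.get), powers


# Placeholder quality scores for one dimension, keyed by number of categories.
# These values are invented for illustration and are not results from the paper.
complexity_scores = {
    2: {"simple": [0.8, 0.6], "complex": [0.5, 0.7]},
    4: {
        "lookup": [0.9, 0.7],
        "single_hop": [0.7, 0.5],
        "multi_hop": [0.5, 0.3],
        "analytical": [0.3, 0.1],
    },
}

best_level, powers = best_granularity(complexity_scores)
print(best_level, powers)
```

Running the same function once per dimension, on scores from your own pipeline, gives a dimension-specific choice of granularity instead of a single level for the whole benchmark.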
In practice
- Apply HieraRAG to tailor RAG benchmark granularity.
- Use fine-grained splits for question complexity.
- Test medium granularity for answer type and linguistic variation.
Topics
- Retrieval-Augmented Generation
- RAG Benchmarking
- Synthetic Question Generation
- Evaluation Granularity
- Discriminative Power
- Coherence Ratio
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.