How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

2026-06-11 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

HieraRAG is a hierarchical framework that helps practitioners choose the granularity of a retrieval-augmented generation (RAG) benchmark. It defines the optimal granularity as the level that maximizes discriminative power, measured as the standard deviation of generation quality across question categories for a given RAG configuration. As a case study, the framework was used to generate 5,872 synthetic question-answer pairs from FineWeb-10BT, varying three dimensions (Question Complexity, Answer Type, and Linguistic Variation) across 2, 4, and 8 granularity levels. With a BM25+Falcon-3-10B pipeline, the optimal granularity proved to be dimension-dependent: Question Complexity benefits from fine-grained distinctions (discriminative power: 0.053), while Answer Type and Linguistic Variation perform best at medium granularity. HieraRAG also introduces a Coherence Ratio metric that assesses how cleanly fine-grained splits subdivide their parent categories; it reveals structural differences between dimensions, such as 0.40 for Question Complexity versus 1.44 for Answer Type. A human evaluation of 110 QA pairs validated the quality of the synthetic data, supporting HieraRAG's use as a portable procedure for RAG evaluation.

Key takeaway

If you design evaluation benchmarks for RAG systems, do not apply a single level of question granularity across the board. Use HieraRAG's portable procedure to determine empirically the best granularity for each dimension under your own RAG configuration. In the reported case study, that meant fine-grained distinctions for question complexity and medium granularity for answer type and linguistic variation, the settings that maximized discriminative power.

Key insights

The optimal granularity of a RAG benchmark varies by dimension; choosing, for each dimension, the level that maximizes discriminative power makes the evaluation more effective.

Principles

Optimal granularity maximizes discriminative power.
Question dimensions require varied granularity.
Coherence Ratio assesses category subdivision.

Method

HieraRAG defines optimal granularity by maximizing discriminative power. It generates synthetic QA pairs across hierarchical dimensions and levels, then evaluates performance and category subdivision using a Coherence Ratio metric.
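
The discriminative-power measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the category labels, and the scores are hypothetical. Averaging scores within each category first and using the population standard deviation are both assumptions, since the summary only says "standard deviation of generation quality across categories".

```python
from statistics import pstdev


def discriminative_power(scores_by_category):
    """Spread of mean generation quality across question categories.

    Assumption: scores are averaged within each category first, and the
    population standard deviation is taken over those category means.
    """
    means = [sum(s) / len(s) for s in scores_by_category.values() if s]
    if len(means) < 2:
        return 0.0  # a single category cannot discriminate anything
    return pstdev(means)


# Hypothetical per-question quality scores in [0, 1], grouped by category.
example_scores = {
    "simple": [0.8, 0.9, 0.7],
    "multi_hop": [0.5, 0.6, 0.4],
}
power = discriminative_power(example_scores)
print(f"discriminative power: {power:.3f}")
```

A higher value means the categories separate the system's behavior more clearly, which is the property the framework uses to compare granularity levels.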

In practice

Apply HieraRAG to tailor RAG benchmark granularity.
Use fine-grained splits for question complexity.
Test medium granularity for answer type and linguistic variation.
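
Choosing a granularity for one dimension then amounts to comparing discriminative power across the candidate levels and keeping the largest. The sketch below assumes per-category mean scores are already available for each level; the helper name and all numbers are invented for illustration and are not results from the paper.

```python
from statistics import pstdev


def best_granularity(category_means_by_level):
    """Return (best_level, powers) for one question dimension.

    category_means_by_level maps a granularity level (number of
    categories) to a dict of {category: mean quality score}.
    Assumption: population standard deviation, ties go to the first level.
    """
    powers = {
        level: pstdev(list(means.values()))
        for level, means in category_means_by_level.items()
    }
    best = max(powers, key=powers.get)
    return best, powers


# Hypothetical mean quality per category at 2, 4, and 8 categories.
answer_type = {
    2: {"a": 0.60, "b": 0.64},
    4: {"a": 0.55, "b": 0.60, "c": 0.65, "d": 0.70},
    8: {"a": 0.60, "b": 0.61, "c": 0.62, "d": 0.63,
        "e": 0.62, "f": 0.63, "g": 0.64, "h": 0.65},
}
best, powers = best_granularity(answer_type)
print(best, {level: round(p, 3) for level, p in powers.items()})
```

Running the same comparison separately for each dimension is what allows the chosen level to differ between, say, question complexity and answer type.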

Topics

Retrieval-Augmented Generation
RAG Benchmarking
Synthetic Question Generation
Evaluation Granularity
Discriminative Power
Coherence Ratio

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.