SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Summary
SPARTA is an automated framework for generating large-scale, high-fidelity Table-Text Question Answering (QA) benchmarks. It addresses limitations of existing benchmarks, which are often small and manually curated, and which rarely require complex multi-hop reasoning or advanced analytical operations such as aggregation and grouping. SPARTA constructs a reference fact database by enriching source tables with atomic facts extracted from unstructured passages. It then synthesizes nested queries, using provenance-based refinement to keep the queries executable and realistic-structure enforcement to keep the resulting natural-language questions fluent. This pipeline generates thousands of question-answer pairs that demand deep multi-hop reasoning across text and tables. State-of-the-art models that perform well on HybridQA (over 70 F1) and OTT-QA (over 50 F1) drop by more than 30 F1 points on SPARTA, highlighting deficiencies in current cross-modal reasoning capabilities.
Key takeaway
For research scientists developing Table-Text QA models, SPARTA exposes critical gaps in current cross-modal reasoning. Evaluate your models against SPARTA to identify weaknesses in handling complex multi-hop questions, aggregations, and grouping operations. Then focus development effort on those specific areas rather than relying solely on existing, less challenging benchmarks.
Key insights
SPARTA automatically generates complex multi-hop Table-Text QA benchmarks, revealing weaknesses in current cross-modal reasoning models.
Principles
- Automated generation reduces manual curation errors.
- Nested queries enable deep multi-hop reasoning.
- Provenance-based refinement ensures query executability.
Method
SPARTA constructs a fact database by enriching tables with text-extracted facts, then synthesizes nested queries using provenance-based refinement and realistic-structure enforcement to generate high-fidelity QA pairs.
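The idea of a reference fact database queried by a nested query can be illustrated with a minimal sketch. It assumes SQL-style queries run through Python's built-in sqlite3 module. The table names, columns, and rows are hypothetical and are not taken from SPARTA itself. One table stands in for a source table, and a second holds atomic facts as if they had been extracted from passages.

```python
import sqlite3

# Hypothetical toy data: not from SPARTA, for illustration only.
SCHEMA_AND_DATA = """
CREATE TABLE players (name TEXT PRIMARY KEY, team TEXT, points INTEGER);
INSERT INTO players VALUES
    ('Ada', 'Hawks', 30),
    ('Ben', 'Hawks', 12),
    ('Cy',  'Owls',  25);

-- Atomic facts, as if extracted from unstructured passages.
CREATE TABLE text_facts (name TEXT, attribute TEXT, value TEXT);
INSERT INTO text_facts VALUES
    ('Ada', 'birthplace', 'Oslo'),
    ('Ben', 'birthplace', 'Lima'),
    ('Cy',  'birthplace', 'Oslo');
"""

# Nested query: the inner SELECT reads the text-derived facts, the outer
# SELECT filters the table on them, then groups and aggregates.
# Question it could correspond to:
# "For players born in Oslo, what is the total number of points per team?"
NESTED_QUERY = """
SELECT team, SUM(points) AS total_points
FROM players
WHERE name IN (
    SELECT name FROM text_facts
    WHERE attribute = 'birthplace' AND value = 'Oslo'
)
GROUP BY team
ORDER BY team
"""


def build_reference_db() -> sqlite3.Connection:
    """Create an in-memory database holding the table and the text facts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA_AND_DATA)
    return conn


def run_query(conn: sqlite3.Connection, query: str) -> list:
    """Execute a query and return all rows; an empty list means no answer."""
    return conn.execute(query).fetchall()


rows = run_query(build_reference_db(), NESTED_QUERY)
print(rows)
```

Answering the question requires a hop from the text-derived facts to the table, followed by grouping and aggregation. A generated query that returns no rows would be a candidate for refinement or rejection, which is the kind of executability check the summary attributes to provenance-based refinement.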
In practice
- Use SPARTA to evaluate multi-hop QA model robustness.
- Analyze model failures on SPARTA's complex queries.
- Explore SPARTA's code for benchmark generation.
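When analyzing failures, a per-question score helps locate where a model breaks down. The sketch below assumes the token-overlap F1 commonly used in extractive QA evaluation. The exact metric and answer normalization used by SPARTA are not specified in this summary, so treat the function name and the lowercase, whitespace-based tokenization as illustrative choices.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string.

    Assumption: lowercase, whitespace tokenization; real benchmarks often
    apply extra normalization (punctuation, articles) not shown here.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions paired with gold answers.
examples = [("Oslo", "Oslo"), ("New York City", "New York"), ("Lima", "Oslo")]
scores = [token_f1(pred, gold) for pred, gold in examples]
print(scores)
```

Grouping such scores by question type (for example, by number of hops or by whether aggregation is required) is one way to see which categories account for a drop in overall F1.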
Topics
- Table-Text QA
- Multi-hop Reasoning
- Benchmark Generation
- Cross-modal Reasoning
- Complex Query Generation
Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.