SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

2026-02-26 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

SPARTA is an automated framework for generating large-scale, high-fidelity Table-Text Question Answering (QA) benchmarks. It addresses limitations of existing benchmarks, which are often small, manually curated, and short on complex multi-hop reasoning or advanced analytical operations such as aggregation and grouping. SPARTA constructs a reference fact database by enriching source tables with atomic facts extracted from unstructured passages. It then synthesizes nested queries, using provenance-based refinement and realistic-structure enforcement to keep the queries executable and the resulting natural-language questions fluent. This pipeline generates thousands of question-answer pairs that demand deep multi-hop reasoning across text and tables. State-of-the-art models that perform well on HybridQA (over 70 F1) and OTT-QA (over 50 F1) drop by more than 30 F1 points on SPARTA, highlighting deficiencies in current cross-modal reasoning capabilities.

Key takeaway

For research scientists developing Table-Text QA models, SPARTA exposes critical gaps in current cross-modal reasoning. Evaluate your models against SPARTA to identify weaknesses in handling complex multi-hop questions, aggregation, and grouping, then focus development on those specific areas rather than relying solely on existing, less challenging benchmarks.

Key insights

SPARTA automatically generates complex multi-hop Table-Text QA benchmarks, revealing weaknesses in current cross-modal reasoning models.

Principles

Automated generation reduces manual curation errors.
Nested queries enable deep multi-hop reasoning.
Provenance-based refinement ensures query executability.

Method

SPARTA constructs a fact database by enriching tables with text-extracted facts, then synthesizes nested queries using provenance-based refinement and realistic-structure enforcement to generate high-fidelity QA pairs.
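
The idea of a fact database queried with nested queries can be illustrated with a minimal sketch. It is not SPARTA's actual schema or generator: the table names (player, team_fact), columns, and rows are hypothetical, and SQLite is used only because it ships with Python. One table stands in for a source table, a second holds atomic facts assumed to have been extracted from passages, and the nested query needs both to produce an aggregated, grouped answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source table (the structured side).
cur.execute("CREATE TABLE player (name TEXT PRIMARY KEY, team TEXT, points INTEGER)")
cur.executemany(
    "INSERT INTO player VALUES (?, ?, ?)",
    [("Ann", "Hawks", 30), ("Bob", "Hawks", 10), ("Cid", "Owls", 25), ("Dee", "Owls", 5)],
)

# Hypothetical atomic facts, assumed to come from unstructured passages,
# stored one fact per row so they can be joined with the source table.
cur.execute("CREATE TABLE team_fact (team TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO team_fact VALUES (?, ?)",
    [("Hawks", "Atlanta"), ("Owls", "Houston")],
)

# Nested query: the inner hop resolves a text-derived fact (city -> team),
# the outer hop aggregates and groups over the table rows.
nested_sql = """
SELECT team, AVG(points) AS avg_points
FROM player
WHERE team IN (SELECT team FROM team_fact WHERE city = 'Atlanta')
GROUP BY team
"""
rows = cur.execute(nested_sql).fetchall()
print(rows)  # [('Hawks', 20.0)]
```

A question corresponding to this query might read "What is the average number of points scored by players on the team based in Atlanta?". Answering it requires one hop through the text-derived fact and one aggregation over the table.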

In practice

Use SPARTA to evaluate multi-hop QA model robustness.
Analyze model failures on SPARTA's complex queries.
Explore SPARTA's code for benchmark generation.
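
When analyzing failures, it helps to score each prediction individually. The sketch below computes a token-overlap F1 in the style commonly used for QA benchmarks; it is an assumption that this approximates the F1 reported above, and SPARTA's own evaluation script may normalize answers differently. The function name token_f1 is illustrative.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# A partially correct answer gets partial credit.
score = token_f1("Hawks", "Atlanta Hawks")
```

Sorting examples by this score and grouping them by query shape (for instance, with or without aggregation) is one way to see which operations a model handles worst.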

Topics

Table-Text QA
Multi-hop Reasoning
Benchmark Generation
Cross-modal Reasoning
Complex Query Generation

Code references

pshlego/SPARTA

Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.