DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
Summary
DecompSR is a large benchmark dataset and generation framework, comprising over 5 million datapoints, for analyzing compositional spatial reasoning in Large Language Models (LLMs). The framework lets each aspect of compositionality be varied independently: reasoning depth (productivity), entity and linguistic variability (substitutivity), input order and distractors (overgeneralization), and novel linguistic elements (systematicity). DecompSR is constructed procedurally so that it is correct by design, and its correctness is independently verified with a symbolic solver. Benchmarks across a range of LLMs show that the models struggle with productive and systematic generalization in spatial reasoning tasks but are more robust to linguistic variation. The dataset therefore offers a provably correct, rigorous tool for fine-grained probing of LLM compositional reasoning.
Key takeaway
For AI scientists and machine learning engineers evaluating LLM capabilities, DecompSR offers a way to pinpoint where compositional spatial reasoning breaks down. Because the dataset lets you vary aspects such as reasoning depth and systematicity one at a time, it can isolate failure modes that general benchmarks blend together. This supports targeted model improvements in the areas where current LLMs struggle, namely productive and systematic generalization, while accounting for their relative strength in handling linguistic variability.
Key insights
DecompSR is a provably correct dataset for fine-grained analysis of LLM compositional spatial reasoning.
Principles
- Compositionality has distinct, variable aspects.
- LLMs struggle with productive and systematic generalization.
- Linguistic variation is less challenging for LLMs.
Method
DecompSR is built procedurally, so every datapoint is correct by construction. A symbolic solver then independently verifies the dataset, confirming that the answers used for compositional spatial reasoning analysis are accurate.
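The idea of "correct by construction, then independently verified" can be illustrated with a small sketch. This is not the DecompSR generator or its solver: the relation vocabulary, the entity names `E0..En`, the grid offsets, and the functions `generate` and `solve` are all hypothetical choices made for illustration. The generator tracks the answer while it builds a chain of statements, and a separate routine recovers the same answer from the text alone.

```python
import random

# Hypothetical relation vocabulary: each phrase maps to a unit grid offset (dx, dy).
OFFSETS = {
    "left of": (-1, 0),
    "right of": (1, 0),
    "above": (0, 1),
    "below": (0, -1),
}


def generate(depth, seed=0):
    """Build a chain of `depth` statements linking entities E0..E{depth}.

    The position of the last entity relative to E0 is tracked while the
    chain is built, so the answer is known by construction.
    """
    rng = random.Random(seed)
    facts = []
    x = y = 0
    for i in range(depth):
        rel = rng.choice(sorted(OFFSETS))
        dx, dy = OFFSETS[rel]
        x, y = x + dx, y + dy
        facts.append(f"E{i + 1} is {rel} E{i}.")
    return facts, (x, y)


def solve(facts):
    """Independent check: parse the text and propagate coordinates from E0."""
    pos = {"E0": (0, 0)}
    pending = list(facts)
    while pending:
        remaining = []
        for fact in pending:
            head, rest = fact.rstrip(".").split(" is ", 1)
            rel, tail = rest.rsplit(" ", 1)
            if tail in pos:
                dx, dy = OFFSETS[rel]
                pos[head] = (pos[tail][0] + dx, pos[tail][1] + dy)
            else:
                remaining.append(fact)  # retry once its anchor is placed
        if len(remaining) == len(pending):
            raise ValueError("some statements cannot be linked to E0")
        pending = remaining
    return pos


facts, answer = generate(depth=5, seed=1)
assert solve(facts)["E5"] == answer
```

Because `solve` only reads the generated text, agreement between the two paths is a check on the generator rather than a restatement of it.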
In practice
- Vary reasoning depth in LLM evaluations.
- Test LLMs with novel linguistic elements.
- Assess LLM robustness to input order.
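The three probes above can be sketched as simple transformations of a single story. The story text, the made-up word "blicket", and the helpers `shuffle_order` and `substitute` are hypothetical and are not taken from DecompSR; the sketch only shows how each factor can be changed while the others stay fixed.

```python
import random


def shuffle_order(facts, seed=0):
    """Return the same statements in a different order (input-order probe)."""
    rng = random.Random(seed)
    shuffled = list(facts)
    rng.shuffle(shuffled)
    return shuffled


def substitute(facts, mapping):
    """Swap whole words via `mapping`, e.g. to introduce a made-up relation word."""
    out = []
    for fact in facts:
        words = fact.rstrip(".").split()
        out.append(" ".join(mapping.get(w, w) for w in words) + ".")
    return out


# Hypothetical three-step story used as the base case.
story = ["B is above A.", "C is above B.", "D is below C."]

variants = {
    # Shorter chain: fewer reasoning steps.
    "depth_2": story[:2],
    # Same statements, different presentation order.
    "shuffled": shuffle_order(story, seed=3),
    # A novel word stands in for a known relation.
    "novel_word": substitute(story, {"above": "blicket"}),
}
```

Comparing model accuracy across such variants, one factor at a time, is what separates a depth failure from an ordering or vocabulary failure.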
Topics
- DecompSR
- Spatial Reasoning
- Compositionality
- Large Language Models
- Benchmark Datasets
- LLM Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.