DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

2025-11-04 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

DecompSR is a new large benchmark dataset and generation framework, comprising over 5 million datapoints, designed to analyze compositional spatial reasoning in Large Language Models (LLMs). This framework allows for independently varying aspects of compositionality, including reasoning depth (productivity), entity and linguistic variability (substitutivity), input order and distractors (overgeneralisation), and novel linguistic elements (systematicity). DecompSR is procedurally constructed to be correct by design, with its accuracy independently verified using a symbolic solver. Benchmarking across various LLMs revealed that these models struggle with productive and systematic generalization in spatial reasoning tasks, though they demonstrate greater robustness to linguistic variation. This dataset offers a provably correct and rigorous tool for fine-grained probing of LLM compositional reasoning.

Key takeaway

For AI scientists and machine learning engineers evaluating LLMs, DecompSR offers a way to diagnose specific weaknesses in compositional spatial reasoning. Because reasoning depth, systematicity, and the other aspects of compositionality can be varied independently, evaluations can go beyond aggregate benchmark scores and target the areas where current LLMs struggle, namely productive and systematic generalization, while accounting for their relative robustness to linguistic variation.

Key insights

DecompSR is a provably correct dataset for fine-grained analysis of LLM compositional spatial reasoning.

Principles

Compositionality has distinct, variable aspects.
LLMs struggle with productive generalization.
Linguistic variation is less challenging for LLMs.

Method

DecompSR is generated procedurally, so each datapoint is correct by construction. A symbolic solver then independently verifies the dataset's correctness.
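
To make "correct by construction, then independently verified" concrete, here is a minimal toy sketch. It is not the DecompSR generator: it assumes a 2D grid with four cardinal relations and an answer expressed as a coordinate offset, and the names DIRECTIONS, generate_chain, and solve are hypothetical. The generator records the answer while sampling the chain, and a separate solver recomputes it from the rendered text alone, so a disagreement would expose a bug in either part.

```python
import random

# Hypothetical relation vocabulary (an assumption, not DecompSR's):
# phrase -> (dx, dy) offset of the subject relative to the object.
DIRECTIONS = {
    "left of": (-1, 0),
    "right of": (1, 0),
    "above": (0, 1),
    "below": (0, -1),
}


def generate_chain(depth, rng):
    """Sample a chain of `depth` relations.

    Returns (statements, question, answer), where `answer` is the offset of
    the last entity relative to the first, accumulated during sampling.
    """
    entities = [f"E{i}" for i in range(depth + 1)]
    statements = []
    total_x, total_y = 0, 0
    for i in range(depth):
        phrase = rng.choice(sorted(DIRECTIONS))
        dx, dy = DIRECTIONS[phrase]
        statements.append(f"{entities[i + 1]} is {phrase} {entities[i]}")
        total_x += dx
        total_y += dy
    question = (entities[-1], entities[0])  # (target, reference)
    return statements, question, (total_x, total_y)


def solve(statements, question):
    """Independent check: parse the text, build a graph, propagate coordinates."""
    edges = {}
    for statement in statements:
        subject, rest = statement.split(" is ", 1)
        for phrase, (dx, dy) in DIRECTIONS.items():
            if rest.startswith(phrase + " "):
                obj = rest[len(phrase) + 1:]
                edges.setdefault(obj, []).append((subject, dx, dy))
                edges.setdefault(subject, []).append((obj, -dx, -dy))
                break
    target, reference = question
    coords = {reference: (0, 0)}
    stack = [reference]
    while stack:
        node = stack.pop()
        x, y = coords[node]
        for neighbour, dx, dy in edges.get(node, []):
            if neighbour not in coords:
                coords[neighbour] = (x + dx, y + dy)
                stack.append(neighbour)
    return coords[target]


if __name__ == "__main__":
    rng = random.Random(0)
    for depth in (1, 5, 20):  # reasoning depth is the productivity axis
        statements, question, answer = generate_chain(depth, rng)
        assert solve(statements, question) == answer
```

Because the solver works on a graph rather than on statement order, it returns the same answer when the statements are shuffled, which is the property an input-order test relies on.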

In practice

Vary reasoning depth in LLM evaluations.
Test LLMs with novel linguistic elements.
Assess LLM robustness to input order.
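
The practices above can be sketched as simple perturbations of one base problem. This is an illustrative sketch under stated assumptions, not DecompSR's tooling: problems are assumed to be stored as (subject, relation, object) triples, the helper names are hypothetical, and the nonsense words in NOVEL_VOCAB are invented here to stand in for novel linguistic elements.

```python
import random

# Both vocabularies are assumptions made for illustration.
PLAIN_VOCAB = {"LEFT": "left of", "RIGHT": "right of",
               "ABOVE": "above", "BELOW": "below"}
NOVEL_VOCAB = {"LEFT": "blick of", "RIGHT": "dax of",
               "ABOVE": "wug of", "BELOW": "fep of"}


def render(triples, vocab):
    """Turn symbolic triples into sentences using the given vocabulary."""
    return [f"{s} is {vocab[r]} {o}" for s, r, o in triples]


def shuffle_order(triples, rng):
    """Return a shuffled copy; the underlying facts are unchanged."""
    shuffled = list(triples)
    rng.shuffle(shuffled)
    return shuffled


def add_distractors(triples, n, rng):
    """Append n facts about fresh entities D0..D{n-1}.

    Each distractor hangs off an existing entity, so it does not change
    the relation between any pair of original entities.
    """
    entities = sorted({e for s, _, o in triples for e in (s, o)})
    extra = [
        (f"D{i}", rng.choice(sorted(PLAIN_VOCAB)), rng.choice(entities))
        for i in range(n)
    ]
    return list(triples) + extra


if __name__ == "__main__":
    rng = random.Random(0)
    base = [("B", "LEFT", "A"), ("C", "ABOVE", "B")]
    variant = shuffle_order(add_distractors(base, 2, rng), rng)
    for line in render(variant, NOVEL_VOCAB):
        print(line)
```

Keeping the problem symbolic until the final render step means each perturbation can be applied on its own or combined with the others, which matches the idea of varying one aspect at a time.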

Topics

DecompSR
Spatial Reasoning
Compositionality
Large Language Models
Benchmark Datasets
LLM Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.