TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?
Summary
TerraBench is a new benchmark for grounded Earth-science reasoning, paired with TerraAgent, a ReAct-style executable framework. It targets a gap between two model families: weather and climate foundation models forecast well but lack interactive language reasoning, while LLMs reason in language but cannot operate on high-dimensional Earth-system data. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning, and simulation in a single interface. It comprises 403 agentic tasks across three tracks and eight application domains, with approximately 24,500 verified execution steps. The benchmark reveals significant limitations in current models: Claude Sonnet 4.6 achieves only 59.2 ToolUseScore and 22.9 Hit@tol, while Qwen3.5-35B trails at 40.0 and 5.9. Failures stem primarily from argument and numeric grounding, with over 84% of numerical answers falling outside acceptable error margins.
Key takeaway
For machine learning engineers building Earth-science agents, this benchmark shows that basic tool access is not enough. Prioritize robust argument grounding and precise numerical output, since current frontier models miss the accepted tolerance on most numeric answers. Focus development on agentic workflow orchestration and on keeping outputs within scientifically acceptable error margins, especially for simulator-grounded tasks, to achieve reliable climate reasoning.
Key insights
Current LLM agents struggle with precise numerical grounding and complex workflow orchestration in Earth-science tasks.
Principles
- Scientific agents need unified evaluation across heterogeneous data.
- Process-level tool-use metrics must pair with tolerance-aware numeric scoring.
- Reliable agents require precise tool parameterization and artifact provenance.
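To make the pairing of process-level and tolerance-aware scoring concrete, here is a minimal Python sketch. It is not the benchmark's definition of ToolUseScore or Hit@tol, which this summary does not give. The function names, the position-by-position step comparison, and the 5% relative tolerance are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# A tool call is modelled as (tool_name, arguments). This shape is an assumption.
Step = Tuple[str, Dict[str, object]]


def tool_step_accuracy(predicted: List[Step], reference: List[Step]) -> float:
    """Fraction of reference steps matched position-by-position on name and arguments.

    Hypothetical process-level metric, not the benchmark's ToolUseScore.
    """
    if not reference:
        return 0.0
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / len(reference)


def within_tolerance(predicted: Optional[float], reference: float,
                     rel_tol: float = 0.05, abs_tol: float = 0.0) -> bool:
    """True when the prediction lies within the allowed error of the reference.

    The allowed error is the larger of rel_tol * |reference| and abs_tol.
    Both defaults are illustrative assumptions.
    """
    if predicted is None:
        return False  # a missing answer can never be a hit
    allowed = max(rel_tol * abs(reference), abs_tol)
    return abs(predicted - reference) <= allowed


def hit_rate(predictions: List[Optional[float]], references: List[float],
             rel_tol: float = 0.05) -> float:
    """Percentage of answers that land inside the tolerance band."""
    hits = sum(within_tolerance(p, r, rel_tol)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

Reporting the two numbers side by side separates an agent that calls the right tools with the wrong arguments from one that follows a plausible workflow yet returns a value outside the accepted margin.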
Method
TerraAgent uses a ReAct-style framework to interleave LLM planning with domain-specialized scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation.
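The interleaving of planning and tool execution can be sketched as a short loop. This is a generic ReAct-style skeleton written under stated assumptions, not TerraAgent's implementation: the `area_mean` tool, the scripted policy that stands in for the LLM, and the action tuple format are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (kind, tool_name, payload): kind is "call" or "finish". Assumed format.
Action = Tuple[str, Optional[str], dict]
# Each history entry pairs the executed call with its observation.
History = List[Tuple[Tuple[str, dict], object]]


def area_mean(values: List[float]) -> float:
    """Hypothetical scientific tool: unweighted mean of a list of values."""
    return sum(values) / len(values)


TOOLS: Dict[str, Callable[..., object]] = {"area_mean": area_mean}


def scripted_policy(history: History) -> Action:
    """Stand-in for the LLM planner: one tool call, then a final answer."""
    if not history:
        return ("call", "area_mean", {"values": [280.0, 282.0, 284.0]})
    last_observation = history[-1][1]
    return ("finish", None, {"answer": last_observation})


def run_react(policy: Callable[[History], Action],
              tools: Dict[str, Callable[..., object]],
              max_steps: int = 5):
    """Alternate between asking the policy for an action and executing it."""
    history: History = []
    for _ in range(max_steps):
        kind, tool_name, payload = policy(history)
        if kind == "finish":
            return payload["answer"], history
        observation = tools[tool_name](**payload)  # act, then observe
        history.append(((tool_name, payload), observation))
    raise RuntimeError("step budget exhausted without a final answer")
```

Calling `run_react(scripted_policy, TOOLS)` returns the final answer together with the recorded history of calls and observations, which is the trace a process-level metric would inspect.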
In practice
- Use TerraBench to evaluate LLM agents on Earth-science tasks.
- Focus agent development on argument grounding and numeric precision.
- Implement artifact-centered design for auditable scientific workflows.
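One way to read "artifact-centered design" is that every tool output is stored as a record naming the tool, its parameters, and the artifacts it was derived from. The sketch below is an assumption about what that could look like, not the paper's design. The `Artifact` class, the 12-character content hash, and the example tool names and parameters are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Artifact:
    artifact_id: str          # short content hash of the provenance record
    tool: str                 # tool that produced this artifact
    params: dict              # exact arguments the tool was called with
    parents: Tuple[str, ...]  # ids of the input artifacts


def make_artifact(tool: str, params: dict,
                  parents: Sequence[str] = ()) -> Artifact:
    """Build an artifact whose id is derived from its own provenance."""
    record = json.dumps(
        {"tool": tool, "params": params, "parents": list(parents)},
        sort_keys=True,  # stable key order so equal records hash equally
    )
    artifact_id = hashlib.sha256(record.encode("utf-8")).hexdigest()[:12]
    return Artifact(artifact_id, tool, params, tuple(parents))


def lineage(artifact: Artifact, registry: Dict[str, Artifact]) -> List[str]:
    """Tool names from the root inputs down to this artifact, for auditing."""
    chain: List[str] = []
    for parent_id in artifact.parents:
        chain.extend(lineage(registry[parent_id], registry))
    chain.append(artifact.tool)
    return chain


# Illustrative two-step workflow; tool names and parameters are made up.
raw = make_artifact("retrieve", {"variable": "t2m"})
clipped = make_artifact("clip", {"region": "sahel"}, parents=[raw.artifact_id])
registry = {a.artifact_id: a for a in (raw, clipped)}
```

Because the id is a hash of the tool, parameters, and parents, repeating the same call yields the same id, and `lineage(clipped, registry)` reconstructs the chain of tools behind a result.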
Topics
- Earth-System Data
- LLM Agents
- Scientific Benchmarking
- Geospatial Reasoning
- Climate Modeling
- Tool-Augmented LLMs
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.