TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

TerraBench is a benchmark for grounded Earth-science reasoning, released together with TerraAgent, a ReAct-style executable framework. It targets a gap between two model families: weather and climate foundation models forecast well but offer no interactive language reasoning, while LLMs reason in language but cannot operate on high-dimensional Earth-system data. TerraBench brings Earth observation imagery, gridded data, GIS reasoning, and simulation under a single interface. It comprises 403 agentic tasks across three tracks and eight application domains, with approximately 24,500 verified execution steps. Current models perform poorly: Claude Sonnet 4.6 reaches only 59.2 ToolUseScore and 22.9 Hit@tol, while Qwen3.5-35B trails at 40.0 and 5.9. Failures stem primarily from argument and numeric grounding, with over 84% of numerical answers falling outside acceptable error margins.

Key takeaway

For machine learning engineers building Earth-science agents, the benchmark shows that tool access alone is not enough. Frontier models still fail at argument grounding and precise numerical output, so these should be the priority. Development effort should go into agentic workflow orchestration and into keeping outputs within scientifically acceptable error margins, especially for simulator-grounded tasks, if climate reasoning is to be reliable.

Key insights

Current LLM agents struggle with precise numerical grounding and complex workflow orchestration in Earth-science tasks.
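To make the numerical-grounding failure mode concrete, here is a minimal sketch of a tolerance-based hit rate. It is only an illustration: this summary does not give the benchmark's actual Hit@tol definition or its tolerances, so the relative tolerance of 5%, the function names `within_tolerance` and `hit_at_tol`, and the treatment of missing answers as misses are all assumptions.

```python
import math


def within_tolerance(predicted, reference, rel_tol=0.05, abs_tol=1e-9):
    """Return True when `predicted` lies within the allowed error of `reference`.

    Assumption: the tolerance is relative to the reference value, with a small
    absolute floor so that references equal to zero remain answerable.
    """
    if predicted is None or not math.isfinite(predicted):
        return False  # a missing or non-finite answer counts as a miss
    allowed_error = max(rel_tol * abs(reference), abs_tol)
    return abs(predicted - reference) <= allowed_error


def hit_at_tol(predictions, references, rel_tol=0.05, abs_tol=1e-9):
    """Percentage of predictions that fall within tolerance of their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        within_tolerance(p, r, rel_tol, abs_tol)
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)


# Toy data (not from the benchmark): two of the four answers are within 5%.
score = hit_at_tol([10.2, 9.0, None, 100.0], [10.0, 10.0, 5.0, 100.0])
print(score)
```

Under a metric of this shape, an agent can call the right tools in the right order and still score zero on a task if a unit mix-up or a wrong argument pushes the final number outside the margin.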

Method

TerraAgent uses a ReAct-style framework to interleave LLM planning with domain-specialized scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation.
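The interleaving of planning and tool calls can be sketched as a small loop. This is not TerraAgent's implementation: the `Step` record, the `run_react` function, the toy tools `load_grid` and `grid_mean`, and the scripted policy that stands in for the LLM are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One planner decision: a thought plus either a tool call or a final answer."""
    thought: str
    tool: Optional[str] = None  # None means the planner is finishing
    args: dict = field(default_factory=dict)
    answer: Optional[float] = None


def run_react(policy, tools, max_steps=5):
    """Alternate planner decisions and tool executions until an answer is given."""
    history = []  # list of (Step, observation) pairs fed back to the planner
    for _ in range(max_steps):
        step = policy(history)
        if step.tool is None:
            return step.answer, history
        if step.tool not in tools:
            observation = f"error: unknown tool {step.tool!r}"
        else:
            try:
                observation = tools[step.tool](**step.args)
            except Exception as exc:  # surface bad arguments to the planner
                observation = f"error: {exc}"
        history.append((step, observation))
    return None, history  # step budget exhausted without an answer


# Hypothetical toy tools standing in for retrieval and geospatial processing.
TOY_GRIDS = {"t2m": [[280.0, 282.0], [284.0, 286.0]]}


def load_grid(variable):
    return TOY_GRIDS[variable]


def grid_mean(grid):
    values = [v for row in grid for v in row]
    return sum(values) / len(values)


def scripted_policy(history):
    """Stand-in for the LLM planner: a fixed three-step plan."""
    if not history:
        return Step("Fetch the temperature grid.", "load_grid", {"variable": "t2m"})
    if len(history) == 1:
        return Step("Average the grid.", "grid_mean", {"grid": history[0][1]})
    return Step("Report the mean.", answer=history[-1][1])


answer, trace = run_react(
    scripted_policy, {"load_grid": load_grid, "grid_mean": grid_mean}
)
print(answer, [step.tool for step, _ in trace])
```

In this sketch, tool errors are returned to the planner as observations instead of aborting the run, which gives the planner a chance to repair a badly grounded argument on the next step.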

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.