TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

TerraBench is a new benchmark for grounded Earth-science reasoning, paired with TerraAgent, a ReAct-style executable framework. It targets a gap between two model families: weather and climate foundation models forecast well but lack interactive language reasoning, while LLMs reason in language but cannot operate on high-dimensional Earth-system data. TerraBench unifies Earth observation imagery analysis, gridded-data analysis, GIS reasoning, and simulation behind a single interface. It comprises 403 agentic tasks across three tracks and eight application domains, with approximately 24,500 verified execution steps. The results expose significant limitations in current models: Claude Sonnet 4.6 achieves only 59.2 ToolUseScore and 22.9 Hit@tol, while Qwen3.5-35B trails at 40.0 and 5.9. Failures stem mainly from argument grounding and numeric grounding, with over 84% of numerical answers falling outside acceptable error margins.

Key takeaway

For Machine Learning Engineers developing Earth-science agents, this benchmark shows that basic tool access is not enough. Prioritize robust argument grounding and precise numerical output, the two areas where current frontier models fail most often. Focus development on agentic workflow orchestration and on keeping outputs within scientifically acceptable error margins, especially for simulator-grounded tasks, if the goal is reliable climate reasoning.

Key insights

Current LLM agents struggle with precise numerical grounding and complex workflow orchestration in Earth-science tasks.

Principles

Scientific agents need unified evaluation across heterogeneous data.
Process-level tool-use metrics must pair with tolerance-aware numeric scoring.
Reliable agents require precise tool parameterization and artifact provenance.
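
To make the second principle concrete, here is a minimal sketch of pairing a process-level check with tolerance-aware numeric scoring. It is not TerraBench's scoring code: the summary does not define ToolUseScore or Hit@tol, so the position-wise step comparison, the relative tolerance of 5%, and the names `step_match_rates` and `hit_at_tol` are all illustrative assumptions.

```python
import math

def step_match_rates(predicted, reference):
    """Position-wise comparison of tool calls (hypothetical process metric).

    Returns (tool_name_rate, tool_and_args_rate) over the reference steps.
    A gap between the two rates points at argument grounding errors:
    the right tool was called with the wrong parameters.
    """
    n = len(reference)
    if n == 0:
        return 0.0, 0.0
    name_hits = 0
    arg_hits = 0
    for i, ref in enumerate(reference):
        if i >= len(predicted):
            break  # missing steps count as misses
        pred = predicted[i]
        if pred["tool"] == ref["tool"]:
            name_hits += 1
            if pred["args"] == ref["args"]:
                arg_hits += 1
    return name_hits / n, arg_hits / n

def hit_at_tol(predictions, references, rel_tol=0.05, abs_tol=0.0):
    """Fraction of answers within a tolerance of the reference value.

    Assumption: tolerance is relative to the reference magnitude.
    The benchmark's actual tolerance rule is not given in this summary.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        if pred is None or not math.isfinite(pred):
            continue  # missing or non-finite answers count as misses
        if abs(pred - ref) <= rel_tol * abs(ref) + abs_tol:
            hits += 1
    return hits / len(references)

# Toy data, invented for illustration only.
reference_steps = [
    {"tool": "load_grid", "args": {"var": "t2m"}},
    {"tool": "area_mean", "args": {"region": "sahel"}},
]
predicted_steps = [
    {"tool": "load_grid", "args": {"var": "t2m"}},
    {"tool": "area_mean", "args": {"region": "sahara"}},  # wrong argument
]
name_rate, arg_rate = step_match_rates(predicted_steps, reference_steps)
numeric_rate = hit_at_tol([1.0, 2.2, None, 10.4], [1.0, 2.0, 3.0, 10.0])
```

Reporting the two step rates side by side with the numeric rate separates "chose the wrong tool" from "chose the right tool but parameterized it wrongly" and from "ran the right workflow but reported an out-of-tolerance number".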

Method

TerraAgent uses a ReAct-style framework to interleave LLM planning with domain-specialized scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation.
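
The interleaving described above can be sketched as a plain loop. This is a generic ReAct-style skeleton under stated assumptions, not TerraAgent's implementation: the tool names `fetch_series` and `mean`, the fixed temperature values, and `scripted_policy` (a deterministic stand-in for the LLM planner) are all hypothetical.

```python
def fetch_series(args):
    # Stand-in for an environmental retrieval tool; returns fixed toy values.
    return [14.0, 15.0, 16.0]

def mean(args):
    # Stand-in for an artifact-backed computation tool.
    values = args["values"]
    return sum(values) / len(values)

TOOLS = {"fetch_series": fetch_series, "mean": mean}

def scripted_policy(history):
    """Deterministic stand-in for the LLM planner.

    A real agent would prompt a model with the task and the history;
    here the plan is hard-coded so the loop can run without a model.
    """
    if not history:
        return ("act", "fetch_series", {"variable": "t2m"})
    if len(history) == 1:
        return ("act", "mean", {"values": history[0]["observation"]})
    return ("finish", None, {"answer": history[-1]["observation"]})

def run_react(policy, tools, max_steps=5):
    """Alternate planning and tool execution until the policy finishes."""
    history = []
    for _ in range(max_steps):
        kind, tool_name, args = policy(history)
        if kind == "finish":
            return args["answer"], history
        if tool_name not in tools:
            # Feed the error back as an observation so the planner can recover.
            observation = f"error: unknown tool {tool_name!r}"
        else:
            observation = tools[tool_name](args)
        history.append(
            {"tool": tool_name, "args": args, "observation": observation}
        )
    return None, history  # step budget exhausted without an answer

answer, trace = run_react(scripted_policy, TOOLS)
```

The `trace` list is what a process-level metric would inspect: each entry records the tool chosen, the arguments passed, and the observation returned.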

In practice

Use TerraBench to evaluate LLM agents on Earth-science tasks.
Focus agent development on argument grounding and numeric precision.
Implement artifact-centered design for auditable scientific workflows.
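
One way to read "artifact-centered design" is that every tool output is stored as a record naming the tool, its arguments, and the artifacts it consumed, so any final number can be traced back to its inputs. The sketch below illustrates that idea only; the in-memory `store`, the 12-character identifiers, and the functions `record_artifact` and `lineage` are assumptions, not part of TerraBench or TerraAgent.

```python
import hashlib
import json

def content_hash(payload):
    """Stable SHA-256 hex digest of a JSON-serializable payload."""
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def record_artifact(store, tool, args, inputs, value):
    """Store a tool output with its provenance and return its identifier.

    The identifier is derived from the tool, arguments, input artifact ids,
    and value, so identical steps map to the same id.
    """
    record = {"tool": tool, "args": args, "inputs": list(inputs), "value": value}
    artifact_id = content_hash(record)[:12]
    store[artifact_id] = record
    return artifact_id

def lineage(store, artifact_id):
    """Return artifact ids from the earliest ancestor down to artifact_id."""
    ordered = []

    def visit(aid):
        for parent in store[aid]["inputs"]:
            visit(parent)
        if aid not in ordered:
            ordered.append(aid)

    visit(artifact_id)
    return ordered

# Toy workflow, invented for illustration only.
store = {}
series_id = record_artifact(
    store, "fetch_series", {"variable": "t2m"}, [], [14.0, 15.0, 16.0]
)
mean_id = record_artifact(store, "mean", {}, [series_id], 15.0)
audit_trail = lineage(store, mean_id)
```

Walking `audit_trail` and printing each record's tool and arguments gives a reviewer the full chain behind a reported value, which is where argument grounding errors become visible.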

Topics

Earth-System Data
LLM Agents
Scientific Benchmarking
Geospatial Reasoning
Climate Modeling
Tool-Augmented LLMs

Code references

Takerdat23/TerraBench

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.