GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Geospatial AI · Depth: Expert, extended

Summary

The GeoNatureAgent Benchmark introduces the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality-level analysis, multi-turn conversation, spatial reasoning, and error handling. The benchmark utilizes an open, self-hostable API serving three environmental indicators (CO2 absorption suitability, gully erosion probability, BigEarthNet V2 land cover) across Spain and Portugal via sixteen tools. Seven large language models (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, and Llama 4 Scout) were evaluated under three temperature-1.0 seeds. Results indicate Claude Sonnet 4 achieved the highest capability at 60.8% ± 0.8%, closely followed by DeepSeek V3.2 at 56.3% ± 3.1%. DeepSeek V3.2 offers 93% of Claude's capability at 11x lower cost (\$0.011/case). Comparison tasks remain universally unsolved (0% on most models), and structured tool calling yields mean accuracies 25–35 percentage points below general-purpose GIS benchmarks. The benchmark, evaluation harness, and API are publicly available.

Key takeaway

For AI Scientists and Machine Learning Engineers developing geospatial agents, recognize that real-world environmental tasks are significantly harder than general GIS benchmarks. You should prioritize open-weight models like DeepSeek V3.2 for cost-sensitive deployments, as they offer competitive capability at substantially lower costs. Be prepared for systematic failures in complex reasoning tasks, such as close-value comparisons, and design your agents with robust error handling for these scenarios.

Key insights

Environmental geospatial agent performance is significantly lower on real-world APIs than general GIS benchmarks, with cost-effective open-weight models emerging.

Principles

Method

The GeoNatureAgent Benchmark evaluates LLM agents using a ReAct-style loop with structured tool calls to a production-style geospatial API, scoring 93 tasks via eight mechanistic checks.

In practice

Topics

Code references

Best for: AI Engineer, AI Architect, Entrepreneur, AI Scientist, Machine Learning Engineer, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.