GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Geospatial AI · Depth: Expert, extended

Summary

The GeoNatureAgent Benchmark introduces the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality-level analysis, multi-turn conversation, spatial reasoning, and error handling. The benchmark utilizes an open, self-hostable API serving three environmental indicators (CO2 absorption suitability, gully erosion probability, BigEarthNet V2 land cover) across Spain and Portugal via sixteen tools. Seven large language models (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, and Llama 4 Scout) were evaluated under three temperature-1.0 seeds. Results indicate Claude Sonnet 4 achieved the highest capability at 60.8% ± 0.8%, closely followed by DeepSeek V3.2 at 56.3% ± 3.1%. DeepSeek V3.2 offers 93% of Claude's capability at 11x lower cost (\$0.011/case). Comparison tasks remain universally unsolved (0% on most models), and structured tool calling yields mean accuracies 25–35 percentage points below general-purpose GIS benchmarks. The benchmark, evaluation harness, and API are publicly available.

Key takeaway

For AI Scientists and Machine Learning Engineers developing geospatial agents, recognize that real-world environmental tasks are significantly harder than general GIS benchmarks. You should prioritize open-weight models like DeepSeek V3.2 for cost-sensitive deployments, as they offer competitive capability at substantially lower costs. Be prepared for systematic failures in complex reasoning tasks, such as close-value comparisons, and design your agents with robust error handling for these scenarios.

Key insights

Environmental geospatial agent performance is significantly lower on real-world APIs than general GIS benchmarks, with cost-effective open-weight models emerging.

Principles

Real-world API interaction is a stricter test for agents.
Cost-accuracy trade-offs favor open-weight models.
Domain-specific benchmarks are crucial for accurate evaluation.

Method

The GeoNatureAgent Benchmark evaluates LLM agents using a ReAct-style loop with structured tool calls to a production-style geospatial API, scoring 93 tasks via eight mechanistic checks.

In practice

Prioritize open-weight models for cost-sensitive deployments.
Expect lower accuracy on environmental geospatial tasks.
Implement robust error handling for comparison tasks.

Topics

LLM Agents
Geospatial Analysis
Environmental Monitoring
Tool Calling
Benchmark Evaluation
Cost-Accuracy Trade-off

Code references

gabrielireland/GeoNatureAgent_Benchmark

Best for: AI Engineer, AI Architect, Entrepreneur, AI Scientist, Machine Learning Engineer, Research Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.