STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics
Summary
STT-Arena, a new benchmark, evaluates large language models' (LLMs) ability to replan and adapt to mid-task disruptions in real-world agentic applications. It features 227 interactive tasks across nine spatio-temporal conflict types and four solvability levels, set in an executable environment with injected spatio-temporal triggers that invalidate ongoing plans. Evaluations show that even state-of-the-art proprietary models, such as Claude-4.6-Opus, achieve less than 40% overall accuracy, indicating significant challenges in spatio-temporal dynamic reasoning. Analysis of failure trajectories identifies three common error modes: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. The researchers also developed STT-Agent-4B, which combines an iterative trajectory refinement technique with online reinforcement learning and surpasses frontier LLMs on the STT-Arena benchmark.
Key takeaway
For research scientists developing agentic LLMs, the STT-Arena benchmark highlights critical gaps in dynamic reasoning and replanning. You should prioritize developing models that can detect state shifts, construct revised execution strategies, and perform post-adaptation verification to overcome common failure modes like stale-state execution and misdiagnosis of dynamic triggers.
Key insights
Even advanced LLMs struggle with adaptive replanning under spatio-temporal dynamics.
Principles
- Real-world agents need dynamic replanning.
- Spatio-temporal conflicts challenge LLMs.
- Iterative refinement improves agent adaptation.
Method
The STT-Agent-4B model uses iterative trajectory refinement to eliminate failure patterns from training data, combined with online reinforcement learning, to improve performance on dynamic tasks.
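The summary does not describe how the refinement is implemented, so the following is only a minimal sketch of the general idea: flag training trajectories that exhibit a known failure pattern, repair them, and repeat until the data is clean. The step format (`action`, `seen`, `env`), the two detectors, and the `repair` callback are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical step format: {"action": str, "seen": int, "env": int}
#   seen = environment version the agent last observed
#   env  = environment version actually in effect at that step
Trajectory = List[Dict]


def stale_state(traj: Trajectory) -> bool:
    """True if any 'act' step relied on an older version than the live one."""
    return any(s["action"] == "act" and s["seen"] < s["env"] for s in traj)


def missing_verification(traj: Trajectory) -> bool:
    """True if the environment changed mid-task but the last step is not 'verify'."""
    changed = any(s["env"] != traj[0]["env"] for s in traj)
    return changed and traj[-1]["action"] != "verify"


DETECTORS = {
    "stale_state": stale_state,
    "missing_verification": missing_verification,
}


def failure_patterns(traj: Trajectory) -> List[str]:
    """Names of all failure patterns detected in one trajectory."""
    return [name for name, detect in DETECTORS.items() if detect(traj)]


def refine_dataset(
    trajs: List[Trajectory],
    repair: Callable[[Trajectory, List[str]], Trajectory],
    max_rounds: int = 3,
) -> List[Trajectory]:
    """Repair flagged trajectories for up to max_rounds; drop those still flagged."""
    clean: List[Trajectory] = []
    pending = list(trajs)
    for _ in range(max_rounds):
        repaired = []
        for traj in pending:
            flags = failure_patterns(traj)
            if flags:
                repaired.append(repair(traj, flags))
            else:
                clean.append(traj)
        pending = repaired
        if not pending:
            break
    # Trajectories repaired in the final round get one last check.
    clean.extend(t for t in pending if not failure_patterns(t))
    return clean


def toy_repair(traj: Trajectory, flags: List[str]) -> Trajectory:
    """Stand-in for a real rewriting step (assumed, e.g. regeneration by a model)."""
    fixed = [dict(s, seen=s["env"]) for s in traj]  # pretend the agent re-observed
    if "missing_verification" in flags:
        last = fixed[-1]["env"]
        fixed.append({"action": "verify", "seen": last, "env": last})
    return fixed
```

In a real pipeline the repair step would presumably be a model-driven rewrite rather than a rule, and the cleaned trajectories would feed the supervised stage that precedes online reinforcement learning; both points are assumptions here.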
In practice
- Focus LLM training on dynamic replanning.
- Implement post-adaptation verification steps.
- Address stale-state execution in agents.
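As a rough illustration of the recommendations above, an agent loop can re-observe the environment before each action, replan when the observed state no longer matches the state the plan was built for, and run a final check once the plan completes. Everything in this sketch (`ToyEnv`, `make_plan`, `run_with_replanning`, the integer version counter) is a hypothetical stand-in and not part of STT-Arena.

```python
from typing import List, Tuple


class ToyEnv:
    """Hypothetical environment whose version bumps once, after a fixed number of actions."""

    def __init__(self, trigger_after: int) -> None:
        self.version = 0
        self.trigger_after = trigger_after
        self.log: List[Tuple[str, int]] = []  # (action, version it ran under)

    def observe(self) -> int:
        return self.version

    def execute(self, action: str) -> None:
        self.log.append((action, self.version))
        if len(self.log) == self.trigger_after:
            self.version += 1  # injected disruption invalidates the current plan


def make_plan(version: int) -> List[str]:
    """Stand-in planner: three steps tagged with the version they assume."""
    return [f"step{i}@v{version}" for i in range(3)]


def run_with_replanning(env: ToyEnv, max_replans: int = 5) -> Tuple[int, bool]:
    """Execute a plan, replanning on state shifts; return (replans, verified)."""
    planned_for = env.observe()
    plan = make_plan(planned_for)
    replans = 0
    i = 0
    while i < len(plan):
        current = env.observe()
        if current != planned_for:
            # State shift detected: discard the stale plan instead of executing it.
            if replans == max_replans:
                raise RuntimeError("replan budget exhausted")
            planned_for = current
            plan = make_plan(planned_for)
            replans += 1
            i = 0
            continue
        env.execute(plan[i])
        i += 1
    # Final verification: does the live state still match the plan's assumptions?
    verified = env.observe() == planned_for
    return replans, verified
```

With `ToyEnv(trigger_after=2)` the loop notices the shift before the third action and replans once. With `ToyEnv(trigger_after=3)` the disruption lands on the last action, so only the final verification step reveals that the completed plan is no longer valid.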
Topics
- STT-Arena
- Spatio-Temporal Dynamics
- Tool-Using LLMs
- Agentic Applications
- Dynamic Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.