STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

STT-Arena is a new benchmark that evaluates how well large language models (LLMs) replan and adapt when a mid-task disruption invalidates their current plan, as happens in real-world agentic applications. It comprises 227 interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task runs in an executable environment where injected spatio-temporal triggers invalidate the ongoing plan. In the reported evaluations, even state-of-the-art proprietary models such as Claude-4.6-Opus score below 40% overall accuracy, which points to a substantial gap in spatio-temporal dynamic reasoning. An analysis of failure trajectories identifies three common error modes: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. The authors also introduce STT-Agent-4B, trained with iterative trajectory refinement combined with online reinforcement learning, which is reported to surpass frontier LLMs on STT-Arena.

Key takeaway

For research scientists developing agentic LLMs, STT-Arena exposes critical gaps in dynamic reasoning and replanning. Prioritize models that detect state shifts, construct revised execution strategies, and verify the outcome after adapting, since these capabilities directly target the observed failure modes: stale-state execution, misdiagnosis of dynamic triggers, and missing post-adaptation verification.
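The three capabilities named in the takeaway can be illustrated with a minimal agent loop. This is a sketch under stated assumptions, not code from STT-Arena or the paper: `Step`, `ToyEnv`, `plan`, `goal_met`, and `run_with_replanning` are hypothetical names, and the room-booking scenario is invented purely for illustration. The loop re-observes the environment before every step (to avoid acting on stale state), replans when a precondition no longer holds, and checks the goal against a fresh observation at the end.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]

@dataclass
class Step:
    name: str
    precondition: Callable[[State], bool]  # must hold in the current state
    apply: Callable[[State], None]         # mutates the environment state

class ToyEnv:
    """Hypothetical environment: room A closes on the second observation."""
    def __init__(self) -> None:
        self.state: State = {"free": {"A": True, "B": True}, "booked": None}
        self.observations = 0

    def observe(self) -> State:
        self.observations += 1
        if self.observations == 2:  # injected trigger that invalidates the plan
            self.state["free"]["A"] = False
        return self.state

def plan(state: State) -> List[Step]:
    """Plan to book the first free room (alphabetical); empty plan if none."""
    for room in sorted(state["free"]):
        if state["free"][room]:
            return [Step(
                name=f"book_{room}",
                precondition=lambda s, r=room: s["free"][r],
                apply=lambda s, r=room: s.__setitem__("booked", r),
            )]
    return []

def goal_met(state: State) -> bool:
    booked = state["booked"]
    return booked is not None and state["free"][booked]

def run_with_replanning(observe, plan, goal_met, max_replans: int = 3) -> Tuple[bool, List[str]]:
    log: List[str] = []
    replans = 0
    steps = plan(observe())
    i = 0
    while i < len(steps):
        state = observe()  # fresh observation before acting
        step = steps[i]
        if not step.precondition(state):
            if replans >= max_replans:
                log.append("abort")
                return False, log
            replans += 1
            log.append(f"replan:{step.name}")
            steps, i = plan(state), 0  # build a revised plan from the new state
            continue
        step.apply(state)
        log.append(f"do:{step.name}")
        i += 1
    ok = goal_met(observe())  # post-adaptation verification
    log.append("verified" if ok else "verification_failed")
    return ok, log

env = ToyEnv()
ok, log = run_with_replanning(env.observe, plan, goal_met)
print(ok, log)
```

In this toy run the initial plan targets room A, the trigger closes it before the step executes, and the loop replans to room B and then verifies the booking. A real agent would replace `plan` with model calls and `observe` with tool invocations.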

Key insights

Even advanced LLMs struggle with adaptive replanning under spatio-temporal dynamics.

Principles

Method

STT-Agent-4B is trained with iterative trajectory refinement, which removes failure patterns from the training data, combined with online reinforcement learning to improve performance on dynamic tasks.
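One way to picture removing failure patterns from training data is a detect-and-revise loop over trajectories. The summary does not describe the authors' actual procedure, so everything below is an assumption made for illustration: trajectories are modeled as lists of event strings, and `detect`, `toy_revise`, and `refine` are hypothetical helpers. Only two of the three reported error modes are checked here; diagnosing a misread trigger would need richer trajectory data than this toy format carries.

```python
from typing import Callable, List, Set

Trajectory = List[str]  # assumed event vocabulary: observe, act, trigger, replan, verify

def detect(traj: Trajectory) -> Set[str]:
    """Flag failure patterns in one trajectory (toy heuristics)."""
    flags: Set[str] = set()
    stale = False         # a trigger fired and no observation followed yet
    needs_verify = False  # a replan happened and no verification followed yet
    for ev in traj:
        if ev == "trigger":
            stale = True
        elif ev == "observe":
            stale = False
        elif ev == "act" and stale:
            flags.add("stale_state_execution")
        elif ev == "replan":
            needs_verify = True
        elif ev == "verify":
            needs_verify = False
    if needs_verify:
        flags.add("missing_post_adaptation_verification")
    return flags

def toy_revise(traj: Trajectory, flags: Set[str]) -> Trajectory:
    """Stand-in for a real revision step (for example, regenerating with a model)."""
    out: Trajectory = []
    for ev in traj:
        out.append(ev)
        if ev == "trigger" and "stale_state_execution" in flags:
            out.append("observe")
    if "missing_post_adaptation_verification" in flags:
        out.append("verify")
    return out

def refine(trajectories: List[Trajectory],
           revise: Callable[[Trajectory, Set[str]], Trajectory],
           max_rounds: int = 3) -> List[Trajectory]:
    """Keep clean trajectories; revise flagged ones and re-check them each round."""
    clean: List[Trajectory] = []
    pending = list(trajectories)
    for _ in range(max_rounds):
        revised: List[Trajectory] = []
        for t in pending:
            flags = detect(t)
            if flags:
                revised.append(revise(t, flags))
            else:
                clean.append(t)
        pending = revised
        if not pending:
            break
    # Trajectories revised in the final round get one last check; failures are dropped.
    clean.extend(t for t in pending if not detect(t))
    return clean

raw = [
    ["observe", "act", "verify"],
    ["observe", "trigger", "act"],
    ["observe", "trigger", "observe", "replan", "act"],
]
refined = refine(raw, toy_revise)
print(refined)
```

The refined set could then serve as supervised training data before an online reinforcement learning stage, though how the two stages are actually combined is not specified in this summary.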

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.