AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

AgentNoiseBench is a framework for systematically evaluating how robust tool-using Large Language Model (LLM) agents are in noisy real-world environments. It divides environmental noise into two categories: user-noise (e.g., ambiguous instructions, topic drift) and tool-noise (e.g., execution failures, incomplete responses). An automated pipeline injects controllable noise into existing agent-centric benchmarks such as $\tau^{2}$-Bench and VitaBench while preserving task solvability. Evaluations across 24 diverse LLMs, including OpenAI's o-series and GPT-5.2, DeepSeek, Anthropic's Claude, and Google's Gemini-2.5, show consistent performance degradation, with an average accuracy drop of 20.8%. The study also finds that reasoning ability does not strongly correlate with noise robustness, and that agents are more sensitive to tool-side noise than to user-side noise, particularly when the noise is injected during the middle stage of a task.
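The two noise categories can be illustrated with a minimal Python sketch. This is not the benchmark's pipeline: `ToolNoiseConfig`, `with_tool_noise`, and `add_user_noise` are hypothetical names, the error payload format is an assumption, and the sketch does not verify that a task remains solvable after injection.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolNoiseConfig:
    """Hypothetical knobs for tool-side noise (not from the paper)."""
    failure_rate: float = 0.0   # probability a call returns an execution error
    truncate_rate: float = 0.0  # probability a response loses some of its fields
    seed: int = 0               # fixed seed so injected noise is reproducible


def with_tool_noise(tool: Callable[..., Dict[str, Any]], cfg: ToolNoiseConfig):
    """Wrap a dict-returning tool so that some calls fail or return partial data."""
    rng = random.Random(cfg.seed)

    def noisy_tool(*args, **kwargs) -> Dict[str, Any]:
        roll = rng.random()
        if roll < cfg.failure_rate:
            # Simulated execution failure: the real tool is never called.
            return {"error": "tool execution failed"}
        result = tool(*args, **kwargs)
        if roll < cfg.failure_rate + cfg.truncate_rate and len(result) > 1:
            # Simulated incomplete response: keep only the first half of the keys.
            keys = sorted(result)
            kept = keys[: max(1, len(keys) // 2)]
            return {k: result[k] for k in kept}
        return result

    return noisy_tool


def add_user_noise(instruction: str, distractor: str) -> str:
    """Simulate topic drift by appending an off-topic remark to the user turn."""
    return f"{instruction} {distractor}"
```

Wrapping a tool rather than editing it keeps the clean and noisy runs comparable: the same task, tool, and agent are used, and only the wrapper's configuration changes.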

Key takeaway

If you are an AI scientist or NLP engineer developing LLM agents, recognize that current models are significantly vulnerable to real-world noise, particularly tool-side issues such as execution failures. Evaluation and training should move beyond idealized assumptions and cover diverse noise types, using robustness-oriented training and trajectory-aware evaluation to build agents that perform reliably in complex, imperfect environments. Prioritize mitigating tool-noise and mid-trajectory disruptions.

Key insights

Across all 24 evaluated models, LLM agents lose significant performance under realistic noise, and tool-side noise such as execution failures hurts more than user-side noise.

Method

AgentNoiseBench uses an automated pipeline to inject user-noise and tool-noise into existing benchmarks, together with a trajectory-aware evaluation protocol that assesses both final outcomes and intermediate reasoning steps.
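As a rough illustration of what trajectory-aware scoring might look like, the sketch below reports the final outcome alongside a per-step score, and compares clean and noisy runs. The `Step` and `Trajectory` types, the `valid` flag (assumed to come from some external step-level judge), and the metric names are all hypothetical; the paper's actual protocol may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    action: str
    valid: bool  # assumed per-step judgment from an external checker


@dataclass
class Trajectory:
    steps: List[Step]
    task_solved: bool  # final-outcome label for the whole task


def evaluate(traj: Trajectory) -> Dict[str, float]:
    """Score one trajectory on both its outcome and its intermediate steps."""
    n = len(traj.steps)
    step_score = sum(s.valid for s in traj.steps) / n if n else 0.0
    # strict_pass requires a solved task AND no invalid intermediate step.
    strict = traj.task_solved and all(s.valid for s in traj.steps)
    return {
        "outcome": float(traj.task_solved),
        "step_score": step_score,
        "strict_pass": float(strict),
    }


def accuracy(trajs: List[Trajectory]) -> float:
    """Fraction of trajectories whose task was solved."""
    return sum(t.task_solved for t in trajs) / len(trajs)


def accuracy_drop(clean: List[Trajectory], noisy: List[Trajectory]) -> float:
    """Absolute drop in outcome accuracy from the clean runs to the noisy runs."""
    return accuracy(clean) - accuracy(noisy)
```

Separating `outcome` from `strict_pass` makes it visible when an agent reaches the right final state despite a flawed intermediate step, which an outcome-only metric would hide.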

Best for: AI Scientist, Research Scientist, NLP Engineer, AI Researcher, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.