AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

2024-05-08 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

AgentNoiseBench introduces a framework for systematically evaluating the robustness of tool-using Large Language Model (LLM) agents in noisy real-world environments. The framework categorizes environmental noise into user-noise (e.g., ambiguous instructions, topic drift) and tool-noise (e.g., execution failures, incomplete responses). An automated pipeline injects controllable noise into existing agent-centric benchmarks such as $\tau^{2}$-Bench and VitaBench while preserving task solvability. Evaluations across 24 diverse LLMs, including OpenAI's o-series and GPT-5.2, DeepSeek, Anthropic's Claude, and Google's Gemini-2.5, reveal consistent performance degradation, with an average accuracy drop of 20.8%. The study finds that reasoning ability does not strongly correlate with noise robustness, and that agents are more sensitive to tool-side noise than to user-side noise, particularly when noise is injected during the middle stage of a task.

Key takeaway

AI scientists and NLP engineers developing LLM agents should recognize that current models are significantly vulnerable to real-world noise, particularly tool-side issues such as execution failures. Evaluation and training paradigms must move beyond idealized assumptions and incorporate diverse noise types, using robustness-oriented training and trajectory-aware evaluation to build agents that perform reliably in complex, imperfect environments. Prioritize mitigating tool-noise and mid-trajectory disruptions.

Key insights

Across all 24 evaluated models, LLM agents degrade consistently under realistic noise (an average accuracy drop of 20.8%), with tool-side failures doing the most damage.

Principles

Real-world noise is critical for agent evaluation.
Injected noise must preserve task solvability.
Evaluate procedural integrity, not just outcomes.
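
The third principle can be illustrated with a minimal sketch. It assumes a trajectory is a list of steps, each already marked as passing or failing some procedural check, and that a run passes only when the final outcome is correct and no step was flagged. The names `Step`, `Verdict`, and `evaluate_trajectory`, and the pass rule itself, are hypothetical illustrations, not the benchmark's actual protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent action in a trajectory (hypothetical schema)."""
    tool: str
    ok: bool  # whether this step passed its procedural check


@dataclass
class Verdict:
    outcome_correct: bool
    violations: int
    passed: bool


def evaluate_trajectory(steps: List[Step], outcome_correct: bool) -> Verdict:
    """Score a run on both its final outcome and its intermediate steps."""
    # Count steps that failed their procedural check.
    violations = sum(1 for s in steps if not s.ok)
    # Assumed rule: a correct outcome reached through a flagged step still fails.
    passed = outcome_correct and violations == 0
    return Verdict(outcome_correct, violations, passed)
```

Under this assumed rule, an agent that reaches the right final state through an invalid intermediate action is not counted as a success, which is the distinction an outcome-only metric would miss.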

Method

AgentNoiseBench uses an automated pipeline to inject user-noise and tool-noise into existing benchmarks, then applies a trajectory-aware evaluation protocol that assesses both final outcomes and intermediate reasoning steps.
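
As a rough illustration of tool-noise injection that keeps a task solvable, the sketch below wraps a tool so that exactly one chosen call returns an error payload while every other call, including a retry, reaches the real tool. The wrapper `with_execution_failure`, the error payload shape, and the toy `get_balance` tool are all assumptions made for this example; the benchmark's own pipeline may work quite differently.

```python
from typing import Any, Callable, Dict


def with_execution_failure(tool: Callable[..., Any], fail_on_call: int) -> Callable[..., Any]:
    """Wrap `tool` so that exactly one call (1-indexed) returns an error payload.

    All other calls pass through unchanged, so an agent that retries can
    still complete the task (hypothetical way to preserve solvability).
    """
    state = {"calls": 0}

    def noisy_tool(*args: Any, **kwargs: Any) -> Any:
        state["calls"] += 1
        if state["calls"] == fail_on_call:
            # Simulated execution failure; the real tool is not invoked.
            return {"error": "tool execution failed"}
        return tool(*args, **kwargs)

    return noisy_tool


def get_balance(account: str) -> Dict[str, Any]:
    """Toy tool standing in for a benchmark API."""
    return {"account": account, "balance": 100}


noisy = with_execution_failure(get_balance, fail_on_call=1)
first = noisy("A1")   # injected failure
second = noisy("A1")  # retry reaches the real tool
```

Making the failure transient rather than permanent is one simple way to keep the noise controllable: the number and position of failures are fixed in advance, and the clean result remains reachable.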

In practice

Tool execution failures are most destructive.
Instruction conflicts are the most harmful user noise.
Agents are most sensitive to mid-stage noise injection.
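
To compare sensitivity across injection stages, each injection point has to be assigned to a stage. A minimal sketch follows, assuming stages are defined by splitting a trajectory into equal thirds; that split and the function name `stage_of` are assumptions for illustration, since the summary does not say how stages are delimited.

```python
def stage_of(turn: int, total_turns: int) -> str:
    """Map a 0-indexed turn to 'early', 'middle', or 'late' by equal thirds."""
    if not 0 <= turn < total_turns:
        raise ValueError("turn out of range")
    third = total_turns / 3
    if turn < third:
        return "early"
    if turn < 2 * third:
        return "middle"
    return "late"
```

Grouping per-task results by this label would then allow accuracy under early, middle, and late injection to be compared directly.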

Topics

LLM Agent Robustness
Environmental Noise Modeling
Tool-Using LLMs
Agent Evaluation Benchmarks
Trajectory-Aware Evaluation

Code references

keven-cyber/agentnoisebench

Best for: AI Scientist, Research Scientist, NLP Engineer, AI Researcher, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.