SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation
Summary
SAFARI (Scaling long-horizon Agentic Fault AttRibution via active Investigation) is a framework for diagnosing failures in autonomous agents that perform complex multi-step, multi-agent tasks. Current methods load the entire execution trajectory into a Large Language Model (LLM) context window; they suffer from attention dilution and fail when a trace exceeds the context limit. SAFARI replaces this linear context loading with a tool-augmented diagnostic loop: the LLM gets a specialized toolbox for reading and searching trajectory segments, plus a persistent Short-Term Memory (STM) for cross-turn reasoning. This decouples diagnostic accuracy from the model's context-window constraints. In the reported experiments, SAFARI improves on state-of-the-art results by 20% on the Who&When dataset within a 1M token budget and by 19% on the TRAIL GAIA subset with a 25K token budget. Notably, it maintains 0.58 precision even when the target fault is 5x beyond the model's native context window, a scenario where traditional evaluators completely fail.
Key takeaway
AI engineers developing or deploying autonomous agents, especially for long-horizon, multi-step tasks, should recognize that fault diagnosis methods which load the full trace into context break down at context window limits. SAFARI demonstrates a viable way around this by combining tool-augmented LLMs with persistent Short-Term Memory. Consider exploring active investigation frameworks to maintain diagnostic precision and attribute faults in systems with extensive execution traces, even when those traces exceed the model's native context.
Key insights
SAFARI uses tool-augmented LLMs and STM to diagnose agent failures beyond context window limits.
Principles
- Decouple diagnostic accuracy from context limits.
- Use specialized tools for trajectory segment access.
- Employ persistent memory for cross-turn reasoning.
Method
SAFARI replaces linear context loading with a tool-augmented diagnostic loop. It equips LLMs with a specialized toolbox to read/search trajectory segments and uses persistent Short-Term Memory for cross-turn reasoning.
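The loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Step`, `TrajectoryToolbox`, `ShortTermMemory`, `investigate`, and the action dictionary format are all hypothetical names chosen for illustration, and `scripted_policy` is a hard-coded stand-in for the LLM that would normally choose the next tool call. The point it shows is that the investigator only ever sees small slices of the trace plus its own notes, never the whole trajectory.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    index: int
    agent: str
    content: str


class TrajectoryToolbox:
    """Hypothetical tools that expose a trace in small pieces."""

    def __init__(self, steps):
        self.steps = steps

    def read(self, start, end):
        # Return only the requested segment, not the full trace.
        return self.steps[start:end]

    def search(self, keyword):
        # Return indices of steps whose text contains the keyword.
        kw = keyword.lower()
        return [s.index for s in self.steps if kw in s.content.lower()]


@dataclass
class ShortTermMemory:
    """Notes that persist across turns of the investigation."""
    notes: list = field(default_factory=list)

    def add(self, note):
        self.notes.append(note)


def investigate(toolbox, policy, max_turns=10):
    """Run the loop: the policy picks a tool, the result is stored as a note."""
    memory = ShortTermMemory()
    observation = None
    for _ in range(max_turns):
        action = policy(memory.notes, observation)
        if action["tool"] == "finish":
            return action["agent"], action["step"], memory.notes
        if action["tool"] == "search":
            observation = toolbox.search(action["keyword"])
        elif action["tool"] == "read":
            observation = toolbox.read(action["start"], action["end"])
        memory.add(f"{action} -> {observation}")
    return None, None, memory.notes  # turn budget exhausted


def scripted_policy(notes, observation):
    """Stand-in for the LLM (assumption): search, read the first hit, blame it."""
    if not notes:
        return {"tool": "search", "keyword": "error"}
    if observation and isinstance(observation[0], int):
        i = observation[0]
        return {"tool": "read", "start": i, "end": i + 1}
    if observation and isinstance(observation[0], Step):
        s = observation[0]
        return {"tool": "finish", "agent": s.agent, "step": s.index}
    return {"tool": "finish", "agent": None, "step": None}


# Toy trace, invented for this example.
trace = [
    Step(0, "planner", "Plan: fetch the report"),
    Step(1, "browser", "Opened page"),
    Step(2, "coder", "Error: parsed wrong column"),
    Step(3, "planner", "Final answer submitted"),
]
agent, step, notes = investigate(TrajectoryToolbox(trace), scripted_policy)
print(agent, step)  # prints: coder 2
```

In a real system the policy would be an LLM call that receives the accumulated notes and the latest observation, so the token cost per turn depends on segment size and note length rather than on total trace length.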
In practice
- Diagnose agent failures in long-horizon tasks.
- Attribute faults in multi-step agentic systems.
- Improve diagnostic precision beyond context limits.
Topics
- Autonomous Agents
- Fault Attribution
- Large Language Models
- Context Window Management
- Tool-Augmented LLMs
- Short-Term Memory
Best for: Research Scientist, AI Architect, AI Scientist, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.