CodeTracer: Towards Traceable Agent States
Summary
CodeTracer is a tracing architecture for debugging complex code agents. It parses heterogeneous run artifacts and reconstructs the full state transition history as a hierarchical trace tree, then performs failure onset localization to pinpoint where an error originates and how it propagates downstream. This addresses the difficulty of observing state transitions and error propagation in multi-stage agent workflows. To enable systematic evaluation, the researchers built CodeTraceBench, a benchmark of 3,326 executed trajectories drawn from four widely used code agent frameworks (SWE-Agent, MiniSWE-Agent, OpenHands, Terminus 2) and five model backbones (Claude-sonnet-4, GPT-5, DeepSeek-V3.2, Qwen3-Coder-480B, Kimi-K2-Instruct) across diverse coding tasks such as bug fixing and refactoring. In experiments, CodeTracer outperforms direct prompting and lightweight baselines at failure localization, reaching up to 48% macro F1 at lower token cost. Under matched budgets, feeding its diagnostic signals back through reflective replay consistently recovers runs that originally failed.
Key takeaway
For Machine Learning Engineers and Research Scientists developing or deploying code agents, knowing where a failure originates is critical. CodeTracer offers a framework for pinpointing the earliest failure-critical stage and the error-relevant steps in complex agent trajectories. Consider integrating hierarchical tracing and diagnostic feedback into your agent development lifecycle to debug more efficiently and to recover from early missteps in a targeted way, rather than relying solely on end-to-end metrics.
Key insights
CodeTracer provides structured tracing and failure localization for complex code agent trajectories, improving debugging and recovery.
Principles
- Error types shift predictably across workflow stages.
- Agent success saturates quickly with iteration budget.
- Architectural complexity does not guarantee proportional success gains.
Method
CodeTracer uses evolving extraction for artifact parsing, tree indexing for hierarchical trace reconstruction, and diagnosis for failure onset localization, with optional reflective replay for recovery.
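To make the tree indexing and failure onset ideas concrete, here is a minimal sketch, not CodeTracer's actual implementation. It assumes a flat list of already-parsed events, each carrying a hypothetical `stage`, `step`, and `ok` field, and it assumes events of the same stage are contiguous. The `ok` flag stands in for whatever diagnosis the real system performs; all class and function names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TraceNode:
    """One node of a hypothetical trace tree: the run, a stage, or a step."""
    name: str
    ok: bool = True  # assumed per-step success flag (stand-in for real diagnosis)
    children: List["TraceNode"] = field(default_factory=list)


def build_tree(events: List[dict]) -> TraceNode:
    """Group a flat, ordered event list into run -> stage -> step."""
    root = TraceNode("run")
    stages: Dict[str, TraceNode] = {}
    for ev in events:
        stage = stages.get(ev["stage"])
        if stage is None:
            # First event of a new stage: attach the stage under the root.
            stage = TraceNode(ev["stage"])
            stages[ev["stage"]] = stage
            root.children.append(stage)
        stage.children.append(TraceNode(ev["step"], ok=ev["ok"]))
    return root


def failure_onset(root: TraceNode) -> Optional[Tuple[str, str]]:
    """Return (stage, step) of the earliest failed step, or None if all passed."""
    for stage in root.children:
        for step in stage.children:
            if not step.ok:
                return stage.name, step.name
    return None


# Illustrative trajectory: the first bad step is in the "edit" stage,
# and later failures are treated as its downstream chain.
events = [
    {"stage": "explore", "step": "list files", "ok": True},
    {"stage": "edit", "step": "patch wrong file", "ok": False},
    {"stage": "edit", "step": "rerun tests", "ok": False},
    {"stage": "verify", "step": "submit", "ok": False},
]
tree = build_tree(events)
onset = failure_onset(tree)
```

In this toy version the onset is simply the first step flagged as failed; the point is only that a stage-level hierarchy lets a diagnosis name both the stage and the step where things first went wrong.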
In practice
- Use stage-aware guardrails to preempt failures.
- Prioritize backbone capability over orchestration complexity.
- Inject diagnostic signals for reflective replay to recover failed runs.
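The last recommendation, injecting diagnostic signals for reflective replay, can be sketched roughly as follows. This is an assumption-laden illustration rather than the paper's procedure: the `Diagnosis` fields, the prompt wording, and the `run_agent(prompt, max_steps)` callable are all hypothetical, and the same `max_steps` is reused to mimic a matched budget.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Diagnosis:
    """Hypothetical output of a failure localization pass."""
    stage: str          # earliest failure-critical stage
    step_index: int     # index of an error-relevant step
    explanation: str    # short description of what went wrong


def build_replay_prompt(task: str, diagnosis: Diagnosis) -> str:
    """Append the diagnostic signal to the original task description."""
    return (
        f"{task}\n\n"
        "A previous attempt failed. Diagnostic feedback:\n"
        f"- Earliest failure-critical stage: {diagnosis.stage}\n"
        f"- Error-relevant step: {diagnosis.step_index}\n"
        f"- Explanation: {diagnosis.explanation}\n"
        "Avoid repeating this misstep."
    )


def reflective_replay(
    task: str,
    diagnosis: Diagnosis,
    run_agent: Callable[[str, int], bool],
    max_steps: int,
) -> bool:
    """Rerun the agent with the diagnosis injected, under the same step budget."""
    return run_agent(build_replay_prompt(task, diagnosis), max_steps)


def stub_agent(prompt: str, max_steps: int) -> bool:
    """Stand-in agent for the demo: 'succeeds' only when it sees feedback."""
    return "Diagnostic feedback" in prompt and max_steps > 0


diag = Diagnosis(stage="edit", step_index=7, explanation="Patched the wrong file.")
recovered = reflective_replay(
    "Fix the failing date parser test.", diag, stub_agent, max_steps=30
)
```

In a real setup, `run_agent` would wrap whichever framework is in use, and the stub would be replaced by an actual agent run evaluated against the task's tests.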
Topics
- CodeTracer
- CodeTraceBench
- Code Agents
- Failure Onset Localization
- Hierarchical Trace Tree
Best for: Machine Learning Engineer, Research Scientist, AI Scientist, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.