CodeTracer: Towards Traceable Agent States

2026-03-17 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

CodeTracer is a tracing architecture for debugging complex code agents. It parses heterogeneous run artifacts and reconstructs the full state transition history as a hierarchical trace tree, then performs failure onset localization to pinpoint where an error originated and how it propagated downstream. This addresses the difficulty of observing state transitions and error propagation in multi-stage agent workflows. To enable systematic evaluation, the researchers built CodeTraceBench, a benchmark of 3,326 executed trajectories from four widely used code agent frameworks (SWE-Agent, MiniSWE-Agent, OpenHands, Terminus 2) and five model backbones (Claude-sonnet-4, GPT-5, DeepSeek-V3.2, Qwen3-Coder-480B, Kimi-K2-Instruct) across coding tasks such as bug fixing and refactoring. In the reported experiments, CodeTracer outperforms direct prompting and lightweight baselines on failure localization, reaching up to 48% macro F1 at lower token cost. When its diagnostic signals are fed back through reflective replay, they consistently recover originally failed runs under matched budgets.
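For readers unfamiliar with the metric, the following is a minimal sketch of how macro F1 is computed: F1 is calculated per label and then averaged with equal weight, so rare labels count as much as common ones. The label names ("ok", "error") and the per-step labelling are assumptions made for illustration; this summary does not state how the paper defines its labels.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores (illustrative helper, not from the paper)."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical per-step labels for one trajectory.
gold = ["ok", "ok", "error", "error"]
pred = ["ok", "error", "error", "error"]
score = macro_f1(gold, pred)
```

Here the "ok" label scores 2/3 and the "error" label scores 4/5, so the macro average is 11/15.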

Key takeaway

For Machine Learning Engineers and Research Scientists who develop or deploy code agents, knowing where a failure originates is critical. CodeTracer offers a framework for pinpointing the earliest failure-critical stage and the error-relevant steps in complex agent trajectories. Consider integrating hierarchical tracing and diagnostic feedback into your agent development lifecycle: this improves debugging efficiency and enables targeted recovery from early missteps, rather than relying solely on end-to-end metrics.

Key insights

CodeTracer provides structured tracing and failure localization for complex code agent trajectories, improving debugging and recovery.

Principles

Error types shift predictably across workflow stages.
Agent success saturates quickly with iteration budget.
Architectural complexity does not guarantee proportional success gains.

Method

CodeTracer uses evolving extraction for artifact parsing, tree indexing for hierarchical trace reconstruction, and diagnosis for failure onset localization, with optional reflective replay for recovery.
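The tree indexing and diagnosis steps described above can be pictured with a small sketch. It assumes the artifacts have already been parsed into flat (stage, action, error) records, groups them into a run → stage → step tree, and returns the earliest step flagged as an error. The `TraceNode` class, the record layout, and both function names are hypothetical and are not CodeTracer's actual data model; in particular, the error flag here stands in for whatever the real diagnosis step decides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TraceNode:
    """One node in a hypothetical hierarchical trace tree."""
    name: str                          # "run", a stage label, or an action
    step_index: Optional[int] = None   # set on leaf steps only
    error: bool = False                # True if the step was flagged as error-relevant
    children: List["TraceNode"] = field(default_factory=list)


def build_trace_tree(steps):
    """Group flat (stage, action, error) records into run -> stage -> step."""
    root = TraceNode(name="run")
    for index, (stage, action, error) in enumerate(steps):
        # Open a new stage node whenever the stage label changes.
        if not root.children or root.children[-1].name != stage:
            root.children.append(TraceNode(name=stage))
        root.children[-1].children.append(
            TraceNode(name=action, step_index=index, error=error)
        )
    return root


def locate_failure_onset(root):
    """Return (stage name, step index) of the earliest flagged step, or None."""
    for stage in root.children:
        for step in stage.children:
            if step.error:
                return stage.name, step.step_index
    return None


# Hypothetical trajectory: the bad edit at step 1 causes the test failure at step 2.
steps = [("explore", "ls", False), ("edit", "apply_patch", True), ("test", "pytest", True)]
onset = locate_failure_onset(build_trace_tree(steps))
```

In this example the onset is the "edit" stage at step 1, even though the visible failure only surfaces later in the "test" stage, which is the distinction failure onset localization is meant to capture.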

In practice

Use stage-aware guardrails to preempt failures.
Prioritize backbone capability over orchestration complexity.
Inject diagnostic signals for reflective replay to recover failed runs.
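As one way to picture the last recommendation, the sketch below folds a diagnosis into the prompt for a retry while keeping the same step budget as the failed run. The function name, the shape of the `diagnosis` dictionary, and the prompt wording are all assumptions made for illustration; the source does not describe how CodeTracer formats its replay input.

```python
def build_replay_prompt(task, diagnosis, max_steps):
    """Compose a retry prompt that carries the diagnostic signals (hypothetical format)."""
    lines = [
        f"Task: {task}",
        f"A previous attempt failed. Diagnosed onset stage: {diagnosis['stage']}.",
        "Error-relevant steps:",
    ]
    # One line per flagged step so the agent can see what to avoid.
    lines += [f"- step {index}: {note}" for index, note in diagnosis["steps"]]
    # Matched budget: the retry gets the same step limit as the original run.
    lines.append(f"Retry within {max_steps} steps and avoid repeating the steps above.")
    return "\n".join(lines)


# Hypothetical diagnosis produced by a tracing step.
diagnosis = {"stage": "edit", "steps": [(1, "patched the wrong file")]}
prompt = build_replay_prompt("fix failing test", diagnosis, max_steps=30)
```

Keeping `max_steps` equal to the original budget is what makes a before-and-after comparison fair: any recovery is then attributable to the injected diagnosis rather than to extra compute.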

Topics

CodeTracer
CodeTraceBench
Code Agents
Failure Onset Localization
Hierarchical Trace Tree

Best for: Machine Learning Engineer, Research Scientist, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.