From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws
Summary
HarnessFix is a novel trace-guided framework designed to diagnose failures in LLM-based agents and repair their underlying harnesses. It addresses the challenge of unreliable agent behavior by compiling raw execution traces and harness code into a Harness-aware Trace Intermediate Representation (HTIR), which normalizes fragmented trajectory evidence and captures step-level provenance. The framework then attributes failures to specific trajectory steps and harness layers, consolidating recurring diagnoses into actionable flaw records. HarnessFix maps these flaws to scoped repair operators, generating and validating harness patches under flaw-specific repair specifications to reduce target flaws without introducing regressions. Evaluated on SWE-Bench Verified, Terminal-Bench 2.0 Verified, GAIA, and AppWorld, HarnessFix improves held-out test performance over initial harnesses by 15.2%–50.0%, consistently outperforming human-designed and self-evolution baselines. It also reveals recurring harness-flaw patterns across the seven ETCLOVG layers (Execution, Tooling, Context, Lifecycle, Observability, Verification, Governance).
Key takeaway
MLOps engineers deploying LLM agents should prioritize trace-guided harness repair over prompt-only optimization. An agent's reliability depends on diagnosing specific runtime failures and applying scoped, validated changes to the harness infrastructure. This approach, exemplified by HarnessFix, improved held-out performance in the reported evaluation by addressing root causes in layers such as Tooling, Lifecycle, and Verification rather than merely nudging model behavior. Use a structured validation process so that broad, untargeted modifications do not introduce regressions.
Key insights
Agent reliability hinges on diagnosing and repairing harness flaws using trace-guided attribution, not just model improvements.
Principles
- Failures stem from the harness, not just the LLM.
- Trace-grounded diagnosis keeps changes targeted rather than broad.
- Scoped repairs reduce regression risk.
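The last principle can be made concrete with a small acceptance gate. This is a minimal sketch, not the paper's validation procedure: `EvalResult`, `accept_patch`, and the task ids are hypothetical names introduced here, and the rule assumed is simply "the targeted flaw must occur less often, and no previously passing task may start failing".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Outcome of one harness version on a validation set (hypothetical shape)."""
    passed_tasks: frozenset[str]  # ids of tasks that succeeded
    target_flaw_count: int        # occurrences of the flaw the patch targets


def accept_patch(before: EvalResult, after: EvalResult) -> bool:
    """Accept only if the target flaw shrinks and no passing task regresses."""
    regressions = before.passed_tasks - after.passed_tasks
    return after.target_flaw_count < before.target_flaw_count and not regressions


# Illustrative data only: a scoped patch versus a broad one that breaks task t2.
baseline = EvalResult(frozenset({"t1", "t2"}), target_flaw_count=3)
scoped = EvalResult(frozenset({"t1", "t2", "t3"}), target_flaw_count=0)
broad = EvalResult(frozenset({"t1", "t3"}), target_flaw_count=0)

scoped_ok = accept_patch(baseline, scoped)
broad_ok = accept_patch(baseline, broad)
```

Both candidate patches remove the target flaw, but only the scoped one passes the gate, because the broad one loses a task that the baseline solved.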
Method
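HarnessFix compiles execution traces and harness code into HTIR, attributes failures to specific trajectory steps and harness layers, consolidates recurring diagnoses into flaw records, maps those flaws to scoped repair operators, generates patches, and validates them against regressions.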
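The consolidation step of this pipeline can be sketched as a simple grouping of per-trace diagnoses. The code below is an illustration under stated assumptions, not HarnessFix's implementation: the `Diagnosis` and `FlawRecord` shapes, the `min_support` threshold, and the sample symptoms are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Diagnosis:
    """One failure attribution: which step of which trace, in which layer."""
    trace_id: str
    step: int
    layer: str
    symptom: str


@dataclass
class FlawRecord:
    """A recurring flaw with the (trace_id, step) evidence that supports it."""
    layer: str
    symptom: str
    evidence: list = field(default_factory=list)


def consolidate(diagnoses, min_support=2):
    """Group diagnoses by (layer, symptom) and keep those seen at least min_support times."""
    groups = defaultdict(list)
    for d in diagnoses:
        groups[(d.layer, d.symptom)].append((d.trace_id, d.step))
    records = [
        FlawRecord(layer, symptom, evidence)
        for (layer, symptom), evidence in groups.items()
        if len(evidence) >= min_support
    ]
    # Most frequently observed flaws first, so repair effort goes there first.
    return sorted(records, key=lambda r: len(r.evidence), reverse=True)


# Illustrative diagnoses only; the symptoms are made up for the example.
diagnoses = [
    Diagnosis("trace-a", 4, "Tooling", "tool output truncated"),
    Diagnosis("trace-b", 7, "Tooling", "tool output truncated"),
    Diagnosis("trace-c", 2, "Lifecycle", "stopped before final answer"),
]
records = consolidate(diagnoses)
```

Here the two Tooling diagnoses merge into one flaw record with two pieces of evidence, while the single Lifecycle diagnosis falls below the assumed support threshold and is left out.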
In practice
- Implement HTIR for unified trace analysis.
- Use ETCLOVG taxonomy for layer-aware diagnosis.
- Apply scoped repair operators for targeted fixes.
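As a starting point for these practices, a trace step can carry its layer and provenance explicitly, and repair operators can be looked up by layer. This is a hedged sketch: the `Layer` enum mirrors the seven ETCLOVG layer names from the summary, but `HTIRStep`, its fields, `REPAIR_OPERATORS`, and the operator names are assumptions made for illustration, not the HTIR schema or operator set from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    """The seven ETCLOVG harness layers, keyed by their initial letter."""
    EXECUTION = "E"
    TOOLING = "T"
    CONTEXT = "C"
    LIFECYCLE = "L"
    OBSERVABILITY = "O"
    VERIFICATION = "V"
    GOVERNANCE = "G"


@dataclass(frozen=True)
class HTIRStep:
    """One normalized trace step with step-level provenance (hypothetical fields)."""
    index: int    # position in the trajectory
    layer: Layer  # harness layer the step is attributed to
    source: str   # harness component that produced the step
    event: str    # normalized description of what happened


# Hypothetical operator names; only some layers are mapped in this sketch.
REPAIR_OPERATORS = {
    Layer.TOOLING: "tooling_scoped_patch",
    Layer.LIFECYCLE: "lifecycle_scoped_patch",
    Layer.VERIFICATION: "verification_scoped_patch",
}


def operator_for(step: HTIRStep) -> Optional[str]:
    """Return the scoped operator for the step's layer, or None if unmapped."""
    return REPAIR_OPERATORS.get(step.layer)


acronym = "".join(layer.value for layer in Layer)
tool_step = HTIRStep(3, Layer.TOOLING, "shell_tool_wrapper", "command output cut off")
gov_step = HTIRStep(5, Layer.GOVERNANCE, "policy_filter", "action blocked")
```

Keying operators by layer keeps each fix confined to the part of the harness that the diagnosis points at; a step whose layer has no mapped operator returns `None` rather than falling back to a broad change.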
Topics
- LLM Agents
- Agent Harnesses
- Failure Diagnosis
- Trace Analysis
- Automated Repair
- ETCLOVG Taxonomy
Code references
- cuga-project/cuga-agent
- harbor-framework/harbor
- MiroMindAI/MiroFlow
- anomalyco/opencode
- SkyworkAI/DeepResearchAgent
Best for: Research Scientist, AI Architect, Machine Learning Engineer, AI Scientist, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.