Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Summary
Retrospective Harness Optimization (RHO) is a self-supervised method that improves an AI agent's performance by optimizing the agent's "harness" (skills, tools, workflows) using only past trajectories, eliminating the need for ground-truth validation sets. RHO selects a diverse coreset of challenging tasks from past experience, re-solves them in parallel, and diagnoses failures using self-validation and self-consistency signals. It then generates candidate harness updates and selects the most effective one through pairwise self-preference. Evaluated across software engineering, technical work, and knowledge work domains, RHO improved the pass rate on SWE-Bench Pro from 59% to 78% in a single optimization round without external grading. The optimized harness changes agent behavior by targeting prior failure modes, and it sustains higher accuracy in long-horizon sessions.
Key takeaway
AI engineers deploying LLM agents in dynamic environments should consider self-supervised harness optimization as a way to improve agent performance continuously without relying on costly labeled validation data. Retrospectively analyzing past agent trajectories makes it possible to identify and address failure modes, leading to more robust and accurate long-horizon task execution. Maintain audit logs and require human approval for sensitive harness edits to reduce the risk of amplifying mistaken preferences or unsafe procedures.
Key insights
AI agents can self-improve their operational harness by retrospectively analyzing past unlabeled trajectories.
Principles
- Harness optimization benefits from balancing task difficulty and diversity.
- Self-validation and self-consistency signals are crucial for effective diagnosis.
- Pairwise self-preference can reliably select effective harness updates.
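The pairwise self-preference principle can be illustrated with a small round-robin comparison. This is a minimal sketch, not the paper's procedure: the function name `select_by_pairwise_preference`, the `prefer` callback (standing in for an agent judging two candidate harness updates), and the choice to query both orderings to dampen position bias are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Sequence


def select_by_pairwise_preference(
    candidates: Sequence[str],
    prefer: Callable[[str, str], int],
) -> int:
    """Return the index of the candidate with the most pairwise wins.

    `prefer(a, b)` is a hypothetical judge: it returns 0 if the first
    argument is preferred and 1 if the second is preferred.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # Ask in both orders so a judge that favors one position
        # does not systematically favor one candidate (an assumption).
        first = prefer(candidates[i], candidates[j])
        second = prefer(candidates[j], candidates[i])
        wins[i if first == 0 else j] += 1
        wins[j if second == 0 else i] += 1
    # Ties resolve to the earliest candidate with the top win count.
    return max(range(len(candidates)), key=lambda k: wins[k])


# Toy judge that prefers the longer description; a real judge would be
# an LLM call comparing two harness updates.
toy_prefer = lambda a, b: 0 if len(a) >= len(b) else 1
best_index = select_by_pairwise_preference(["a", "abc", "ab"], toy_prefer)
```

In a real system the judge would be noisy, so win counts over all pairs are usually more stable than a single-elimination bracket, at the cost of a quadratic number of comparisons.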
Method
RHO selects a diverse, challenging coreset of past tasks, generates parallel rollouts, extracts self-validation and self-consistency signals, then proposes and selects the best harness update via self-preference.
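One way to picture the self-consistency signal is as the agreement rate among parallel rollouts of the same task. The sketch below covers only that signal (not self-validation), and it is an assumption about how such a signal could be computed rather than the paper's definition. The function names, the idea of reducing each rollout to a hashable outcome, and the `threshold` value are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def self_consistency(outcomes: Sequence[Hashable]) -> tuple[Hashable, float]:
    """Return the most common rollout outcome and the fraction that agree."""
    if not outcomes:
        raise ValueError("need at least one rollout outcome")
    answer, freq = Counter(outcomes).most_common(1)[0]
    return answer, freq / len(outcomes)


def flag_inconsistent(
    task_outcomes: dict[str, Sequence[Hashable]],
    threshold: float = 0.6,  # illustrative cutoff, not from the source
) -> list[str]:
    """List task ids whose rollouts agree less often than `threshold`."""
    return sorted(
        task_id
        for task_id, outcomes in task_outcomes.items()
        if self_consistency(outcomes)[1] < threshold
    )


majority, agreement = self_consistency(["A", "A", "B", "A"])
suspect_tasks = flag_inconsistent({"t1": ["A", "A", "A"], "t2": ["A", "B", "C"]})
```

Tasks flagged this way are candidates for diagnosis: low agreement suggests the current harness does not lead the agent to a stable answer.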
In practice
- Implement a Determinantal Point Process (DPP) for coreset selection.
- Use parallel rollouts to generate diagnostic signals.
- Employ agent self-preference for selecting harness updates.
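As an illustration of the first item, the sketch below performs greedy MAP selection under a DPP kernel that combines a per-task difficulty score with embedding similarity. It is a sketch under stated assumptions, not the paper's implementation: the quality-diversity kernel `L[i][j] = q_i * cos(x_i, x_j) * q_j`, the greedy approximation (instead of exact DPP sampling), and all function names are choices made here. Embeddings are assumed to be non-zero vectors.

```python
import math
from typing import Sequence


def build_kernel(embeddings: Sequence[Sequence[float]],
                 difficulty: Sequence[float]) -> list[list[float]]:
    """Kernel L[i][j] = q_i * cosine(x_i, x_j) * q_j (assumed form)."""
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])
    n = len(unit)
    return [
        [difficulty[i] * sum(a * b for a, b in zip(unit[i], unit[j])) * difficulty[j]
         for j in range(n)]
        for i in range(n)
    ]


def determinant(matrix: list[list[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_dpp(kernel: list[list[float]], k: int) -> list[int]:
    """Greedily add the item that maximizes det of the selected submatrix."""
    n = len(kernel)
    selected: list[int] = []
    while len(selected) < min(k, n):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining item is redundant
            break
        selected.append(best)
    return selected


# Tasks 0 and 1 are near-duplicates; task 2 is different but easier.
kernel = build_kernel([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]], [1.0, 0.9, 0.8])
coreset = greedy_dpp(kernel, k=2)
```

The determinant rewards sets that are both high-scoring and dissimilar, which is why the near-duplicate task is skipped in favor of the easier but distinct one. Recomputing determinants from scratch is cubic per candidate, so this naive version only suits small pools.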
Topics
- LLM Agents
- Harness Optimization
- Self-Supervised Learning
- Trajectory Analysis
- SWE-Bench Pro
- Determinantal Point Process
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.