HARBOR: Automated Harness Optimization
Summary
The paper introduces an automated approach to optimizing the "harness" that wraps large language models (LLMs) in long-horizon agent systems. The harness, which includes components such as context compaction, tool caching, and semantic memory, makes up the vast majority of an agent's codebase (e.g., ~98.4% for Claude Code). The paper formalizes automated harness optimization (AHO) as a constrained, noisy Bayesian optimization problem over a mixed-variable, cost-heterogeneous configuration space. As a reference solver it proposes Harbor (Harness Axis-aligned Regularized Bayesian Optimization Routine), which combines a block-additive SAAS (sparse axis-aligned subspace) surrogate, a multi-fidelity cost-aware acquisition function, and TuRBO (trust region Bayesian optimization) trust regions. In a case study on a production coding agent, codex-py, four rounds of manual tuning yielded only one statistically credible net win (17/89 vs. a 15/89 baseline), while an Oracle (the best-of-all-configurations union) reached 81/89, which highlights how much manual tuning leaves on the table.
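To make the Oracle figure concrete, here is a minimal sketch of how a best-of-all-configurations union could be computed. It assumes "union" means the set of tasks solved by at least one evaluated configuration; the configuration names and task IDs are hypothetical toy data, not results from the paper.

```python
from typing import Dict, Set


def oracle_union(solved_by_config: Dict[str, Set[str]]) -> Set[str]:
    """Return the tasks solved by at least one configuration."""
    solved: Set[str] = set()
    for tasks in solved_by_config.values():
        solved |= tasks
    return solved


# Hypothetical toy data: which tasks each harness configuration solved.
solved_by_config = {
    "baseline": {"t1", "t2"},
    "compaction_on": {"t1", "t3"},
    "tool_cache_on": {"t2", "t4", "t5"},
}

# Best single configuration by number of tasks solved.
best_single = max(solved_by_config, key=lambda name: len(solved_by_config[name]))
oracle = oracle_union(solved_by_config)

print("best single:", best_single, len(solved_by_config[best_single]))
print("oracle union:", len(oracle))
```

Under this reading, the Oracle is an upper bound rather than a deployable configuration: no single setting has to solve every task in the union, so the gap between the best single configuration and the Oracle measures how much per-task headroom exists.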
Key takeaway
For NLP engineers and research scientists building long-horizon LLM agents, relying solely on manual harness tuning is inefficient and leaves performance unrealized. Consider automated harness optimization (AHO) frameworks such as Harbor to search the configuration space systematically, especially when the harness exposes many flag-gated features. In the codex-py case study, the gap between the manually tuned result (17/89) and the Oracle (81/89) indicates how much headroom iterative, error-prone manual adjustment can miss.
Key insights
The case study suggests that manual harness tuning captures only a small share of the available gains (17/89 after tuning vs. 81/89 for the Oracle), which motivates automated optimization of LLM agent harnesses.
Principles
- Harness design is a first-class ML problem.
- Automated configuration search dominates manual tuning.
- Net-positive harness features are class-specific subsets.
Method
Harbor formalizes AHO as constrained noisy Bayesian optimization, using a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions for efficient search.
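The idea behind a cost-aware acquisition can be illustrated with a minimal sketch. It uses expected improvement divided by evaluation cost, a common heuristic that is swapped in here for illustration and is not claimed to be Harbor's acquisition function. The posterior means, standard deviations, and costs are hardcoded hypothetical values standing in for a fitted surrogate.

```python
import math


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Expected improvement over `best` (maximization) under a Gaussian posterior."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


def cost_aware_score(mean: float, std: float, best: float, cost: float) -> float:
    """Expected improvement per unit of evaluation cost."""
    return expected_improvement(mean, std, best) / cost


# Hypothetical candidates: (posterior mean, posterior std, evaluation cost).
candidates = {
    "full_benchmark_config": (0.25, 0.05, 4.0),  # promising but expensive
    "cheap_subset_config": (0.22, 0.05, 1.0),    # less promising, cheap to test
}
best_so_far = 0.20  # hypothetical best observed success rate

by_ei = max(candidates, key=lambda k: expected_improvement(*candidates[k][:2], best_so_far))
by_cost = max(candidates, key=lambda k: cost_aware_score(*candidates[k][:2], best_so_far, candidates[k][2]))

print("plain EI picks:", by_ei)
print("cost-aware EI picks:", by_cost)
```

In this toy setup the two rules disagree: plain expected improvement favors the expensive candidate, while the cost-aware score favors the cheap one because it offers more expected gain per unit of budget. That trade-off is what makes cost heterogeneity matter in the problem formulation.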
In practice
- Implement flag-gated features for systematic evaluation.
- Use telemetry counters for warm-start-aware evaluation.
- Prioritize component-internal tuning for improvements.
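The first two practices above can be sketched together. This is an illustrative example only: `HarnessFlags`, `ToolRunner`, and the counter names are hypothetical and do not come from the paper or from any real agent codebase. It shows a feature gated behind a flag, with telemetry counters recording whether the feature actually fired during an evaluation run.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessFlags:
    """Hypothetical flag set; each harness feature can be toggled independently."""
    context_compaction: bool = False
    tool_caching: bool = False
    semantic_memory: bool = False


class ToolRunner:
    """Runs tools, optionally caching results, and counts what happened."""

    def __init__(self, flags: HarnessFlags) -> None:
        self.flags = flags
        self.telemetry: Counter = Counter()
        self._cache: dict = {}

    def run(self, tool, arg):
        key = (tool.__name__, arg)
        if self.flags.tool_caching and key in self._cache:
            self.telemetry["tool_cache_hit"] += 1
            return self._cache[key]
        self.telemetry["tool_call"] += 1
        result = tool(arg)
        if self.flags.tool_caching:
            self._cache[key] = result
        return result


def word_count(text: str) -> int:
    return len(text.split())


cached = ToolRunner(HarnessFlags(tool_caching=True))
cached.run(word_count, "a b c")
cached.run(word_count, "a b c")

uncached = ToolRunner(HarnessFlags())
uncached.run(word_count, "a b c")
uncached.run(word_count, "a b c")

print(dict(cached.telemetry))
print(dict(uncached.telemetry))
```

One plausible reason to pair flags with counters: a cache or memory feature that starts cold may never fire within a short evaluation, so its flag would appear to have no effect. Counters such as the hit count above let an evaluation distinguish "the feature did not help" from "the feature never ran".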
Topics
- Automated Harness Optimization
- Language Model Agents
- Bayesian Optimization
- Harbor Algorithm
- Harness Design
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.