HARBOR: Automated Harness Optimization

2026-04-24 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

Harbor introduces an automated approach to optimizing the "harness" that wraps large language models (LLMs) in long-horizon agent systems. The harness, which includes components such as context compaction, tool caching, and semantic memory, makes up most of an agent's codebase (e.g., ~98.4% for Claude Code). The paper formalizes automated harness optimization (AHO) as a constrained noisy Bayesian optimization problem over a mixed-variable, cost-heterogeneous configuration space. It proposes Harbor (Harness Axis-aligned Regularized Bayesian Optimization Routine) as a reference solver, built on a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions. In a case study on a production coding agent, codex-py, four rounds of manual tuning yielded only one statistically credible net win (17/89 vs. a 15/89 baseline). An Oracle (the union of tasks solved by any evaluated configuration) reached 81/89, which shows how much headroom manual tuning left unused.
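The Oracle figure in the summary is a union over configurations, which can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function name `oracle_union`, the configuration names, and the task IDs are all hypothetical toy values.

```python
from typing import Dict, Set


def oracle_union(solved_by_config: Dict[str, Set[str]]) -> Set[str]:
    """Return the tasks solved by at least one configuration."""
    solved: Set[str] = set()
    for tasks in solved_by_config.values():
        solved |= tasks
    return solved


# Toy data: hypothetical configuration names and task IDs,
# not results from the paper.
solved_by_config = {
    "baseline": {"t1", "t2"},
    "cache_on": {"t2", "t3"},
    "compaction_on": {"t1", "t4", "t5"},
}

oracle = oracle_union(solved_by_config)

# The single configuration that solves the most tasks on its own.
best_single = max(solved_by_config, key=lambda name: len(solved_by_config[name]))
```

Because the union counts a task as solved if any configuration solves it, the Oracle is at least as large as the best single configuration. A wide gap between the two suggests that different tasks favor different configurations.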

Key takeaway

For NLP engineers and research scientists building long-horizon LLM agents, manual harness tuning alone is likely to leave performance on the table: in the codex-py case study, four rounds of it produced one statistically credible net win (17/89 vs. a 15/89 baseline), while the Oracle reached 81/89. Consider automated harness optimization (AHO) frameworks such as Harbor to search the configuration space systematically, especially when working with complex, flag-gated feature spaces. The size of that gap suggests automated search could recover part of the headroom and cut the time spent on iterative, error-prone manual adjustment.

Key insights

Harness configuration has a large effect on LLM agent performance, and manual tuning captured little of it in the case study (17/89 against an Oracle of 81/89), which motivates automated search.

Principles

Harness design is a first-class ML problem.
Automated configuration search dominates manual tuning.
Net-positive harness features are class-specific subsets.

Method

Harbor formalizes AHO as constrained noisy Bayesian optimization, using a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions for efficient search.
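To make "cost-aware acquisition" concrete, the sketch below scores candidates by expected improvement per unit cost, a common heuristic in the Bayesian optimization literature. It is an assumption-laden illustration and not Harbor's actual acquisition function: the surrogate's posterior mean and standard deviation are supplied by hand, and the candidate names, scores, and costs are hypothetical.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Closed-form expected improvement for maximization under a Gaussian posterior."""
    if std <= 0.0:
        # No uncertainty: improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


def cost_aware_score(mean: float, std: float, best: float, cost: float) -> float:
    """Expected improvement per unit evaluation cost (assumed heuristic)."""
    return expected_improvement(mean, std, best) / cost


# Hypothetical candidates: posterior mean/std of the pass rate and relative cost.
candidates = {
    "cheap_probe": {"mean": 0.20, "std": 0.05, "cost": 1.0},
    "full_run": {"mean": 0.22, "std": 0.05, "cost": 10.0},
}
best_so_far = 0.19

chosen = max(
    candidates,
    key=lambda name: cost_aware_score(best=best_so_far, **candidates[name]),
)
```

In this toy setup the full run has the higher raw expected improvement, but the cheap probe wins once its tenfold lower cost is taken into account. That trade-off is what a cost-heterogeneous search space makes necessary.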

In practice

Implement flag-gated features for systematic evaluation.
Use telemetry counters for warm-start-aware evaluation.
Prioritize component-internal tuning for improvements.
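One way to make flag-gated features easy to evaluate systematically is to hold them in an immutable configuration object and generate single-flag variants from a baseline. The sketch below assumes three boolean flags named after the harness components mentioned in the summary; the class name `HarnessFlags`, the field names, and the helper are hypothetical, not taken from the paper or from any agent's codebase.

```python
from dataclasses import dataclass, fields, replace
from typing import Iterator, Tuple


@dataclass(frozen=True)
class HarnessFlags:
    """Hypothetical feature flags for an agent harness."""
    context_compaction: bool = False
    tool_caching: bool = False
    semantic_memory: bool = False


def single_flag_variants(base: HarnessFlags) -> Iterator[Tuple[str, HarnessFlags]]:
    """Yield (flag_name, config) pairs, each flipping exactly one flag of the base."""
    for field in fields(base):
        flipped = not getattr(base, field.name)
        yield field.name, replace(base, **{field.name: flipped})


baseline = HarnessFlags()
variants = dict(single_flag_variants(baseline))
```

Each variant differs from the baseline in one flag, so a change in pass count can be attributed to that feature. A search routine would go further and explore combinations of flags.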

Topics

Automated Harness Optimization
Language Model Agents
Bayesian Optimization
Harbor Algorithm
Harness Design

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.