Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
Summary
Researchers introduce the Interface-Constrained Semi-Markov Decision Process (IC-SMDP) and IC-Q, an asynchronous decentralized Q-learning algorithm, to address workflow learning in multi-agent LLM pipelines. These pipelines operate across organizational boundaries: each agent observes only local functions of a shared artifact and its own private state, and no party has centralized access to joint trajectories. IC-Q coordinates agents by passing a single scalar at each handoff. The study provides what the authors describe as the first finite-sample bound for neural IC-Q under decentralized partial observability, decomposing the error into three terms: neural function-approximation error, an interface representation gap, and a mixing-time residual. Four experiments (a synthetic IC-SMDP with N=10 agents, multi-LLM mathematical reasoning using GPT-4o-mini, multi-agent routing, and multi-agent CPU programming) empirically validate the bound's terms and show that IC-Q can match a centralized oracle, achieving up to 100% routing accuracy and ~80% CPU programming accuracy.
Key takeaway
If you are an AI engineer designing multi-agent LLM systems that span organizational or trust boundaries, this research suggests you may not need joint trajectory access or centralized training. IC-Q lets agents learn workflows and routing policies while exchanging only a scalar at each handoff. Under the paper's assumptions, learning provably converges, and in the reported experiments IC-Q matches centralized oracle performance despite interface-limited observations. The practical lever is interface design: the smaller the representation gap between what an interface exposes and what the task requires, the tighter the bound.
Key insights
The IC-SMDP framework and IC-Q algorithm enable provably convergent decentralized workflow learning for LLM agents with limited, scalar-based coordination.
Principles
- Decentralized agents can coordinate via single scalar values.
- Interface constraints introduce a quantifiable representation gap.
- Finite-sample bounds decompose into distinct error sources.
Method
IC-Q is an asynchronous Q-learning algorithm in which each agent updates its local Q-network using only its local experience plus a single scalar: the maximum target value computed from other agents' current Q-networks.
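The update described above can be illustrated with a minimal sketch. This is not the paper's implementation: it swaps the neural Q-networks for lookup tables, runs the two agents in a fixed order rather than asynchronously, and uses hypothetical names (`LocalAgent`, `handoff_scalar`, `update`) and a made-up two-agent task. It only shows the shape of the idea, namely that the upstream agent bootstraps from one number reported by the downstream agent and never sees that agent's state or Q-values directly.

```python
from collections import defaultdict


class LocalAgent:
    """Tabular stand-in for one agent's local Q-network (hypothetical names)."""

    def __init__(self, n_actions, alpha=0.5, gamma=0.9):
        self.n_actions = n_actions
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # per-step discount factor
        # Q-values keyed by the agent's local observation only.
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def handoff_scalar(self, local_obs):
        """The single number shared across the boundary: max_a Q(local_obs, a)."""
        return max(self.q[local_obs])

    def update(self, local_obs, action, reward, next_scalar, duration=1):
        """Q-learning step bootstrapped from a scalar supplied by another agent.

        `duration` is how many time steps the action took (semi-Markov
        discounting). Pass next_scalar=0.0 when the workflow terminates.
        """
        target = reward + (self.gamma ** duration) * next_scalar
        td_error = target - self.q[local_obs][action]
        self.q[local_obs][action] += self.alpha * td_error
        return td_error


# Toy two-agent pipeline (assumed for illustration): A hands off to B,
# and B finishes the task and collects the terminal reward.
agent_a = LocalAgent(n_actions=2)
agent_b = LocalAgent(n_actions=2)

for _ in range(200):
    # B acts on its own view of the artifact; the episode ends afterwards.
    agent_b.update("draft", action=1, reward=1.0, next_scalar=0.0)
    # A receives only the scalar that B reports at the handoff.
    scalar_from_b = agent_b.handoff_scalar("draft")
    agent_a.update("task", action=0, reward=0.0,
                   next_scalar=scalar_from_b, duration=2)
```

In this toy setup, agent A's estimate for handing off settles at the discounted value of B's reward, even though A never observes B's reward or internal state. Whether a single scalar suffices in a real pipeline depends on the interface representation gap discussed in the summary.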
In practice
- Design agent interfaces to minimize the representation gap.
- Use IC-Q for multi-LLM routing without joint trajectories.
- Apply IC-Q to learn optimal workflows for specialized agents.
Topics
- Multi-agent LLM Systems
- Decentralized Reinforcement Learning
- Q-learning Algorithms
- Semi-Markov Decision Processes
- Finite-sample Convergence
- Interface Constraints
Best for: Research Scientist, AI Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.