Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Researchers introduce the Interface-Constrained Semi-Markov Decision Process (IC-SMDP) and IC-Q, an asynchronous decentralized Q-learning algorithm, for workflow learning in multi-agent LLM pipelines. Such pipelines often span organizational boundaries: each agent observes only local functions of a shared artifact together with its own private state, and no party has centralized access to joint trajectories. IC-Q coordinates agents by passing a single scalar at each handoff. The study provides what it describes as the first finite-sample bound for neural IC-Q under decentralized partial observability, decomposing the error into three terms: neural function-approximation error, an interface representation gap, and a mixing-time residual. Four experiments (a synthetic IC-SMDP with N=10 agents, multi-LLM mathematical reasoning using GPT-4o-mini, multi-agent routing, and multi-agent CPU programming) empirically validate the terms of the bound and show IC-Q matching a centralized oracle, with up to 100% routing accuracy and ~80% CPU programming accuracy.

Key takeaway

AI engineers designing multi-agent LLM systems that span organizational or trust boundaries may no longer need joint trajectory access or centralized training. This research shows that IC-Q lets agents learn workflow and routing policies with only scalar coordination at handoffs. The method comes with a convergence guarantee and matched a centralized oracle in the reported experiments, even with interface-limited observations. Because the error bound includes an interface representation gap, agent interfaces should be designed to keep that gap small.

Key insights

The IC-SMDP framework and IC-Q algorithm enable provably convergent decentralized workflow learning for LLM agents with limited, scalar-based coordination.

Principles

Method

IC-Q is an asynchronous Q-learning algorithm in which each agent updates its local Q-network from its own local experience plus a single scalar received at the handoff: the maximum target value computed from other agents' current Q-networks.
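The update described above can be illustrated with a minimal sketch. This is not the paper's implementation: it replaces the neural Q-networks with lookup tables, uses a hypothetical two-agent toy task, and assumes a semi-Markov discount of the form gamma ** duration. The names `LocalAgent`, `toy_reward`, and `train` are invented for illustration. The point it tries to show is that the upstream agent never sees the downstream agent's state, actions, or rewards, only one scalar per handoff.

```python
import random
from collections import defaultdict


class LocalAgent:
    """Hypothetical agent; a table stands in for its local Q-network."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.95, epsilon=0.2):
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def act(self, obs, rng):
        # Epsilon-greedy choice over the agent's own local observation.
        if rng.random() < self.epsilon:
            return rng.randrange(self.n_actions)
        row = self.q[obs]
        return max(range(self.n_actions), key=row.__getitem__)

    def max_value(self, obs):
        # The only quantity that crosses the interface: one scalar.
        return max(self.q[obs])

    def update(self, obs, action, reward, duration, handoff_scalar):
        # Assumed SMDP-style target: local reward plus the discounted scalar.
        target = reward + (self.gamma ** duration) * handoff_scalar
        td_error = target - self.q[obs][action]
        self.q[obs][action] += self.alpha * td_error
        return td_error


def toy_reward(obs, action):
    # Made-up task: route 1 can earn 1.0 downstream, route 0 at most 0.2.
    if obs == 1:
        return 1.0 if action == 1 else 0.0
    return 0.2


def train(episodes=5000, seed=0):
    rng = random.Random(seed)
    upstream, downstream = LocalAgent(2), LocalAgent(2)
    for _ in range(episodes):
        route = upstream.act("start", rng)
        obs_down = route  # downstream sees only this local view of the artifact
        # Handoff: downstream reports a single scalar from its current table.
        scalar = downstream.max_value(obs_down)
        upstream.update("start", route, reward=0.0, duration=1,
                        handoff_scalar=scalar)
        # Downstream then learns from its own local experience (terminal step).
        action = downstream.act(obs_down, rng)
        downstream.update(obs_down, action, toy_reward(obs_down, action),
                          duration=1, handoff_scalar=0.0)
    return upstream, downstream


if __name__ == "__main__":
    up, down = train()
    print("upstream Q:", up.q["start"])
    print("downstream Q on route 1:", down.q[1])
```

In this toy setting the upstream agent receives no reward of its own, so any preference it develops for the better route comes entirely from the scalar passed back at the handoff.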

In practice

Best for: Research Scientist, AI Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.