DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

2026-06-06 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

DICE, a framework for Entropy-Regularized Equilibrium Selection, targets the instability of multi-agent large language model (LLM) systems, which often underperform a single strong model with best-of-N sampling. The paper traces this instability to ill-posed equilibrium selection: without a well-defined equilibrium to converge to, agents oscillate and drift, which produces unstable learning and Bayesian regret that grows linearly. DICE introduces the Heterogeneous Quantal Response Equilibrium (HQRE), an entropy-regularized solution concept with agent- and state-dependent temperatures. Under a monotonicity condition, the HQRE is unique, admits linearly convergent mirror updates, and guarantees bounded Bayesian regret. The framework is instantiated in two algorithms: DICE-PC, which coordinates frozen models through prompt-control actions, and DICE-FT, which performs parameter-efficient mirror fine-tuning. Across eleven benchmarks in four domains, DICE improves accuracy-cost trade-offs; on reasoning and planning tasks, DICE-PC improves by 4.3 percentage points on average and DICE-FT by 8.5 points.
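For readers who want the solution concept pinned down, here is a minimal statement of the HQRE fixed-point condition, assuming the standard logit form of quantal response extended with per-agent, per-state temperatures. The notation (the action value Q_i of agent i against the other agents' policies, the temperature tau_i(s), the Shannon entropy H) is ours and may differ from the paper's. Each agent plays a softmax of its action values, which is the same as best-responding to an entropy-regularized objective:

```latex
\pi_i^{*}(a_i \mid s)
  = \frac{\exp\!\big(Q_i^{\pi^*}(s, a_i) / \tau_i(s)\big)}
         {\sum_{a_i'} \exp\!\big(Q_i^{\pi^*}(s, a_i') / \tau_i(s)\big)}
\quad\Longleftrightarrow\quad
\pi_i^{*}(\cdot \mid s) \in \arg\max_{p \in \Delta(\mathcal{A}_i)}
  \; \mathbb{E}_{a_i \sim p}\big[Q_i^{\pi^*}(s, a_i)\big] + \tau_i(s)\,\mathcal{H}(p)
```

As tau_i(s) goes to 0 the condition reduces to ordinary best responses, where several equilibria can coexist; a positive, agent- and state-dependent temperature smooths each response, which is what makes a unique selection possible under the monotonicity condition.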

Key takeaway

If you are a machine learning engineer building multi-agent LLM systems and see unstable or sub-optimal coordination, DICE is worth investigating. It offers a principled route to stable multi-agent performance by making equilibrium selection well-posed. Using DICE-PC (prompt control over frozen models) or DICE-FT (parameter-efficient fine-tuning) can improve your accuracy-cost trade-offs, with reported gains of 4.3 to 8.5 percentage points on reasoning and planning tasks.

Key insights

Multi-agent LLM instability can be resolved by well-posed, entropy-regularized equilibrium selection using HQRE.

Principles

Ill-posed equilibrium selection causes multi-agent LLM instability.
Under a monotonicity condition, HQRE is unique and yields stable coordination with bounded Bayesian regret.
Monotonicity enables linearly convergent updates and stability diagnostics.
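The link between monotonicity and linear convergence can be seen on a toy problem. Below is a minimal sketch, assuming a single-state, two-player zero-sum matrix game (a standard monotone setting) with a different temperature for each player. It substitutes a magnetic-mirror-descent-style entropic update with a uniform magnet for the paper's mirror updates, which this summary does not specify, and the helper names (`solve_hqre`, `mirror_step`, `qre_residual`) are hypothetical. The fixed-point residual doubles as a simple stability diagnostic: under linear convergence it shrinks by a roughly constant factor per step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def payoffs(A, x, y):
    """Expected payoff per action in a zero-sum game where the row player receives A[a][b]."""
    q_row = [sum(A[a][b] * y[b] for b in range(len(y))) for a in range(len(x))]
    q_col = [-sum(A[a][b] * x[a] for a in range(len(x))) for b in range(len(y))]
    return q_row, q_col


def qre_residual(A, x, y, tau_row, tau_col):
    """Max distance between each policy and its logit response softmax(Q / tau)."""
    q_row, q_col = payoffs(A, x, y)
    bx = softmax([q / tau_row for q in q_row])
    by = softmax([q / tau_col for q in q_col])
    return max(abs(u - v) for u, v in zip(x + y, bx + by))


def mirror_step(p, q_values, tau, eta):
    """Entropic mirror step toward a uniform magnet: p_new ∝ (p * exp(eta * Q)) ** (1 / (1 + eta * tau)).

    Its fixed point is exactly the logit response softmax(Q / tau).
    """
    logits = [(math.log(pa) + eta * qa) / (1.0 + eta * tau) for pa, qa in zip(p, q_values)]
    return softmax(logits)


def solve_hqre(A, tau_row, tau_col, eta=0.05, iters=3000):
    """Simultaneous mirror steps from uniform play; returns final policies and residual history."""
    x = [1.0 / len(A)] * len(A)
    y = [1.0 / len(A[0])] * len(A[0])
    residuals = []
    for _ in range(iters):
        # Both players respond to the same current profile (simultaneous update).
        q_row, q_col = payoffs(A, x, y)
        x, y = mirror_step(x, q_row, tau_row, eta), mirror_step(y, q_col, tau_col, eta)
        residuals.append(qre_residual(A, x, y, tau_row, tau_col))
    return x, y, residuals
```

For example, with `A = [[2.0, -1.0], [-1.0, 1.0]]`, `tau_row=0.5` and `tau_col=1.0`, the residual history falls geometrically, and `(res[200] / res[100]) ** 0.01` gives an empirical per-step contraction factor below 1. The exponent `1 / (1 + eta * tau)` pulls each policy toward uniform by an amount set by that player's own temperature, which is how the heterogeneous temperatures enter the update.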

Method

DICE-PC coordinates frozen models via prompt-control; DICE-FT uses parameter-efficient mirror fine-tuning to achieve HQRE.
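To make "prompt-control actions" concrete, here is a minimal sketch of how a controller might choose a prompt-control action for one frozen agent from a logit response with an agent- and state-dependent temperature. Everything specific is hypothetical: the action menu `PROMPT_CONTROLS`, the disagreement-based rule in `state_temperature`, and the value estimates passed as `q_values` (which in practice would have to come from some verifier or reward signal) are illustrative assumptions, not DICE-PC's published design. Only the action choice changes; the models' weights are never touched.

```python
import math
import random

# Hypothetical prompt-control menu; the summary does not list DICE-PC's actual action set.
PROMPT_CONTROLS = ["answer_directly", "reason_step_by_step", "critique_peer", "abstain"]


def state_temperature(base_tau, disagreement):
    """Hypothetical tau_i(s): raise agent i's temperature when peers disagree (disagreement in [0, 1])."""
    return base_tau * (1.0 + disagreement)


def quantal_response(q_values, tau):
    """Logit response softmax(Q / tau), computed stably."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_control(q_values, base_tau, disagreement, rng):
    """Sample one prompt-control action for a frozen agent and return it with the full policy."""
    tau = state_temperature(base_tau, disagreement)
    probs = quantal_response(q_values, tau)
    action = rng.choices(PROMPT_CONTROLS, weights=probs, k=1)[0]
    return action, probs
```

Raising `disagreement` from 0 to 1 doubles the temperature, so the policy over prompt controls spreads out (higher `entropy`), while a calm state concentrates on the highest-valued control.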

In practice

DICE-PC improves reasoning/planning by 4.3 percentage points.
DICE-FT improves reasoning/planning by 8.5 percentage points.
Improves accuracy-cost trade-offs across eleven benchmarks in four domains.

Topics

Multi-Agent LLMs
Equilibrium Selection
HQRE
Prompt-Control
Parameter-Efficient Fine-Tuning
Markov Games

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.