MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
Summary
MARS² (Multi-Agent Reinforced Tree-Search Scaling) is a unified reinforcement learning (RL) framework that improves code generation by placing multiple independently optimized agents in a shared tree-structured search environment. Existing RL methods for reasoning tasks such as code generation often suffer from limited trajectory diversity, while search-enhanced RL is constrained by the policy prior of a single agent. MARS² addresses both limitations by modeling the search tree as a learnable multi-agent interaction environment in which heterogeneous agents collaboratively generate and refine solutions. For credit assignment across complex search trajectories, the framework introduces a path-level group advantage formulation with tree-consistent reward shaping. Experiments on code generation benchmarks show that MARS² consistently improves performance across model combinations and training settings, supporting the case for combining multi-agent collaboration with tree search in RL.
Key takeaway
For AI engineers and research scientists building code generation systems, MARS² offers a way to move past the limits of single-agent RL by combining multi-agent collaboration with tree search. Consider it when you need more diverse trajectories and better overall performance on reasoning-intensive tasks. The code is publicly available at https://github.com/TsinghuaC3I/MARTI for implementation and experimentation.
Key insights
Multi-agent collaboration within a shared tree search environment enhances RL performance for code generation.
Principles
- Diverse exploration improves RL performance.
- Multi-agent interaction can provide diverse signals.
- Reward shaping aids credit assignment in complex trajectories.
Method
MARS² models a search tree as a multi-agent environment, where agents collaboratively refine solutions. It uses a path-level group advantage with tree-consistent reward shaping for credit assignment.
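The summary does not give the exact formulas, so the following is only a minimal sketch of what a path-level group advantage with reward shaping could look like. It assumes each tree node carries a scalar reward (for example, the fraction of unit tests passed), treats every root-to-leaf path as one member of a group, and normalizes path returns across that group. The names `Node`, `shaped_path_return`, `shaping_weight`, and `node_advantages` are hypothetical, and the specific shaping rule (a bonus for improving on the parent node) is an assumption rather than the authors' definition.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Dict, List, Tuple


@dataclass
class Node:
    agent: str      # label of the agent that produced this node (hypothetical)
    reward: float   # assumed scalar score, e.g. fraction of tests passed
    children: List["Node"] = field(default_factory=list)


def root_to_leaf_paths(root: Node) -> List[List[Node]]:
    """Enumerate every root-to-leaf path as a list of nodes."""
    if not root.children:
        return [[root]]
    return [[root] + tail
            for child in root.children
            for tail in root_to_leaf_paths(child)]


def shaped_path_return(path: List[Node], shaping_weight: float = 0.5) -> float:
    """Leaf reward plus a bonus for each step that improves on its parent.

    This shaping rule is an illustrative assumption, not the paper's formula.
    """
    improvement = sum(max(0.0, child.reward - parent.reward)
                      for parent, child in zip(path, path[1:]))
    return path[-1].reward + shaping_weight * improvement


def path_group_advantages(root: Node, eps: float = 1e-8) -> List[Tuple[List[Node], float]]:
    """Normalize shaped path returns across all paths of one tree (the group)."""
    paths = root_to_leaf_paths(root)
    returns = [shaped_path_return(p) for p in paths]
    mu, sigma = mean(returns), pstdev(returns)
    return [(p, (r - mu) / (sigma + eps)) for p, r in zip(paths, returns)]


def node_advantages(scored_paths: List[Tuple[List[Node], float]]) -> Dict[int, float]:
    """Give each node the mean advantage of all paths that pass through it."""
    totals: Dict[int, Tuple[float, int]] = {}
    for path, adv in scored_paths:
        for node in path:
            total, count = totals.get(id(node), (0.0, 0))
            totals[id(node)] = (total + adv, count + 1)
    return {key: total / count for key, (total, count) in totals.items()}


# Toy tree: two agents refine a root draft along two branches.
leaf = Node("agent_b", 1.0)
root = Node("agent_a", 0.2, [Node("agent_a", 0.5, [leaf]), Node("agent_b", 0.2)])
scored = path_group_advantages(root)
per_node = node_advantages(scored)
```

Averaging over the paths that share a node is one simple way to keep credit consistent for shared prefixes: a node that leads to both strong and weak leaves receives an intermediate advantage, while a node that appears only on a strong path inherits that path's full advantage.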
In practice
- Apply MARS² to complex code generation tasks.
- Explore multi-agent RL to diversify solution exploration.
- Use tree-consistent reward shaping for credit assignment.
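To make the idea of several agents growing one shared search tree concrete, here is a small illustrative sketch. It is not the MARS² algorithm or the MARTI API: the round-robin agent schedule, the mostly-greedy parent selection, and the names `expand_tree`, `score_fn`, and `greedy_prob` are all assumptions chosen for brevity. Agents are modeled as plain functions that map a parent solution to a refined one, standing in for separately trained policies.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    solution: str
    score: float
    agent: Optional[str] = None   # None marks the root
    children: List["Node"] = field(default_factory=list)


# Hypothetical agent type: takes a parent solution, returns a refinement.
Agent = Callable[[str], str]


def expand_tree(root_solution: str,
                agents: Dict[str, Agent],
                score_fn: Callable[[str], float],
                budget: int,
                greedy_prob: float = 0.8,
                seed: int = 0) -> Tuple[Node, List[Node]]:
    """Grow a shared tree by letting agents take turns refining nodes."""
    rng = random.Random(seed)
    root = Node(root_solution, score_fn(root_solution))
    nodes = [root]
    names = sorted(agents)
    for step in range(budget):
        # Round-robin schedule so every agent contributes nodes (assumption).
        name = names[step % len(names)]
        # Usually refine the best node so far, sometimes a random one.
        if rng.random() < greedy_prob:
            parent = max(nodes, key=lambda n: n.score)
        else:
            parent = rng.choice(nodes)
        candidate = agents[name](parent.solution)
        child = Node(candidate, score_fn(candidate), agent=name)
        parent.children.append(child)
        nodes.append(child)
    return root, nodes


# Toy stand-ins: each "agent" appends a character, the score counts "a"s.
toy_agents = {"agent_a": lambda s: s + "a", "agent_b": lambda s: s + "b"}
toy_score = lambda s: min(1.0, s.count("a") / 3)
tree_root, all_nodes = expand_tree("", toy_agents, toy_score, budget=6)
```

Because any agent may extend a node produced by another agent, a single root-to-leaf path can mix contributions from different policies, which is the property that makes per-path credit assignment necessary.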
Topics
- MARS²
- Multi-Agent Tree Search
- Reinforcement Learning
- Code Generation
- Reward Shaping
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.