MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

MARS² (Multi-Agent Reinforced Tree-Search Scaling) is a unified reinforcement learning (RL) framework that improves code generation by integrating multiple independently optimized agents within a shared tree-structured search environment. Existing RL methods for reasoning tasks such as code generation often suffer from limited trajectory diversity, while search-enhanced RL is constrained by the policy prior of a single agent. MARS² addresses both limitations by modeling the search tree as a learnable multi-agent interaction environment in which heterogeneous agents collaboratively generate and refine solutions. The framework introduces a path-level group advantage formulation with tree-consistent reward shaping to support credit assignment across complex search trajectories. Experimental results on code generation benchmarks show that MARS² consistently improves performance across various model combinations and training settings, supporting the benefit of combining multi-agent collaboration with tree search in RL.

Key takeaway

For AI engineers and research scientists building code generation systems, MARS² offers a way to overcome the limitations of single-agent RL by combining multi-agent collaboration with tree search. The framework is worth exploring if you want to improve trajectory diversity and overall performance on reasoning-intensive tasks. The publicly available code at https://github.com/TsinghuaC3I/MARTI provides a starting point for implementation and experimentation.

Key insights

Multi-agent collaboration within a shared tree search environment enhances RL performance for code generation.
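To make the idea of a shared search tree concrete, here is a minimal sketch in which several agents take turns expanding the same tree, so that a candidate proposed by one agent can be refined by another. Everything in it is an illustrative assumption rather than the MARS² implementation: the `Node` class, the `expand_tree` and `collect_leaves` helpers, the round-robin agent assignment, and the toy string-building agents are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical agent interface: receives the parent's candidate (None at the
# root) and returns a new or refined candidate.
Agent = Callable[[Optional[str]], str]


@dataclass
class Node:
    code: Optional[str]            # candidate solution stored at this node
    agent_id: Optional[int]        # which agent produced it (None for the root)
    reward: float = 0.0            # score assigned by the reward function
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


def expand_tree(agents: List[Agent], reward_fn: Callable[[str], float],
                depth: int, width: int) -> Node:
    """Grow a tree level by level; agents are assigned round-robin (an assumption)."""
    root = Node(code=None, agent_id=None)
    frontier = [root]
    step = 0
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for _ in range(width):
                agent_id = step % len(agents)
                step += 1
                code = agents[agent_id](node.code)
                child = Node(code=code, agent_id=agent_id,
                             reward=reward_fn(code), parent=node)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def collect_leaves(root: Node) -> List[Node]:
    """Return all leaf nodes, i.e. the endpoints of root-to-leaf paths."""
    stack, leaves = [root], []
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:
            leaves.append(node)
    return leaves


# Toy stand-ins for two heterogeneous policies.
def agent_a(parent: Optional[str]) -> str:
    return (parent or "") + "a"


def agent_b(parent: Optional[str]) -> str:
    return (parent or "") + "b"
```

With `depth=2` and `width=2`, each first-level candidate is refined once by each agent, so the leaves mix contributions from both policies. In a real system the agents would be language models and `reward_fn` would run unit tests on the generated code.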

Method

MARS² models a search tree as a multi-agent environment, where agents collaboratively refine solutions. It uses a path-level group advantage with tree-consistent reward shaping for credit assignment.
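The summary does not give the exact formulas, so the following is only a sketch of what a path-level group advantage could look like. It assumes two things that are not confirmed by the source: that reward shaping credits each node with its improvement over its parent, and that advantages are obtained by a GRPO-style normalization of path returns across the group of root-to-leaf paths. The function names `shaped_path_return` and `path_group_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def shaped_path_return(node_rewards: List[float], root_value: float = 0.0) -> float:
    """Sum of per-step improvements over the parent (assumed shaping scheme).

    The sum telescopes to (last reward - root_value), so the shaped return
    stays consistent with the final outcome of the path.
    """
    total = 0.0
    previous = root_value
    for reward in node_rewards:
        total += reward - previous
        previous = reward
    return total


def path_group_advantages(paths: List[List[float]], eps: float = 1e-8) -> List[float]:
    """Normalize each path's return against the group of sibling paths.

    This is a GRPO-style mean/std normalization used for illustration only.
    """
    returns = [shaped_path_return(p) for p in paths]
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(g - mu) / (sigma + eps) for g in returns]
```

Under these assumptions, a path whose final candidate scores above the group average receives a positive advantage, and that advantage could then be applied to every agent that contributed a step along the path. When all paths have the same return, every advantage is zero and no path is preferred.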

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.