Diverse reasoning traces teach LLMs to make better decisions
Summary
Amazon researchers introduced set-supervised fine-tuning (SSFT) and global forking policy optimization (GFPO) to train large language models (LLMs) that generate diverse reasoning paths. The work, presented at ICLR 2026, addresses mode collapse in traditional supervised fine-tuning (SFT) by modeling reasoning as a set of complete solution paths and using global forking tokens to elicit distinct reasoning modes. SSFT pairs multiple reasoning traces with dedicated forking tokens and uses a bipartite matching step to assign each trace to a token, which encourages each token to specialize in one behavior. GFPO, a reinforcement learning approach, then learns to select the most effective reasoning mode for a given input by optimizing the forking-token distribution against reward signals. The combined SSFT+GFPO approach outperformed SFT+GRPO, with 5% to 7% gains in single-shot accuracy on benchmarks like AIME 2025 and LiveCodeBench-v5, and it improved pass@k without compromising pass@1 accuracy.
Key takeaway
Machine learning engineers developing advanced LLM reasoning capabilities should consider combining SSFT and GFPO. The approach lets a model learn diverse reasoning strategies and select among them, improving single-shot accuracy by 5% to 7% on benchmarks like AIME 2025 and LiveCodeBench-v5. It counters mode collapse and improves pass@k without sacrificing pass@1, and the training pipeline and model weights are open-sourced, so you can build on them directly.
Key insights
Training LLMs with diverse reasoning traces and explicit mode selection tokens improves accuracy and prevents mode collapse.
Principles
- Reasoning can be modeled as a set of complete solution paths.
- Global forking tokens elicit distinct, specialized reasoning modes.
- Diverse reasoning improves pass@k without sacrificing pass@1.
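For readers less familiar with the two metrics in the last principle, the sketch below computes pass@k with the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), from n sampled solutions of which c are correct. The function name and the sample counts are illustrative assumptions; the source does not state which estimator the authors used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    This is the standard unbiased estimator; whether the original
    work evaluates pass@k this way is an assumption.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are wrong, any k draws contain a correct one.
    if n - c < k:
        return 1.0
    # Probability that k draws without replacement are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 16 samples, 4 of them correct.
single_shot = pass_at_k(16, 4, 1)   # equals the plain success rate c / n
eight_shot = pass_at_k(16, 4, 8)    # chance that at least one of 8 is correct
```

With k = 1 the estimator reduces to c / n, which is why pass@1 is read as single-shot accuracy, while larger k rewards a model whose samples cover more distinct correct strategies.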
Method
Set-supervised fine-tuning (SSFT) uses bipartite matching to assign diverse traces to global forking tokens. Global forking policy optimization (GFPO) then uses reinforcement learning to select the optimal token for inference.
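The matching step can be illustrated with a minimal sketch. It assumes a cost matrix in which cost[i][j] is the loss of reasoning trace i when conditioned on forking token j; that choice of cost, the function name, and the numbers are hypothetical and not taken from the original article. The sketch searches all one-to-one assignments by brute force, which is only practical for a handful of tokens; a real pipeline would more likely use a Hungarian-style solver.

```python
from itertools import permutations


def match_traces_to_tokens(cost):
    """Return (assignment, total_cost) for the cheapest one-to-one matching.

    cost[i][j] is a hypothetical loss for trace i under forking token j.
    assignment[i] is the token index assigned to trace i.
    """
    n_traces = len(cost)
    n_tokens = len(cost[0])
    if n_traces > n_tokens:
        raise ValueError("need at least as many tokens as traces")
    best_assignment, best_total = None, float("inf")
    # Try every way of giving each trace its own distinct token.
    for perm in permutations(range(n_tokens), n_traces):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_assignment, best_total = list(perm), total
    return best_assignment, best_total


# Illustrative costs: 3 traces (rows) by 3 forking tokens (columns).
example_cost = [
    [2.0, 0.5, 1.5],
    [0.4, 0.6, 1.0],
    [1.2, 0.9, 0.3],
]
assignment, total = match_traces_to_tokens(example_cost)
```

Because every trace must take a different token, no single token can absorb all the traces, which is one plausible reading of how the matching encourages specialized behaviors.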
In practice
- Use global forking tokens to elicit distinct reasoning strategies.
- Apply SSFT to learn multiple, diverse solution strategies.
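Once forking tokens elicit distinct strategies, something has to decide which token to emit. The toy sketch below treats that choice as a softmax distribution over tokens and pushes it toward higher reward with an exact policy-gradient update. It is not the GFPO objective, which this summary does not specify: the fixed per-token mean rewards, the function names, and the use of an exact expected gradient instead of sampled rollouts are all simplifying assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def optimize_token_policy(mean_reward, steps=500, lr=1.0):
    """Gradient ascent on expected reward over forking tokens.

    mean_reward[j] is a hypothetical average reward obtained when
    reasoning starts from forking token j. For a softmax policy the
    gradient of expected reward with respect to logit j is
    p_j * (r_j - expected_reward).
    """
    logits = [0.0] * len(mean_reward)  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, mean_reward))
        logits = [
            z + lr * p * (r - expected)
            for z, p, r in zip(logits, probs, mean_reward)
        ]
    return softmax(logits)


# Illustrative rewards for three forking tokens.
probs = optimize_token_policy([0.2, 0.8, 0.4])
best_token = max(range(len(probs)), key=probs.__getitem__)
```

In this toy setting the probability mass shifts toward the token with the highest mean reward. In the real method the selection is made for a given input, so the distribution would be produced by the model itself and not by a standalone logit vector.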
Topics
- Large Language Models
- Reasoning Traces
- Set-supervised Fine-tuning
- Global Forking Policy Optimization
- Reinforcement Learning
- Mode Collapse Prevention
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Amazon Science.