Diverse reasoning traces teach LLMs to make better decisions

2026-05-26 · Source: Amazon Science homepage · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, medium

Summary

Amazon researchers introduced set-supervised fine-tuning (SSFT) and global forking policy optimization (GFPO), two methods for training large language models (LLMs) to generate diverse reasoning paths. The work, presented at ICLR 2026, addresses the mode collapse seen in conventional supervised fine-tuning (SFT) by modeling reasoning as a set of complete solution paths and using global forking tokens to elicit distinct reasoning modes. SSFT pairs multiple reasoning traces with dedicated forking tokens, using a bipartite matching step to assign traces to tokens and encourage specialized behaviors. GFPO, a reinforcement learning approach, then learns to select the most effective reasoning mode for a given input by optimizing the forking-token distribution against reward signals. The combined SSFT+GFPO approach achieved 5% to 7% gains in single-shot accuracy on benchmarks such as AIME 2025 and LiveCodeBench-v5, outperforming SFT+GRPO, and improved pass@k without compromising pass@1 accuracy.

Key takeaway

Machine learning engineers building LLM reasoning capabilities should consider SSFT and GFPO. Together they let a model learn and select among diverse reasoning strategies, improving single-shot accuracy by 5% to 7% on benchmarks such as AIME 2025 and LiveCodeBench-v5. The open-sourced training pipeline and model weights offer a starting point for preventing mode collapse and improving pass@k without sacrificing pass@1.

Key insights

Training LLMs on diverse reasoning traces with explicit mode-selection tokens improves accuracy and prevents mode collapse.

Principles

Reasoning can be modeled as a set of complete solution paths.
Global forking tokens elicit distinct, specialized reasoning modes.
Diverse reasoning improves pass@k without sacrificing pass@1.
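
The pass@k and pass@1 metrics in the last principle can be made concrete with a short calculation. The sketch below uses the widely used unbiased estimator, 1 - C(n-c, k)/C(n, k), where n solutions are sampled and c of them are correct. The source does not say which estimator the authors used, so this choice and the function name `pass_at_k` are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    It is the probability that a random size-k subset of the n samples
    contains at least one correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate becomes exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples for one problem, 3 of them correct.
    print(round(pass_at_k(10, 3, 1), 4))
    print(round(pass_at_k(10, 3, 5), 4))
```

With 3 correct answers out of 10 samples, pass@1 is 0.3 while pass@5 is about 0.92. Extra attempts add the most when the samples are not near-duplicates of one another, which is one way to read the claim that diverse reasoning modes help pass@k.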

Method

Set-supervised fine-tuning (SSFT) uses bipartite matching to assign diverse reasoning traces to global forking tokens. Global forking policy optimization (GFPO) then uses reinforcement learning to learn which forking token to select for a given input.
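
The bipartite matching step can be pictured as a minimum-cost, one-to-one assignment between forking tokens and reasoning traces. The sketch below is a toy illustration, not the authors' implementation: the cost matrix (imagined here as a per-pair training loss), the function name `match_traces_to_tokens`, and the brute-force search are all assumptions. For larger sets, a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement for the exhaustive loop.

```python
from itertools import permutations


def match_traces_to_tokens(cost):
    """Return (assignment, total_cost) for a minimum-cost one-to-one matching.

    cost[t][r] is a hypothetical loss of reasoning trace r when it is
    conditioned on forking token t. assignment[t] is the index of the
    trace matched to token t. Exhaustive search, so only for small sets.
    """
    n_tokens = len(cost)
    n_traces = len(cost[0])
    if n_traces < n_tokens:
        raise ValueError("need at least as many traces as forking tokens")
    best_total, best = float("inf"), None
    # Each permutation gives every token a different trace.
    for perm in permutations(range(n_traces), n_tokens):
        total = sum(cost[t][r] for t, r in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm
    return list(best), best_total


if __name__ == "__main__":
    # Both tokens individually prefer trace 0 (lowest cost in each row).
    cost = [
        [0.25, 1.0],  # forking token 0
        [0.5, 1.5],   # forking token 1
    ]
    print(match_traces_to_tokens(cost))  # ([1, 0], 1.5)
```

In this example each token on its own would pick trace 0, which is the collapse the matching is meant to avoid. The one-to-one constraint forces the tokens onto different traces, and the cheapest way to do that pairs token 0 with trace 1 and token 1 with trace 0.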

In practice

Use global forking tokens to elicit distinct reasoning strategies.
Apply SSFT to learn multiple, diverse solution strategies.
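
Once each forking token elicits its own strategy, choosing which token to start from becomes a small decision problem driven by rewards. The sketch below is a deliberately simplified stand-in for that idea, not GFPO itself: it uses a plain REINFORCE update with a running-mean baseline over a softmax distribution, ignores the input entirely (GFPO is described as selecting per input), and the success rates, function names, and hyperparameters are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits into a probability distribution over forking tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, token, reward, baseline, lr=0.1):
    """One REINFORCE update after sampling `token` and observing `reward`.

    The gradient of log softmax(logits)[token] with respect to logit j
    is 1[j == token] - p_j.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        logit + lr * advantage * ((1.0 if j == token else 0.0) - probs[j])
        for j, logit in enumerate(logits)
    ]


def train(success_rates, steps=2000, lr=0.1, seed=0):
    """Learn a forking-token distribution from simulated 0/1 rewards.

    success_rates[t] is a hypothetical probability that the reasoning
    mode behind forking token t produces a correct answer.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(success_rates)
    baseline = 0.0
    for step in range(1, steps + 1):
        probs = softmax(logits)
        token = rng.choices(range(len(logits)), weights=probs)[0]
        reward = 1.0 if rng.random() < success_rates[token] else 0.0
        logits = reinforce_step(logits, token, reward, baseline, lr)
        baseline += (reward - baseline) / step  # running mean of rewards
    return softmax(logits)


if __name__ == "__main__":
    # Three hypothetical reasoning modes with different success rates.
    print([round(p, 3) for p in train([0.2, 0.8, 0.4])])
```

In this toy setup the learned distribution tends to concentrate on the mode with the highest simulated success rate. A per-input version would condition the logits on the prompt rather than keeping a single global vector.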

Topics

Large Language Models
Reasoning Traces
Set-supervised Fine-tuning
Global Forking Policy Optimization
Reinforcement Learning
Mode Collapse Prevention

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Amazon Science homepage.