INFUSER: Influence-Guided Self-Evolution Improves Reasoning
Summary
INFUSER is an iterative co-training framework that improves the reasoning of pretrained language models through self-evolution, with minimal external supervision. It pairs a Generator, which drafts questions and golden answers from unstructured documents, with a Solver, which trains on that data. The Solver is rewarded for correctness. The Generator is rewarded by an optimizer-aware influence score, which measures how much a question helps the Solver improve on the target distribution. Because influence scores are noisy, INFUSER trains the Generator with DuGRPO, a dual-normalized GRPO variant, and the result is an adaptive curriculum. On Qwen3-8B-Base, INFUSER achieves over 20% relative improvement against strong self-evolution baselines on Olympiad and SuperGPQA benchmarks. An 8B co-evolving INFUSER Generator also outperforms a frozen 32B thinking generator on math and coding tasks, which suggests the framework generalizes beyond a single domain.
Key takeaway
For machine learning engineers building self-evolving language models, INFUSER offers a concrete recipe for improving reasoning: reward the question generator by how much each question improves the solver. Consider adopting its influence-guided co-training, and DuGRPO in particular, to build an adaptive curriculum that targets the questions most useful to the current solver. In the reported results, an 8B co-evolving generator outperformed a frozen 32B thinking generator on math and coding tasks.
Key insights
Rewarding a co-trained generator by its influence on solver improvement yields an adaptive curriculum that improves language model reasoning.
Principles
- Reward generators by solver improvement.
- Co-train the Generator and Solver so that each adapts to the other.
- Let the curriculum adapt to what the current Solver needs most.
Method
INFUSER co-trains a Generator, which drafts questions and golden answers, and a Solver, which trains on them. The Generator's reward is an optimizer-aware influence score. DuGRPO normalizes these noisy scores during Generator training, which produces an adaptive curriculum.
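To make the idea of an optimizer-aware influence score concrete, here is a minimal sketch. It is not the paper's formulation, which this summary does not give. It assumes a standard first-order estimate: the change in target loss caused by one training step is roughly the dot product of the target gradient with the parameter update. "Optimizer-aware" is read here as using the optimizer's actual update (for example an Adam-style rescaled step) instead of the raw gradient. All function names and the toy vectors are hypothetical.

```python
import numpy as np

def sgd_update(grad, lr):
    """Parameter change from one plain SGD step on a question's gradient."""
    return -lr * grad

def adam_like_update(grad, second_moment, lr, eps=1e-8):
    """Parameter change from a simplified Adam-style step.

    Assumption: no momentum and no bias correction, only the
    per-coordinate rescaling by the root of the second-moment estimate.
    """
    return -lr * grad / (np.sqrt(second_moment) + eps)

def influence_score(update, target_grad):
    """First-order estimate of how much `update` lowers the target loss.

    L_target(theta + update) - L_target(theta) ~= target_grad . update,
    so the estimated improvement is the negative of that dot product.
    """
    return -float(np.dot(target_grad, update))

# Toy 2-parameter example (values are illustrative only).
target_grad = np.array([1.0, 0.0])      # gradient of the target-distribution loss
grad_helpful = np.array([2.0, 0.0])     # question whose gradient aligns with the target
grad_harmful = np.array([-1.0, 3.0])    # question whose gradient opposes the target

score_helpful = influence_score(sgd_update(grad_helpful, lr=0.1), target_grad)
score_harmful = influence_score(sgd_update(grad_harmful, lr=0.1), target_grad)
print(score_helpful > 0, score_harmful < 0)
```

Under this reading, a question with a positive score would earn the Generator a higher reward, and swapping `sgd_update` for `adam_like_update` changes the score to reflect the step the optimizer would actually take.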
In practice
- Implement DuGRPO for noisy reward signals.
- Employ co-evolving generators for math/coding.
- Extend framework to finetuned LLMs.
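As background for the DuGRPO item above, the sketch below shows only the standard GRPO group-relative normalization that DuGRPO builds on. The summary does not describe DuGRPO's second normalization, so it is deliberately left out. The function name, the `eps` value, and the grouping (for example, questions drafted from the same document) are assumptions for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples.

    Each reward is centered on the group mean and divided by the group
    standard deviation, so the advantage depends on rank within the
    group and not on the raw reward scale.
    """
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + eps)

# Two groups whose raw scores differ by four orders of magnitude.
small_scale = group_relative_advantages([0.001, 0.002, 0.003])
large_scale = group_relative_advantages([10.0, 20.0, 30.0])
print(small_scale)
print(large_scale)
```

Both groups produce nearly identical advantages, which is why this kind of normalization is a common first step when raw rewards, such as influence scores, vary widely in scale.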
Topics
- Self-evolution
- Language Model Reasoning
- Co-training Frameworks
- DuGRPO
- Influence Scores
- Adaptive Curriculum
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.