INFUSER: Influence-Guided Self-Evolution Improves Reasoning

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

INFUSER is an iterative co-training framework that improves the reasoning of pretrained language models through self-evolution with minimal external supervision. It has two roles: a Generator, which drafts questions and golden answers from unstructured documents, and a Solver, which trains on that data. The Solver is rewarded for correctness. The Generator is rewarded by an optimizer-aware influence score that measures how useful a question is for improving the Solver on the target distribution. Because influence scores are noisy, INFUSER trains the Generator with DuGRPO, a dual-normalized GRPO variant, and the resulting reward signal yields an adaptive curriculum. On Qwen3-8B-Base, INFUSER achieves over 20% relative improvement against strong self-evolution baselines on Olympiad and SuperGPQA benchmarks. A co-evolving 8B INFUSER generator also outperforms a frozen 32B thinking generator on math and coding tasks, which suggests the approach generalizes beyond a single domain.

Key takeaway

For machine learning engineers building self-evolving language models, INFUSER offers a way to improve reasoning by rewarding the question generator for the questions that most help the current Solver. Consider adopting its influence-guided co-training approach, and DuGRPO in particular, to build adaptive curricula that track what the Solver currently needs. In the reported results, a co-evolving 8B generator outperforms a frozen 32B thinking generator on math and coding tasks.

Key insights

Influence-guided self-evolution with co-training and adaptive curricula significantly improves language model reasoning.

Principles

Reward generators by solver improvement.
Co-training roles enhance self-evolution.
Adaptive curricula optimize learning paths.

Method

INFUSER co-trains a Generator (drafts questions/answers) and a Solver (trains on them). Generator reward uses optimizer-aware influence scores, processed by DuGRPO, to form an adaptive curriculum.
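
To make the idea of an optimizer-aware influence score concrete, here is a minimal sketch under stated assumptions. This summary does not give INFUSER's actual formula, so the code uses a generic first-order estimate: how much the target loss would drop after one Adam-style step on a single generated question. The function names, the Adam-style preconditioner, and the toy gradient vectors are all illustrative assumptions, not taken from the paper or its repository.

```python
import numpy as np

def adam_style_update(grad, second_moment, lr=1e-3, eps=1e-8):
    """Assumed optimizer step: -lr * g / (sqrt(v) + eps), ignoring momentum and bias correction."""
    return -lr * grad / (np.sqrt(second_moment) + eps)

def influence_score(question_grad, target_grad, second_moment, lr=1e-3):
    """First-order estimate of the drop in target loss after one step on the question.

    L_target(theta + delta) is approximately L_target(theta) + target_grad . delta,
    so the estimated improvement is -(target_grad . delta).
    """
    delta = adam_style_update(question_grad, second_moment, lr)
    return float(-np.dot(target_grad, delta))

# Toy example (hypothetical 2-parameter model).
v = np.ones(2)                       # assumed running second-moment estimate
target = np.array([1.0, 0.0])        # gradient of the loss on the target distribution
helpful = np.array([1.0, 0.0])       # question whose gradient aligns with the target
harmful = np.array([-1.0, 0.0])      # question whose gradient opposes the target

print(influence_score(helpful, target, v, lr=0.1))  # positive: step reduces target loss
print(influence_score(harmful, target, v, lr=0.1))  # negative: step increases target loss
```

Under this sketch, a question earns a positive score when training on it moves the parameters in a direction that also lowers the target loss. Making the score depend on the optimizer's preconditioned step, rather than the raw gradient, is one plausible reading of "optimizer-aware".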

In practice

Implement DuGRPO for noisy reward signals.
Employ co-evolving generators for math/coding.
Extend framework to finetuned LLMs.
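
The first item above concerns noisy reward signals. This summary does not define DuGRPO's two normalizations, so the sketch below is not DuGRPO. It only illustrates why the choice of normalization matters when rewards are noisy. It compares the standard GRPO-style per-group z-score with a hypothetical alternative that centers rewards within each group but scales by a batch-wide standard deviation. Both function names and the toy reward matrix are assumptions for illustration.

```python
import numpy as np

def group_zscore(rewards, eps=1e-6):
    """Standard GRPO-style advantage: z-score rewards within each group (row)."""
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def group_center_batch_scale(rewards, eps=1e-6):
    """Hypothetical variant: center within each group, then scale by one batch-wide std."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    return centered / (centered.std() + eps)

# Two groups of four samples each (toy numbers).
# Group 0: rewards differ only by tiny noise. Group 1: rewards differ meaningfully.
rewards = np.array([
    [1.000, 1.001, 0.999, 1.000],
    [0.000, 2.000, -2.000, 0.000],
])

per_group = group_zscore(rewards)
batch_scaled = group_center_batch_scale(rewards)

# Per-group z-scoring inflates the noise in group 0 to order-one advantages,
# while batch-wide scaling keeps those advantages close to zero.
print(np.abs(per_group[0]).max())
print(np.abs(batch_scaled[0]).max())
```

The design point is that per-group standardization treats every group as equally informative, even when the spread inside a group is mostly noise. Any scheme meant for noisy influence scores presumably has to avoid that amplification in some way.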

Topics

Self-evolution
Language Model Reasoning
Co-training Frameworks
DuGRPO
Influence Scores
Adaptive Curriculum

Code references

FFishy-git/INFUSER

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.