INFUSER: Influence-Guided Self-Evolution Improves Reasoning

Source: Machine Learning · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, quick

Summary

INFUSER is an iterative co-training framework that improves the reasoning of pretrained language models through self-evolution, with minimal external supervision. It pairs a Generator, which drafts questions and gold answers from unstructured documents, with a Solver, which trains on that data. The Solver is rewarded for correctness. The Generator is rewarded by an optimizer-aware influence score that measures how much a question helps improve the Solver on the target distribution. Because influence scores are noisy, the Generator is trained with DuGRPO, a dual-normalized GRPO variant, and the result is an adaptive curriculum. On Qwen3-8B-Base, INFUSER achieves over 20% relative improvement over strong self-evolution baselines on Olympiad and SuperGPQA benchmarks. An 8B co-evolving INFUSER Generator also outperforms a frozen 32B thinking generator on math and coding tasks, which suggests the approach generalizes across domains.
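
To make the idea of an influence score concrete, here is a minimal sketch of a generic first-order influence estimate. It is not INFUSER's actual formula, which this summary does not give. It assumes "optimizer-aware" means the training gradient is passed through the optimizer's update rule (here a simplified Adam-style rescaling by a second-moment estimate) before being compared with the target gradient. The function name, its parameters, and the toy gradients are all hypothetical.

```python
import math


def first_order_influence(train_grad, target_grad, second_moment=None,
                          lr=1e-3, eps=1e-8):
    """Estimate how much one update on a training example lowers target loss.

    A first-order Taylor expansion gives:
        loss_target(theta - update) ~= loss_target(theta) - update . target_grad
    so the estimated loss reduction is the dot product update . target_grad.
    """
    if second_moment is None:
        # Plain SGD step: the update is the scaled training gradient.
        update = [lr * g for g in train_grad]
    else:
        # Assumption: a simplified Adam-style step that rescales each
        # coordinate by the square root of a second-moment estimate.
        update = [lr * g / (math.sqrt(v) + eps)
                  for g, v in zip(train_grad, second_moment)]
    return sum(u * t for u, t in zip(update, target_grad))


# Toy gradients: a question whose gradient aligns with the target gradient
# scores positive, and one that opposes it scores negative.
helpful = first_order_influence([1.0, 0.0], [1.0, 0.0], lr=0.1)
harmful = first_order_influence([-1.0, 0.0], [1.0, 0.0], lr=0.1)
```

Under this reading, a positive score marks a question whose training update is expected to reduce loss on the target distribution, which is the kind of signal the Generator would be rewarded for.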

Key takeaway

For machine learning engineers building self-evolving language models, INFUSER offers a framework for improving reasoning. Consider adopting its influence-guided co-training approach, and DuGRPO in particular, to build adaptive curricula that focus on the questions most useful to your Solver at its current stage. In the reported results, this let an 8B co-evolving Generator outperform a frozen 32B thinking generator on math and coding tasks.

Key insights

Influence-guided self-evolution, which combines co-training with an adaptive curriculum, improves language model reasoning: over 20% relative improvement on Qwen3-8B-Base against strong self-evolution baselines.

Method

INFUSER co-trains a Generator, which drafts questions and answers, and a Solver, which trains on them. The Generator's reward is an optimizer-aware influence score, which DuGRPO processes to form an adaptive curriculum.
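
For context on the GRPO side, the sketch below shows the standard group-relative normalization that GRPO applies to rewards: each sampled output's reward is centered and scaled by the statistics of its own group. This is the baseline that DuGRPO modifies, not DuGRPO itself. The summary does not describe the second normalization, so it is not reproduced here. The function name and the example rewards are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages for one group of sampled outputs.

    Each reward is centered on the group mean and divided by the group
    standard deviation, so the advantage reflects how an output compares
    with its siblings rather than its absolute reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical influence-based rewards for three generated questions.
advantages = group_relative_advantages([1.0, 2.0, 3.0])

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5])
```

Dividing by the group standard deviation amplifies small differences when rewards in a group are nearly equal. That is one plausible reason noisy influence scores call for extra normalization, though the exact motivation is not stated in this summary.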

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.