Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning
Summary
MAST (Mechanism-Aligned Selective Targeting) is a mechanism-guided method for selectively unlearning RLVR-induced reasoning in language models, with less collateral damage than standard full-parameter updates. It is motivated by the observation that the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability. Full-parameter gradient ascent, the traditional unlearning approach, often degrades performance on retained tasks such as MATH and GSM8K. MAST mitigates this by ranking attention-projection tensors on three criteria (off-principal energy, update magnitude, and forget-gradient coupling magnitude) and updating only the top-ranked subset. On the primary model, MAST achieved statistically significant target forgetting, lowering the MATH forget score from 45/150 to 37/150 (McNemar p=0.0078), while leaving GSM8K (+0.8 pp) and MATH retain (-0.5 pp) essentially unchanged. The advantage held across seeds, across NPO and SimNPO objectives, and on Qwen3 models, where full-parameter unlearning caused GSM8K performance to collapse.
Key takeaway
If you develop or deploy large language models, especially ones fine-tuned with RLVR, consider mechanism-guided selective unlearning methods such as MAST. This approach removes specific undesirable reasoning patterns without significantly degrading performance on other critical tasks. Because it updates only the most relevant attention-projection tensors, it can forget the target behavior while preserving model utility in settings where full-parameter unlearning causes collapse. Evaluate the forgetting effect with a paired statistical test such as McNemar's.
Key insights
Selective unlearning via mechanism-aligned targeting reduces collateral damage in RLVR-induced reasoning models.
Principles
- RLVR increments differ from SFT updates.
- Targeted unlearning preserves general capabilities.
- Ranking tensors guides selective updates.
Method
MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset for selective unlearning.
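The ranking-and-selection step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the summary does not say how the three criteria are combined, so the sketch assumes a simple average of per-criterion ranks. The tensor names and score values are hypothetical placeholders; in a real run the scores would be computed from the SFT and RLVR checkpoints and from gradients on the forget set.

```python
from typing import Dict, List

# Hypothetical per-tensor scores (placeholders, not values from the paper).
scores: Dict[str, Dict[str, float]] = {
    "layers.0.q_proj": {"off_principal_energy": 0.62, "update_magnitude": 1.8, "forget_coupling": 0.40},
    "layers.0.o_proj": {"off_principal_energy": 0.15, "update_magnitude": 0.9, "forget_coupling": 0.05},
    "layers.1.q_proj": {"off_principal_energy": 0.48, "update_magnitude": 2.1, "forget_coupling": 0.55},
    "layers.1.v_proj": {"off_principal_energy": 0.30, "update_magnitude": 0.4, "forget_coupling": 0.10},
}

CRITERIA = ("off_principal_energy", "update_magnitude", "forget_coupling")


def rank_positions(values: Dict[str, float]) -> Dict[str, int]:
    """Map each tensor name to its rank under one criterion (0 = largest value)."""
    ordered = sorted(values, key=lambda name: values[name], reverse=True)
    return {name: position for position, name in enumerate(ordered)}


def select_tensors(all_scores: Dict[str, Dict[str, float]], top_k: int) -> List[str]:
    """Return the top_k tensors by mean rank across the three criteria.

    Averaging ranks is an assumption made for this sketch; the source only
    states that tensors are ranked by the three criteria.
    """
    per_criterion = [
        rank_positions({name: s[criterion] for name, s in all_scores.items()})
        for criterion in CRITERIA
    ]
    mean_rank = {
        name: sum(ranks[name] for ranks in per_criterion) / len(CRITERIA)
        for name in all_scores
    }
    # Ties are broken by name so the result is deterministic.
    return sorted(all_scores, key=lambda name: (mean_rank[name], name))[:top_k]


selected = select_tensors(scores, top_k=2)
print(selected)
```

Only the tensors in `selected` would then receive unlearning updates; every other parameter stays frozen, which is what limits collateral damage relative to a full-parameter update.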
In practice
- Apply MAST to mitigate unlearning side effects.
- Evaluate unlearning with McNemar's test on paired before/after results.
- Consider NPO/SimNPO objectives.
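For the evaluation step, an exact McNemar test needs only the two discordant counts from paired before/after results on the same problems. The sketch below is a plain-Python version of the two-sided exact (binomial) test. The summary reports only the marginal scores (45/150 and 37/150), so the discordant split used here, 8 problems solved before unlearning but not after and 0 the other way, is an assumption; it matches the net change of 8 and reproduces the reported p-value.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items correct before unlearning and incorrect after.
    c: items incorrect before unlearning and correct after.
    Under the null hypothesis, each discordant pair falls in either
    cell with probability 0.5, so the test is a binomial tail.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of change
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed discordant split (not stated in the summary): b=8, c=0.
p_value = mcnemar_exact(b=8, c=0)
print(f"{p_value:.4f}")  # 0.0078
```

Because the test uses only discordant pairs, the 142 problems whose outcome did not change have no effect on the p-value.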
Topics
- Mechanism-Aligned Selective Targeting
- RLVR Unlearning
- Language Model Fine-tuning
- Attention-Projection Tensors
- Qwen2.5-Math-1.5B
- Qwen3-1.7B-Base
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.