Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

MAST (Mechanism-Aligned Selective Targeting) is a mechanism-guided method for selectively unlearning RLVR-induced reasoning in language models with less collateral damage than standard full-parameter updates. It builds on the observation that the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability. Full-parameter gradient ascent, the conventional unlearning approach, often degrades performance on retained tasks such as MATH and GSM8K. MAST instead ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST achieved statistically significant target forgetting, reducing the MATH forget-set score from 45/150 to 37/150 (McNemar p=0.0078) while leaving GSM8K (+0.8 pp) and MATH retain (-0.5 pp) essentially unchanged. The advantage held across seeds, NPO/SimNPO objectives, and Qwen3 models, where full-parameter unlearning caused GSM8K to collapse.

Key takeaway

AI scientists developing or deploying large language models, especially those fine-tuned with RLVR, should consider mechanism-guided selective unlearning methods such as MAST. The approach removes specific undesirable reasoning patterns without significantly degrading performance on other critical tasks. Targeting only the relevant attention-projection tensors yields precise forgetting and preserves model utility where full-parameter unlearning would cause collapse. Evaluate the forgetting effect with a paired statistical test such as McNemar's.

Key insights

Selective unlearning via mechanism-aligned targeting reduces collateral damage in RLVR-induced reasoning models.

Principles

RLVR increments differ from SFT updates.
Targeted unlearning preserves general capabilities.
Ranking tensors guides selective updates.

Method

MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset for selective unlearning.
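The rank-then-mask idea can be illustrated with a minimal NumPy sketch. The summary names the three scores but does not define them, so every formula below is an assumption made for illustration: off-principal energy is taken as the share of the RLVR-minus-SFT delta lying outside the top-k left singular directions of the SFT weight, update magnitude as the Frobenius norm of that delta, and coupling as the absolute cosine between the forget gradient and the delta. The function names, the rank-averaging rule, and the `fraction` and `k` parameters are likewise hypothetical and are not MAST's actual implementation.

```python
import numpy as np

def off_principal_energy(w_base, delta, k=4):
    # ASSUMED definition: fraction of delta's squared Frobenius norm that lies
    # outside the span of the top-k left singular vectors of the base weight.
    total = np.linalg.norm(delta) ** 2
    if total == 0.0:
        return 0.0
    u, _, _ = np.linalg.svd(w_base, full_matrices=False)
    inside = np.linalg.norm(u[:, :k].T @ delta) ** 2
    return float(min(1.0, max(0.0, 1.0 - inside / total)))

def coupling(grad, delta):
    # ASSUMED definition: absolute cosine similarity between the forget-set
    # gradient and the weight delta, both treated as flat vectors.
    denom = np.linalg.norm(grad) * np.linalg.norm(delta)
    if denom == 0.0:
        return 0.0
    return abs(float(np.sum(grad * delta))) / denom

def rank_tensors(base, tuned, forget_grads, k=4):
    # base, tuned, forget_grads: dicts mapping tensor name -> 2-D array.
    names = sorted(base)
    scores = np.array([
        [off_principal_energy(base[n], tuned[n] - base[n], k),
         np.linalg.norm(tuned[n] - base[n]),
         coupling(forget_grads[n], tuned[n] - base[n])]
        for n in names
    ])
    # Turn each score column into ranks (0 = smallest), then average the ranks.
    # Rank averaging is an illustrative choice for combining the three scores.
    ranks = scores.argsort(axis=0).argsort(axis=0)
    combined = ranks.mean(axis=1)
    order = np.argsort(-combined, kind="stable")
    return [names[i] for i in order]

def select_top(ranked, fraction=0.25):
    # Keep the top-ranked share of tensors, always at least one.
    m = max(1, int(round(len(ranked) * fraction)))
    return set(ranked[:m])

def masked_ascent_step(params, grads, selected, lr=1e-3):
    # Gradient ascent on the forget loss, applied only to selected tensors;
    # all other tensors are returned unchanged.
    return {n: (w + lr * grads[n]) if n in selected else w.copy()
            for n, w in params.items()}
```

In a real model the dict keys would be the names of the attention query, key, value, and output projection weights, and the masked step would be replaced by the chosen unlearning objective's optimizer update restricted to the selected tensors.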

In practice

Apply MAST to mitigate unlearning side effects.
Evaluate unlearning with the McNemar test.
Consider NPO/SimNPO objectives.
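For the evaluation step, the exact McNemar test compares paired before-and-after outcomes on the same items and depends only on the discordant pairs. A minimal standard-library sketch follows. The summary reports only the marginal counts (45/150 before, 37/150 after) and p=0.0078, not the paired table, so the split used in the example (8 items flipping from correct to incorrect, none flipping back) is a hypothetical configuration that happens to match the net change of 8 and reproduces the reported p-value; it is not a figure from the source. The function name `mcnemar_exact` is also illustrative.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items correct before and incorrect after
    c: items incorrect before and correct after
    Under the null hypothesis each discordant pair is equally likely to fall
    in either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant split consistent with a net change of 8 items.
print(mcnemar_exact(8, 0))  # 0.0078125
```

Because the test ignores items whose outcome did not change, it needs per-item correctness before and after unlearning, not just the two aggregate scores.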

Topics

Mechanism-Aligned Selective Targeting
RLVR Unlearning
Language Model Fine-tuning
Attention-Projection Tensors
Qwen2.5-Math-1.5B
Qwen3-1.7B-Base

Best for: Research Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.