Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution
Summary
A new study shows that standard machine unlearning methods, which are evaluated in full precision, systematically fail once a language model undergoes 4-bit post-training quantization (PTQ). The authors attribute the failure to update size: per-parameter unlearning updates are 47 to 828 times smaller than the NF4 quantization bin width, so they cannot push weights across quantization boundaries. This leads to a sparsity-permanence tradeoff.
The research introduces MANSU (Mechanistic-Aligned Null-Space Unlearning), which combines three components: causal circuit attribution to identify the minimal subgraph responsible for the forget set, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor so that updates survive quantization. It also proposes Circuit Attribution Divergence (CAD), a metric for verifying structural erasure as opposed to mere behavioral suppression. MANSU is the first method shown to achieve meaningful forgetting, retain-set preservation, a non-positive PTQ gap, and structural erasure across several model families and hazard benchmarks. By contrast, gradient-based baselines can recover up to +0.05 accuracy after compression.
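The bin-width argument can be illustrated with a toy NF4 round-trip. This is a minimal sketch, not the paper's code: it assumes blockwise absmax scaling with nearest-level rounding, uses the 16 NF4 levels of the QLoRA/bitsandbytes NF4 data type rounded to four decimals, and uses a tiny 8-weight block (QLoRA's setup uses blocks of 64). The block values and the 1e-4 "unlearning" update are made up for illustration.

```python
# Toy NF4 round-trip: blockwise absmax scaling + nearest-level rounding.
# NF4 levels as in QLoRA / bitsandbytes, rounded to 4 decimals.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Scale a block by its absmax, then snap each weight to the nearest NF4 level index."""
    scale = max(abs(w) for w in weights) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(NF4_LEVELS)), key=lambda i: abs(w / scale - NF4_LEVELS[i]))
        for w in weights
    ]
    return codes, scale


def dequantize_block(codes, scale):
    """Map NF4 level indices back to weights."""
    return [NF4_LEVELS[c] * scale for c in codes]


def narrowest_bin(scale):
    """Smallest gap between adjacent NF4 levels, expressed in weight units."""
    return min(b - a for a, b in zip(NF4_LEVELS, NF4_LEVELS[1:])) * scale


block = [0.040, -0.021, 0.0064, 0.0, 0.0135, -0.028, 0.018, -0.0036]
# Hypothetical unlearning update: 1e-4 per weight, absmax weight left alone.
delta = [0.0, 1e-4, -1e-4, 1e-4, -1e-4, 1e-4, -1e-4, 1e-4]
unlearned = [w + d for w, d in zip(block, delta)]

codes_before, scale = quantize_block(block)
codes_after, scale_after = quantize_block(unlearned)

# Every weight but the first moved, yet all NF4 codes (and the scale) are unchanged.
print(codes_before == codes_after, scale == scale_after)  # True True
print(narrowest_bin(scale) / 1e-4)  # roughly 31.8: the update is ~32x narrower than the smallest bin

# An update comparable to the bin width does change the stored code.
large = list(block)
large[3] += 0.01  # 0.0 -> 0.01, i.e. 0.25 after scaling, nearest level 0.2461
print(quantize_block(large)[0][3] != codes_before[3])  # True
```

In this toy setting the quantized model after the small update is bit-for-bit identical to the one before it, which is the mechanism behind unlearning that disappears under PTQ.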
Key takeaway
If you develop machine unlearning techniques for deployed large language models, you must account for the impact of post-training quantization: gradient-based unlearning is likely to be reversed by 4-bit compression. Prioritize methods such as MANSU that are designed to survive quantization and achieve structural erasure, and use metrics such as Circuit Attribution Divergence to confirm that forgetting is genuine, so that unlearning stays permanent and effective in real-world deployments.
Key insights
Quantization can reverse machine unlearning, necessitating methods that ensure forgetting persists post-compression.
Principles
- Per-parameter updates must exceed the quantization bin width.
- Structural erasure differs from behavioral suppression.
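The first principle is what MANSU's per-parameter magnitude floor addresses. The summary does not give the exact rule, so this sketch assumes a floor proportional to each parameter's local bin width; `apply_magnitude_floor`, `bin_width`, and `floor_ratio` are hypothetical names. Raising an update to the bin width makes a code change much more likely but does not guarantee one on a non-uniform grid like NF4, because decision boundaries sit at midpoints between unevenly spaced levels.

```python
import math


def apply_magnitude_floor(delta, bin_width, floor_ratio=1.0):
    """Raise every nonzero update to at least floor_ratio * its local bin width.

    Hypothetical rule for illustration: zero entries stay zero (the edit's support,
    and hence its sparsity, is unchanged) and signs are preserved.
    """
    floored = []
    for d, width in zip(delta, bin_width):
        if d == 0.0:
            floored.append(0.0)
        else:
            floored.append(math.copysign(max(abs(d), floor_ratio * width), d))
    return floored


# Usage: two tiny updates get lifted to the bin width, a large one is left as is.
print(apply_magnitude_floor([0.0, 1e-4, -2e-4, 0.01], [0.003, 0.003, 0.004, 0.003]))
# [0.0, 0.003, -0.004, 0.01]
```

The floor only changes the size of entries that already move; which parameters are edited is decided elsewhere (in MANSU, by circuit attribution).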
Method
MANSU uses causal circuit attribution, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor to ensure quantization-permanent unlearning.
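A rough sketch of the projection step, under stated assumptions: the forget-loss ascent direction is restricted to a binary circuit mask, projected onto the null space of the (masked) retain-set gradients via Gram-Schmidt, and then rescaled so that the second-order retain-loss proxy 0.5 * sum_i F_i * d_i^2 stays within a budget `eps`. That reading of the "diagonal-Fisher retain bound" is an assumption, as are `circuit_nullspace_step` and all parameter names; the paper's formulation may differ, and the magnitude floor is omitted here.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def circuit_nullspace_step(forget_grad, retain_grads, circuit_mask, fisher_diag, eps):
    """Hypothetical MANSU-style update direction (illustrative only).

    1. Keep only circuit parameters of the forget-loss ascent direction.
    2. Remove components along the (masked) retain gradients, so the update is
       orthogonal to every retain gradient (no first-order retain-loss change).
    3. Rescale so the diagonal-Fisher proxy 0.5 * sum_i F_i * d_i**2 <= eps.
    """
    step = [g if m else 0.0 for g, m in zip(forget_grad, circuit_mask)]

    # Orthonormal basis of the masked retain gradients via Gram-Schmidt.
    basis = []
    for r in retain_grads:
        v = [x if m else 0.0 for x, m in zip(r, circuit_mask)]
        for b in basis:
            c = _dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(_dot(v, v))
        if norm > 1e-12:  # skip retain gradients already spanned by the basis
            basis.append([vi / norm for vi in v])

    # Project the step onto the null space of the retain gradients.
    for b in basis:
        c = _dot(step, b)
        step = [si - c * bi for si, bi in zip(step, b)]

    # Enforce the diagonal-Fisher retain bound by uniform rescaling.
    cost = 0.5 * sum(f * s * s for f, s in zip(fisher_diag, step))
    if cost > eps:
        shrink = math.sqrt(eps / cost)
        step = [shrink * s for s in step]
    return step


# Usage: parameter 3 lies outside the circuit; there is one retain gradient.
step = circuit_nullspace_step(
    forget_grad=[1.0, 2.0, 3.0, 4.0],
    retain_grads=[[1.0, 0.0, 0.0, 5.0]],
    circuit_mask=[True, True, True, False],
    fisher_diag=[1.0, 1.0, 1.0, 1.0],
    eps=100.0,
)
print(step)  # [0.0, 2.0, 3.0, 0.0]
```

Because the step is zero outside the mask, orthogonality to the masked retain gradients implies orthogonality to the full retain gradients as well.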
In practice
- Use MANSU for robust unlearning in quantized models.
- Employ CAD to verify structural erasure.
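The summary does not give CAD's formula. As an illustration only, this sketch assumes CAD compares per-component attribution scores on the forget set before and after unlearning. One plausible reading of the distinction is that mere behavioral suppression leaves the same components carrying the attribution in roughly the same proportions, whereas structural erasure shifts it. Total variation distance between normalized absolute scores stands in for the real metric here; `attribution_divergence`, the component names, and the scores are all hypothetical.

```python
def attribution_divergence(before, after):
    """Hypothetical CAD stand-in: total variation distance between normalized
    |attribution| distributions over circuit components (0 = same pattern, 1 = disjoint)."""
    keys = set(before) | set(after)

    def normalize(scores):
        total = sum(abs(scores.get(k, 0.0)) for k in keys)
        if total == 0:
            raise ValueError("attribution scores are all zero")
        return {k: abs(scores.get(k, 0.0)) / total for k in keys}

    p, q = normalize(before), normalize(after)
    return 0.5 * sum(abs(p[k] - q[k]) for k in keys)


# Hypothetical attribution scores on the forget set (component names are made up).
original = {"L3.head5": 0.6, "L7.mlp": 0.3, "L9.head2": 0.1}
suppressed = {"L3.head5": 0.3, "L7.mlp": 0.15, "L9.head2": 0.05}  # same pattern, weaker
rewired = {"L3.head5": 0.05, "L7.mlp": 0.05, "L11.head0": 0.4}    # attribution moved elsewhere

print(round(attribution_divergence(original, suppressed), 6))  # 0.0
print(round(attribution_divergence(original, rewired), 6))     # 0.8
```

Normalizing before comparing is the design choice that matters here: it makes a uniformly weakened circuit score near zero, so only a change in which components carry the computation registers as divergence.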
Topics
- Machine Unlearning
- Post-Training Quantization
- Quantization-Permanent Unlearning
- Causal Circuit Attribution
- Null-Space Projection
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published in Machine Learning.