An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations
Summary
Many machine unlearning (MU) methods report promising results in erasing data influence, yet simple fine-tuning can reintroduce the erased concepts. This research examines the internal representations of unlearned models and finds that many apparently successful MU methods owe their results to "feature-classifier misalignment" rather than true unlearning. Hidden features often remain highly discriminative, so near-original accuracy can be recovered with simple linear probing. Classifier-only fine-tuning experiments show that adjusting the classifier alone can drive forget accuracy to negligible levels while preserving retain accuracy. Motivated by these observations, the authors propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces the information about forgotten data that remains in representations while maintaining high retain accuracy.
Key takeaway
Research scientists developing or evaluating machine unlearning techniques should prioritize representation-level analysis over output-level metrics to assess unlearning effectiveness accurately. Relying solely on output metrics risks deploying models that appear unlearned but retain sensitive information in their hidden features, which leaves that information recoverable. Use methods such as linear probing to verify true erasure, and consider CMF-based classifiers for more robust unlearning.
Key insights
Many machine unlearning methods create feature-classifier misalignment, not true erasure, leaving hidden features discriminative.
Principles
- Internal representations reveal true unlearning.
- Feature-classifier misalignment can mask unlearning failures.
- Classifier-only adjustments can drive forget accuracy to negligible levels while preserving retain accuracy.
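The point about classifier-only adjustments can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's procedure: the synthetic 2-D features, the linear head whose weights equal the cluster centers, and the crude "unlearning" step that only suppresses the bias of class 0. The features are never modified, yet forget accuracy measured at the output collapses.

```python
import numpy as np

# Hypothetical frozen features: three well-separated clusters, class 0 is the forget class.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [-2.0, 3.464], [-2.0, -3.464]])
labels = np.repeat(np.arange(3), 100)
features = centers[labels] + 0.3 * rng.standard_normal((300, 2))

def head_accuracy(weights, bias, feats, labs):
    """Accuracy of a linear head (logits = feats @ weights + bias)."""
    preds = (feats @ weights + bias).argmax(axis=1)
    return float((preds == labs).mean())

# Illustrative "original" head: one weight column per class, aligned with the features.
weights = centers.T.copy()
bias = np.zeros(3)

forget = labels == 0
retain = ~forget

forget_before = head_accuracy(weights, bias, features[forget], labels[forget])

# Classifier-only edit: push the forget-class logit far down, leave features untouched.
bias_edited = bias.copy()
bias_edited[0] = -1e6

forget_after = head_accuracy(weights, bias_edited, features[forget], labels[forget])
retain_after = head_accuracy(weights, bias_edited, features[retain], labels[retain])
```

In this toy setting the output-level numbers look like successful unlearning, even though the feature clusters for the forget class are exactly as separable as before.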
Method
The proposed MU methods use a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers and thereby reduces the information about forgotten data that remains in representations.
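As a rough illustration of what a classifier built from class-mean features might look like, the sketch below uses a nearest-class-mean rule. This is one common reading of the idea and an assumption here; the summary does not give the paper's exact formulation, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def class_means(features, labels, num_classes):
    """Return a (num_classes, dim) array holding the mean feature of each class."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def cmf_predict(features, means):
    """Assign each feature vector to the class whose mean is nearest (squared Euclidean)."""
    dists = ((features[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Hypothetical features standing in for a network's penultimate-layer outputs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
labels = np.repeat(np.arange(3), 50)
features = centers[labels] + 0.5 * rng.standard_normal((150, 2))

means = class_means(features, labels, num_classes=3)
cmf_accuracy = float((cmf_predict(features, means) == labels).mean())
```

Because such a classifier is computed directly from the features, there is no separately trained head that can drift out of alignment with them, so changing its predictions requires changing the features themselves.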
In practice
- Evaluate MU methods using representation-level analysis.
- Consider linear probing to assess hidden feature discriminability.
- Explore CMF-based classifiers for robust unlearning.
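A linear probe of the kind suggested above could be sketched as follows. This is a minimal example under stated assumptions: the features are synthetic stand-ins for the frozen hidden features of an unlearned model, and the probe is a ridge-regularized least-squares fit to one-hot labels rather than the logistic-regression probe often used in practice. The names `fit_linear_probe` and `probe_accuracy` are hypothetical.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a linear map (with bias) from features to one-hot labels by ridge least squares."""
    X = np.hstack([features, np.ones((len(features), 1))])
    Y = np.eye(num_classes)[labels]
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(W, features, labels):
    """Accuracy of the probe's argmax predictions."""
    X = np.hstack([features, np.ones((len(features), 1))])
    return float(((X @ W).argmax(axis=1) == labels).mean())

# Hypothetical frozen features; class 0 plays the role of the forget class.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [-2.0, 3.464], [-2.0, -3.464]])

def sample(n_per_class):
    labels = np.repeat(np.arange(3), n_per_class)
    return centers[labels] + 0.3 * rng.standard_normal((len(labels), 2)), labels

train_x, train_y = sample(100)
test_x, test_y = sample(100)

W = fit_linear_probe(train_x, train_y, num_classes=3)
forget_mask = test_y == 0
probe_forget_accuracy = probe_accuracy(W, test_x[forget_mask], test_y[forget_mask])
```

If a probe trained on frozen features recovers high accuracy on held-out forget-class samples, the representation still encodes that class, whatever the model's own output layer reports.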
Topics
- Machine Unlearning
- Internal Representations
- Feature-Classifier Misalignment
- Neural Collapse
- Class-Mean Features Classifier
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.