An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations

2026-04-09 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Many machine unlearning (MU) methods, despite promising results in erasing data influence, are vulnerable to having erased concepts reintroduced through simple fine-tuning. This research examines the internal representations of unlearned models and finds that many apparently successful MU methods owe their results to "feature-classifier misalignment" rather than true unlearning. Hidden features often remain highly discriminative, so near-original accuracy can be recovered via simple linear probing. The study shows, through experiments with classifier-only fine-tuning, that adjusting only the classifier can drive forget accuracy to a negligible level while preserving retain accuracy. Motivated by these observations, the authors propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces the forget-set information remaining in representations while maintaining high retain accuracy.

Key takeaway

Research scientists developing or evaluating machine unlearning techniques should prioritize representation-level analysis over output-level behavior to assess unlearning effectiveness accurately. Relying solely on output metrics risks deploying models that appear unlearned but retain sensitive information in their hidden features, leaving that information open to recovery. Use methods such as linear probing to verify true erasure, and consider CMF-based classifiers for more robust unlearning.

Key insights

Many machine unlearning methods create feature-classifier misalignment, not true erasure, leaving hidden features discriminative.

Principles

Internal representations reveal true unlearning.
Feature-classifier misalignment can mask unlearning failures.
Classifier-only adjustments can drive forget accuracy to a negligible level while preserving retain accuracy.
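
The last principle can be illustrated with a minimal sketch, assuming a frozen feature extractor followed by a linear head. The toy features, the identity weights, and the bias-suppression edit are hypothetical illustrations; the study itself used classifier-only fine-tuning, which this sketch does not reproduce.

```python
def predict(features, weights, biases):
    """Return the argmax class of a linear head applied to one feature vector."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return max(range(len(logits)), key=logits.__getitem__)


def accuracy(samples, weights, biases):
    """Fraction of (features, label) pairs the head classifies correctly."""
    hits = sum(predict(x, weights, biases) == y for x, y in samples)
    return hits / len(samples)


# Hypothetical frozen features: classes 0 and 1 are retained, class 2 is forgotten.
retain = [([1.0, 0.1, 0.0], 0), ([0.9, 0.0, 0.2], 0),
          ([0.1, 1.0, 0.0], 1), ([0.0, 0.8, 0.1], 1)]
forget = [([0.0, 0.1, 1.0], 2), ([0.2, 0.0, 0.9], 2)]

# Hypothetical original head: one weight row per class.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
biases = [0.0, 0.0, 0.0]

# Edit only the head: push the forget-class bias far down.
# The feature vectors above are never touched.
edited_biases = list(biases)
edited_biases[2] = -1e9

forget_before = accuracy(forget, weights, biases)
forget_after = accuracy(forget, weights, edited_biases)
retain_after = accuracy(retain, weights, edited_biases)
print(forget_before, forget_after, retain_after)
```

In this toy setup forget accuracy falls from 1.0 to 0.0 and retain accuracy stays at 1.0, even though the features separate the forgotten class exactly as well as before. An output-level check alone would report success.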

Method

The proposed MU methods use a class-mean features (CMF) classifier to explicitly enforce alignment between features and classifiers, reducing the forget-set information that remains in representations.
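
This summary does not give the study's exact CMF formulation. The sketch below assumes one common reading: the head has no free weights and assigns each input to the class whose mean feature vector is nearest. The function names, the squared Euclidean distance, and the toy features are all assumptions made for illustration.

```python
def class_means(samples):
    """Average the feature vectors of each class; samples are (features, label) pairs."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def cmf_predict(features, means):
    """Assign the class whose mean feature vector is closest (squared Euclidean distance)."""
    def sq_dist(mean):
        return sum((a - b) ** 2 for a, b in zip(features, mean))
    return min(means, key=lambda label: sq_dist(means[label]))


# Hypothetical penultimate-layer features for two classes.
train_features = [([1.0, 0.0], 0), ([0.8, 0.2], 0),
                  ([0.0, 1.0], 1), ([0.2, 0.8], 1)]

means = class_means(train_features)
prediction = cmf_predict([0.7, 0.3], means)
print(prediction)
```

Under this reading, the head is computed directly from the features, so there are no separate classifier weights that can drift out of alignment with them. Changing the predictions for a class would require changing the features themselves.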

In practice

Evaluate MU methods using representation-level analysis.
Consider linear probing to assess hidden feature discriminability.
Explore CMF-based classifiers for robust unlearning.
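
A linear probe of the kind recommended above might look like the following sketch. It assumes the features have already been extracted from a frozen, supposedly unlearned model; the toy feature vectors, the hyperparameters, and the helper names are hypothetical. It trains a fresh softmax-regression head with batch gradient descent and checks whether the forgotten class can still be recovered on held-out features.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(x, W, b):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_linear_probe(samples, num_classes, epochs=200, lr=0.5):
    """Fit a fresh linear head on frozen features; the features are never updated."""
    dim = len(samples[0][0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    n = len(samples)
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in samples:
            probs = softmax(linear_logits(x, W, b))
            for k in range(num_classes):
                err = probs[k] - (1.0 if k == y else 0.0)  # cross-entropy gradient
                grad_b[k] += err
                for j in range(dim):
                    grad_W[k][j] += err * x[j]
        for k in range(num_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(dim):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b


def probe_accuracy(samples, W, b):
    hits = 0
    for x, y in samples:
        logits = linear_logits(x, W, b)
        hits += max(range(len(logits)), key=logits.__getitem__) == y
    return hits / len(samples)


# Hypothetical frozen features from an "unlearned" model; class 2 was the forget class.
probe_train = [([1.0, 0.1, 0.0], 0), ([0.9, 0.0, 0.2], 0),
               ([0.1, 1.0, 0.0], 1), ([0.0, 0.8, 0.1], 1),
               ([0.0, 0.1, 1.0], 2), ([0.2, 0.0, 0.9], 2)]
held_out_forget = [([0.1, 0.0, 0.8], 2), ([0.0, 0.2, 0.9], 2)]

W, b = train_linear_probe(probe_train, num_classes=3)
recovered = probe_accuracy(held_out_forget, W, b)
print(recovered)
```

High probe accuracy on held-out forget-class features indicates that the representation still separates that class, whatever the model's own head outputs. A probe accuracy near chance would be the more convincing sign of erasure.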

Topics

Machine Unlearning
Internal Representations
Feature-Classifier Misalignment
Neural Collapse
Class-Mean Features Classifier

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.