InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset
Summary
InstantForget is a method for update-free backdoor unlearning: it removes malicious trigger behavior from deployed models at inference time without altering model parameters. The authors first audit a common projection assumption and find that it holds only for BadNets. On WaNet, Blended, and SIG triggers it fails, with Attack Success Rates (ASR) of 0.683, 0.888, and 0.941 respectively on CIFAR-10 with ResNet-18. They attribute this failure to a logit-triplet gap. InstantForget instead uses a clean-calibrated gated reset that flags anomalous features with a Mahalanobis score and shifts only the flagged features toward a neutral non-target representation. At a single fixed operating point, InstantForget reduces the average ASR to 0.071 across four non-adaptive CIFAR-10 triggers, achieves a detection AUROC of 0.981, and transfers to six of eight tested backbones.
Key takeaway
For AI security engineers deploying models that may carry backdoors, InstantForget offers an update-free mitigation that runs at inference time. It reaches an average ASR of 0.071 across four non-adaptive CIFAR-10 triggers and a detection AUROC of 0.981, without costly model retraining or access to triggered samples. Consider integrating this inference-time feature reset as a post-deployment defense.
Key insights
InstantForget enables update-free, inference-time backdoor unlearning by resetting anomalous features identified via a Mahalanobis score.
Principles
- Update-free backdoor unlearning is feasible at inference time.
- A logit-triplet gap explains where projection-based unlearning fails.
- Mahalanobis score flags anomalous triggered features.
Method
InstantForget employs a clean-calibrated gated reset. It flags anomalous features using a Mahalanobis score and shifts only these flagged features towards a neutral non-target representation, without model parameter updates.
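The gated reset described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the covariance regularizer `eps`, the 99th-percentile threshold, the blend factor `alpha`, and the use of the clean feature mean as the "neutral" representation are all assumptions made for this example.

```python
import numpy as np

def fit_clean_stats(clean_feats, eps=1e-3):
    """Estimate mean and regularized inverse covariance from clean features only."""
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + eps * np.eye(clean_feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(feats, mu, cov_inv):
    """Per-sample Mahalanobis distance to the clean feature distribution."""
    diff = feats - mu
    return np.sqrt(np.einsum("nd,de,ne->n", diff, cov_inv, diff))

def gated_reset(feats, mu, cov_inv, threshold, neutral, alpha=1.0):
    """Shift only flagged features toward `neutral`; leave the rest untouched."""
    scores = mahalanobis_score(feats, mu, cov_inv)
    flagged = scores > threshold
    out = feats.copy()
    out[flagged] = (1.0 - alpha) * feats[flagged] + alpha * neutral
    return out, flagged

# Toy demo with synthetic features (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 8))
mu, cov_inv = fit_clean_stats(clean)

# Assumed operating point: 99th percentile of clean calibration scores.
threshold = np.quantile(mahalanobis_score(clean, mu, cov_inv), 0.99)

# Five clean rows followed by five strongly shifted (anomalous) rows.
batch = np.vstack([clean[5:10], clean[:5] + 10.0])
reset, flagged = gated_reset(batch, mu, cov_inv, threshold, neutral=mu)
```

In a deployed model the features would come from an intermediate layer and the reset output would be passed to the remaining layers. The model weights are never modified, which is what makes the approach update-free.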
In practice
- Reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers.
- Achieves a trigger-detection AUROC of 0.981.
- Transfers to six of eight tested backbones.
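For readers less familiar with the two metrics reported above, the sketch below shows how they are commonly computed. It assumes the usual conventions, which this summary does not spell out: ASR is measured on triggered inputs whose true label differs from the attacker's target class, and AUROC is the rank-based (Mann-Whitney) estimate over anomaly scores. Function and argument names are hypothetical.

```python
import numpy as np

def attack_success_rate(preds, true_labels, target_class):
    """Fraction of triggered, non-target-class inputs predicted as the target."""
    preds = np.asarray(preds)
    true_labels = np.asarray(true_labels)
    eligible = true_labels != target_class
    if not eligible.any():
        raise ValueError("no inputs with a non-target true label")
    return float((preds[eligible] == target_class).mean())

def auroc(scores, is_triggered):
    """Probability that a triggered input scores above a clean one (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    is_triggered = np.asarray(is_triggered, dtype=bool)
    pos, neg = scores[is_triggered], scores[~is_triggered]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need both triggered and clean samples")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))
```

A lower ASR means fewer triggered inputs reach the attacker's target class, while an AUROC near 1 means the anomaly score separates triggered from clean inputs almost perfectly.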
Topics
- Backdoor Unlearning
- Inference-Time Security
- Feature Reset
- Mahalanobis Score
- Attack Success Rate
- Model Security
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.