InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

2026-06-14 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

InstantForget is an update-free backdoor unlearning method that removes malicious trigger behavior from deployed models at inference time without altering model parameters. The research first audits a common projection assumption and finds that it holds only for BadNets. On CIFAR-10 with ResNet-18, projection fails against WaNet, Blended, and SIG triggers, which retain high Attack Success Rates (ASR) of 0.683, 0.888, and 0.941 respectively. The authors attribute this failure to a logit-triplet gap. InstantForget instead uses a clean-calibrated gated reset: a Mahalanobis score flags anomalous features, and only the flagged features are shifted toward a neutral non-target representation. With a single fixed operating point, InstantForget reduces the average ASR to 0.071 across four non-adaptive CIFAR-10 triggers, achieves 0.981 detection AUROC, and transfers to six of eight tested backbones.

Key takeaway

For AI security engineers deploying models that may carry backdoors, InstantForget offers an update-free mitigation applied at inference time, with no model retraining and no access to triggered samples. The reported results are an average ASR of 0.071 across four non-adaptive CIFAR-10 triggers and a detection AUROC of 0.981. Consider this inference-time feature reset as a post-deployment defense, keeping in mind that it transferred to six of the eight backbones tested.

Key insights

InstantForget enables update-free, inference-time backdoor unlearning by resetting anomalous features identified via a Mahalanobis score.

Principles

Update-free backdoor unlearning is feasible at inference time.
Logit-triplet gap predicts projection unlearning failures.
Mahalanobis score flags anomalous triggered features.

Method

InstantForget employs a clean-calibrated gated reset. It flags anomalous features using a Mahalanobis score and shifts only these flagged features towards a neutral non-target representation, without model parameter updates.
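
The gated reset described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: this summary does not specify how the statistics are fitted, how the threshold is calibrated, or what the neutral representation is. The sketch assumes a single Gaussian fitted to clean features, a threshold taken as a quantile of clean scores, and the clean mean as a stand-in for the neutral non-target representation. All function names and parameters (`fit_clean_stats`, `mahalanobis_sq`, `gated_reset`, `eps`, `alpha`) are hypothetical.

```python
import numpy as np

def fit_clean_stats(clean_feats, eps=1e-3):
    """Estimate the mean and a regularized inverse covariance from clean features.

    Assumption: one Gaussian over all clean features; eps keeps the
    covariance matrix invertible.
    """
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + eps * np.eye(clean_feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_sq(feats, mu, cov_inv):
    """Squared Mahalanobis distance of each row of feats from the clean mean."""
    d = feats - mu
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def gated_reset(feats, mu, cov_inv, threshold, neutral, alpha=1.0):
    """Shift only the flagged rows toward `neutral`; leave the rest untouched.

    Assumption: the shift is a linear interpolation with strength alpha
    (alpha=1.0 replaces the flagged feature with `neutral`).
    """
    scores = mahalanobis_sq(feats, mu, cov_inv)
    flagged = scores > threshold
    out = feats.copy()
    out[flagged] = (1.0 - alpha) * feats[flagged] + alpha * neutral
    return out, flagged

# Usage on synthetic data: calibrate on clean features only, then gate a batch.
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 8))          # stand-in for penultimate-layer features
mu, cov_inv = fit_clean_stats(clean)
# Assumed calibration: flag anything beyond the 99th percentile of clean scores.
threshold = np.quantile(mahalanobis_sq(clean, mu, cov_inv), 0.99)
batch = np.vstack([mu, mu + 10.0])         # one typical row, one far outlier
out, flagged = gated_reset(batch, mu, cov_inv, threshold, neutral=mu)
```

Because the gate is calibrated on clean data alone, nothing in this sketch needs triggered samples or parameter updates. The model's classifier head would simply consume `out` in place of the raw features.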

In practice

Reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers.
Achieves 0.981 AUROC for trigger detection.
Transfers to six of eight tested backbones.
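
For readers less familiar with the two metrics reported above, here is a small sketch of how they are commonly computed. It assumes the usual convention that ASR is measured on triggered inputs whose true label differs from the attacker's target, and it computes AUROC as the probability that a triggered input scores higher than a clean one. The function names are hypothetical and the source's exact evaluation protocol may differ.

```python
import numpy as np

def attack_success_rate(preds, true_labels, target_label):
    """Fraction of triggered, non-target-class inputs predicted as the target.

    Assumption: samples whose true label already equals the target are excluded.
    """
    preds = np.asarray(preds)
    true_labels = np.asarray(true_labels)
    mask = true_labels != target_label
    return float(np.mean(preds[mask] == target_label))

def auroc(scores_pos, scores_neg):
    """AUROC via pairwise comparison; ties count as half a win."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Toy inputs: predictions on triggered images, and detector scores.
asr = attack_success_rate(preds=[0, 0, 1, 2], true_labels=[1, 2, 3, 0], target_label=0)
auc = auroc(scores_pos=[0.9, 0.8], scores_neg=[0.1, 0.85])
```

A lower ASR means the backdoor fires less often, while a higher AUROC means the anomaly score separates triggered inputs from clean ones more reliably.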

Topics

Backdoor Unlearning
Inference-Time Security
Feature Reset
Mahalanobis Score
Attack Success Rate
Model Security

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.