Curvature-Guided Module Localization for Low-Rank Detoxification of Backdoored Large Language Models
Summary
A new mechanistically guided weight-space repair framework addresses post hoc detoxification of backdoored large language models (LLMs) without requiring full network retraining. This approach first localizes modules responsible for propagating trigger-induced malicious behavior using activation patching combined with Fisher/K-FAC curvature analysis. Subsequently, the framework applies targeted low-rank repair exclusively to these identified, most influential modules. Evaluated on poisoned variants of Llama-3.2-1B-Instruct, with triggers placed at the beginning, middle, and end of prompts, the method effectively suppresses trigger-conditioned malicious responses. Crucially, it preserves the model's benign behavior, suggesting that LLM backdoor removal can be framed as a localized structural repair problem rather than solely a broad behavioral alignment issue.
Key takeaway
For AI security engineers mitigating backdoors in deployed LLMs, this research offers a post hoc detoxification strategy that avoids costly full-model retraining. Use activation patching and Fisher/K-FAC curvature analysis to pinpoint the modules that propagate trigger-induced behavior, then apply low-rank repairs to those modules only. In the reported evaluation on poisoned Llama-3.2-1B-Instruct variants, this suppressed trigger-conditioned malicious outputs while preserving benign behavior, which could streamline incident response for compromised models.
Key insights
Curvature-guided module localization and low-rank repair effectively detoxify backdoored LLMs post hoc, preserving benign behavior.
Principles
- Backdoor removal can be framed as localized structural repair.
- Trigger behavior propagates via specific modules.
- Post hoc repair avoids full retraining.
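The second principle is what activation patching is meant to test. The following is a minimal sketch on a toy residual network, not the paper's setup: the three modules, their weights, the `readout` direction, and the helper names `run` and `patching_effects` are all illustrative assumptions. It runs a triggered input, swaps in one module's output from a clean run, and measures how much the "malicious" readout drops.

```python
import numpy as np

def run(modules, x, patch=None):
    """Residual forward pass: h <- h + relu(W @ h) for each module.

    `patch` optionally maps a module index to a cached output that
    replaces that module's own output.
    """
    h = np.array(x, dtype=float)
    cache = []
    for i, W in enumerate(modules):
        out = np.maximum(W @ h, 0.0)
        if patch is not None and i in patch:
            out = patch[i]  # substitute the clean-run activation
        cache.append(out)
        h = h + out
    return h, cache

def patching_effects(modules, readout, x_clean, x_trig):
    """Drop in the readout when each module is patched with clean activations."""
    _, clean_cache = run(modules, x_clean)
    h_trig, _ = run(modules, x_trig)
    base = readout @ h_trig
    effects = []
    for i in range(len(modules)):
        h_patched, _ = run(modules, x_trig, patch={i: clean_cache[i]})
        effects.append(float(base - readout @ h_patched))
    return effects

# Toy setup (hypothetical): dim 0 = content, dim 1 = trigger, dim 2 = malicious.
W0 = np.zeros((3, 3)); W0[0, 0] = 0.5   # touches content only
W1 = np.zeros((3, 3)); W1[2, 1] = 2.0   # writes trigger into the malicious dim
W2 = np.zeros((3, 3)); W2[2, 2] = 0.1   # mildly amplifies the malicious dim
modules = [W0, W1, W2]
readout = np.array([0.0, 0.0, 1.0])
x_clean = np.array([1.0, 0.0, 0.0])
x_trig = np.array([1.0, 1.0, 0.0])

effects = patching_effects(modules, readout, x_clean, x_trig)
most_responsible = int(np.argmax(effects))
```

In this toy case the largest effect belongs to the module that routes the trigger feature into the malicious direction. In a real transformer the same idea would be applied to attention and MLP block outputs, typically via forward hooks.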
Method
Localize trigger-propagating modules via activation patching and Fisher/K-FAC curvature analysis. Apply targeted low-rank repair only to these most influential modules.
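The summary does not say how the curvature signal is turned into a module ranking, so the following is only one plausible sketch. It computes the two K-FAC factors for a linear layer (the input second moment A and the output-gradient second moment G, whose Kronecker product approximates that layer's Fisher block) and scores each module by the trace of that approximation, tr(A) * tr(G). The scoring rule, the module names, and the tiny hand-written statistics are assumptions for illustration.

```python
import numpy as np

def kfac_factors(acts, grads):
    """K-FAC factors for one linear layer.

    acts:  (n, d_in)  layer inputs, one row per sample
    grads: (n, d_out) loss gradients w.r.t. the layer's outputs
    """
    n = acts.shape[0]
    A = acts.T @ acts / n     # input second moment
    G = grads.T @ grads / n   # output-gradient second moment
    return A, G

def curvature_score(acts, grads):
    """Trace of the K-FAC Fisher approximation: tr(A kron G) = tr(A) * tr(G)."""
    A, G = kfac_factors(acts, grads)
    return float(np.trace(A) * np.trace(G))

def top_k_modules(stats, k):
    """Rank modules by curvature score (assumed rule) and keep the top k."""
    scores = {name: curvature_score(a, g) for name, (a, g) in stats.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical per-module statistics, e.g. collected on triggered prompts.
stats = {
    "layers.0.mlp": (np.array([[1.0, 0.0], [0.0, 1.0]]),
                     np.array([[0.1, 0.0], [0.0, 0.1]])),
    "layers.1.attn": (np.array([[1.0, 1.0], [1.0, -1.0]]),
                      np.array([[2.0, 0.0], [0.0, 2.0]])),
    "layers.2.mlp": (np.array([[1.0, 0.0], [1.0, 0.0]]),
                     np.array([[0.5, 0.5], [0.5, -0.5]])),
}

selected, scores = top_k_modules(stats, k=2)
```

Only the modules in `selected` would then receive a repair update. A fuller pipeline might combine this score with activation-patching effects, but how the two signals are combined is not specified in this summary.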
In practice
- Evaluated on poisoned Llama-3.2-1B-Instruct variants.
- Covers triggers at the beginning, middle, and end of prompts.
- Repair backdoors without full retraining.
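To make "low-rank repair" concrete, here is a minimal sketch for a single selected linear module. It is a stand-in under stated assumptions, not the paper's procedure: the update is fitted in closed form by least squares so that triggered inputs map to the outputs of their clean counterparts, then truncated to rank `r` with an SVD. A trained LoRA-style adapter would be another way to obtain the factors. The function name `low_rank_repair` and the toy weights are hypothetical.

```python
import numpy as np

def low_rank_repair(W, X_clean, X_trig, r):
    """Fit a rank-r update B @ A so that (W + B @ A) maps triggered inputs
    to the clean outputs while leaving clean inputs unchanged.

    X_clean, X_trig: (d_in, n) matrices, columns are paired samples.
    """
    X = np.hstack([X_clean, X_trig])
    Y = np.hstack([W @ X_clean, W @ X_clean])   # desired outputs
    delta = (Y - W @ X) @ np.linalg.pinv(X)     # minimum-norm dense update
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    B = U[:, :r] * S[:r]                        # (d_out, r)
    A = Vt[:r]                                  # (r, d_in)
    return B, A

# Toy module (hypothetical): input dim 1 is the trigger feature,
# output dim 2 is the malicious direction.
W = np.zeros((3, 3))
W[0, 0] = 1.0
W[2, 1] = 2.0   # the backdoor pathway

X_clean = np.array([[1.0, 2.0],
                    [0.0, 0.0],
                    [0.0, 0.0]])
X_trig = X_clean.copy()
X_trig[1, :] = 1.0   # same prompts with the trigger switched on

B, A = low_rank_repair(W, X_clean, X_trig, r=1)
W_repaired = W + B @ A
```

Because only the `r * (d_in + d_out)` entries of `B` and `A` are introduced, and only for the selected modules, the rest of the network's weights stay untouched.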
Topics
- LLM Security
- Backdoor Attacks
- Model Detoxification
- Low-Rank Adaptation
- Curvature Analysis
- Llama-3.2-1B-Instruct
Best for: Research Scientist, CTO, AI Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.