Patcher: Post-Hoc Patching of Backdoored Large Language Models

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Patcher is a post-hoc defense framework for large language models that carry jailbreak backdoors: hidden triggers implanted by poisoning safety-alignment data. Unlike existing defenses that demand comprehensive attack information or multiple triggered examples, Patcher needs only a single reported failure case and access to the model's parameters. It works in two stages. First, it localizes the backdoor trigger by computing response-conditioned, gradient-based saliency scores and applying adaptive clustering to them. Second, it patches the model with a constrained fine-tuning objective that breaks the trigger-response association, while KL-divergence constraints preserve benign-task utility and robustness to non-triggered jailbreak attacks. Extensive evaluations indicate that Patcher localizes triggers, neutralizes backdoors, maintains model utility, and resists adaptive attacks.

Key takeaway

For AI security engineers deploying or maintaining large language models, Patcher offers a practical route to post-hoc backdoor remediation. Starting from a single reported safety failure, you can localize the hidden trigger and patch the model without collecting extensive attack data. The patch neutralizes the backdoor while preserving model utility, hardening deployed LLMs against training-time attacks.

Key insights

Patcher repairs backdoored LLMs post-hoc using a single failure case, localizing triggers via saliency and patching with constrained fine-tuning.

Method

Patcher localizes triggers using response-conditioned gradient-based saliency scores and adaptive clustering. It then patches the model via constrained fine-tuning, breaking trigger-response associations while maintaining utility through KL-divergence constraints.
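
The two stages can be illustrated with a toy sketch. This is not the paper's implementation: the per-token saliency scores are assumed to be precomputed (for example, as gradient norms of the reported response's loss with respect to the input embeddings), plain two-cluster k-means stands in for the unspecified adaptive clustering, and the loss form, the weight `lam`, the function names, and the example trigger token "cf" are all hypothetical.

```python
import math

def localize_trigger(tokens, saliency):
    """Return the tokens whose saliency falls in the higher of two 1-D clusters.

    ASSUMPTION: `saliency` holds one precomputed score per token. Two-cluster
    k-means is a stand-in for the adaptive clustering named in the summary.
    """
    lo, hi = min(saliency), max(saliency)
    if lo == hi:  # flat scores: no token stands out
        return []
    for _ in range(100):
        # Assign each score to the nearer centroid, then recompute centroids.
        high = [s for s in saliency if abs(s - hi) < abs(s - lo)]
        low = [s for s in saliency if abs(s - hi) >= abs(s - lo)]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # converged
            break
        lo, hi = new_lo, new_hi
    return [t for t, s in zip(tokens, saliency) if abs(s - hi) < abs(s - lo)]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def patch_loss(p_safe_on_trigger, reference_pairs, lam=1.0):
    """Hypothetical constrained objective for the patching stage.

    p_safe_on_trigger: probability the patched model gives the safe response
        to the triggered prompt (raising it breaks the trigger-response link).
    reference_pairs: (original, patched) output distributions on prompts whose
        behaviour should not drift, e.g. benign tasks and non-triggered
        jailbreak attempts.
    lam: assumed weight on the KL constraint.
    """
    unlearn = -math.log(p_safe_on_trigger)
    drift = sum(kl_divergence(ref, new) for ref, new in reference_pairs)
    return unlearn + lam * drift / len(reference_pairs)

# Toy failure case: one token carries almost all of the saliency.
tokens = ["how", "do", "I", "cf", "pick", "a", "lock"]
scores = [0.11, 0.08, 0.05, 0.93, 0.14, 0.04, 0.19]
print(localize_trigger(tokens, scores))
```

In a real pipeline the scores would come from backpropagating through the model and the loss would be evaluated on model logits; the sketch only shows how the pieces relate: isolate the outlying tokens first, then trade off unlearning the trigger against drift from the original model.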

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.