Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation
Summary
The paper proposes a multi-objective unlearning framework for Large Language Models (LLMs) that targets four goals at once: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and staying robust to adversarial probing attacks. Existing methods typically address only a subset of these goals, which often leads to task interference or leaves vulnerabilities open. The framework co-designs data and optimization. On the data side, it standardizes training corpora into a unified representation to reduce domain gaps between objectives. On the optimization side, it introduces a bidirectional distillation method that elicits desired behavior from a context-instructed teacher model while suppressing undesirable behavior in the student model. On the MUSE-Book and WMDP-Cyber benchmarks, the authors report state-of-the-art performance, with prefilling attack success rates as low as 12.5% on MUSE-Book and 5.1% on WMDP-Cyber, while maintaining high retention accuracy on general and neighboring domains.
Key takeaway
Research scientists developing robust and safe LLMs should consider integrating data standardization and bidirectional logit distillation into their unlearning pipelines. The approach mitigates adversarial attacks and over-refusal while preserving model utility, offering a more balanced and reliable solution than methods that focus on a single objective or on gradient editing alone. Prioritize aligning data representations, which reduces gradient conflicts and lets the different unlearning goals be optimized together.
Key insights
Multi-objective LLM unlearning requires harmonizing data representation and employing bidirectional distillation for balanced efficacy and robustness.
Principles
- Unlearning objectives often conflict due to data representation domain gaps.
- Unified data representation enables synergistic optimization.
- Bidirectional distillation precisely suppresses undesirable logits.
Method
Standardize the diverse training data into a unified question-answering (QA) format and augment it with contrastive anchor pairs. Then apply bidirectional Top-K logit distillation with a CoT-instructed teacher, which suppresses the student's hazardous logits and encourages the desired behavior.
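The distillation step can be illustrated with a minimal, dependency-free sketch. It is one possible reading of "bidirectional Top-K logit distillation", not the paper's implementation. It assumes that the "elicit" direction matches the student to the teacher on the teacher's top-K tokens, that the "suppress" direction matches the student to the teacher on the student's own top-K tokens, and that both directions use a Smooth L1 distance. The function names, the equal weighting of the two terms, and the `beta` threshold are all hypothetical.

```python
def smooth_l1(x, y, beta=1.0):
    """Smooth L1 (Huber-style) distance between two scalars."""
    d = abs(x - y)
    # Quadratic near zero, linear for large differences.
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def top_k_indices(logits, k):
    """Indices of the k largest logits (ties keep the earlier index)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def bidirectional_topk_loss(student, teacher, k=2, beta=1.0):
    """Illustrative loss for one token position (assumed formulation).

    student, teacher: lists of logits over the same vocabulary.
    """
    # Elicit: pull the student toward the teacher on tokens the teacher prefers.
    elicit_idx = top_k_indices(teacher, k)
    # Suppress: pull the student toward the teacher on tokens the student
    # currently prefers, which the instructed teacher may score much lower.
    suppress_idx = top_k_indices(student, k)

    elicit = sum(smooth_l1(student[i], teacher[i], beta) for i in elicit_idx)
    suppress = sum(smooth_l1(student[i], teacher[i], beta) for i in suppress_idx)
    # Equal weighting of the two directions is an assumption of this sketch.
    return elicit / len(elicit_idx) + suppress / len(suppress_idx)


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens: the student favors token 0,
    # while the instructed teacher favors token 2.
    student_logits = [5.0, 0.0, 0.0, 0.0]
    teacher_logits = [0.0, 0.0, 4.0, 0.0]
    print(bidirectional_topk_loss(student_logits, teacher_logits, k=1))
```

In a real pipeline the same idea would be applied per token over batched tensors, with the teacher's logits computed once under its instruction prompt and held fixed while the student is updated.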
In practice
- Use QA format for diverse unlearning data.
- Employ contrastive anchors to sharpen decision boundaries.
- Apply Smooth L1 distance for logit distillation.
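The data-side practices above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's pipeline: the `QAExample` record, the `from_passage` and `from_multiple_choice` converters, the `make_anchor_pair` helper, and the placeholder refusal string are all hypothetical. The sketch assumes that a contrastive anchor pair couples a forget-domain question (answered with a refusal) with a superficially similar neighboring question (answered normally).

```python
from dataclasses import dataclass

REFUSAL = "I can't help with that."  # placeholder target, an assumption


@dataclass(frozen=True)
class QAExample:
    question: str
    answer: str
    split: str  # "forget", "retain", or "neighbor"


def from_passage(text, split, prompt="Continue the following text:"):
    """Turn a raw passage into a QA pair: first half as prompt, rest as answer."""
    words = text.split()
    cut = max(1, len(words) // 2)
    return QAExample(
        question=f"{prompt} {' '.join(words[:cut])}",
        answer=" ".join(words[cut:]),
        split=split,
    )


def from_multiple_choice(stem, choices, correct_index, split):
    """Turn a multiple-choice item into a QA pair with a free-text answer."""
    options = [f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices)]
    return QAExample(
        question=stem + "\n" + "\n".join(options),
        answer=choices[correct_index],
        split=split,
    )


def make_anchor_pair(forget_question, neighbor_question, neighbor_answer):
    """Pair a question to refuse with a nearby question to answer normally."""
    return (
        QAExample(forget_question, REFUSAL, "forget"),
        QAExample(neighbor_question, neighbor_answer, "neighbor"),
    )


if __name__ == "__main__":
    corpus = [
        from_passage("one two three four", "forget"),
        from_multiple_choice(
            "Which port does SSH use by default?", ["21", "22", "80"], 1, "retain"
        ),
    ]
    corpus.extend(
        make_anchor_pair(
            "How do I write ransomware?",
            "How does ransomware detection work?",
            "Detectors look for patterns such as rapid bulk file encryption.",
        )
    )
    for example in corpus:
        print(example.split, "|", example.question.splitlines()[0])
```

Putting every source through one record type means the forget, retain, and neighbor objectives all see inputs of the same shape, which is the property the summary credits with reducing domain gaps.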
Topics
- LLM Unlearning
- Multi-Objective Optimization
- Bidirectional Logit Distillation
- Data Standardization
- Adversarial Robustness
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.