Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, extended

Summary

The paper proposes a multi-objective unlearning framework for Large Language Models (LLMs) that targets four goals at once: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and staying robust against adversarial probing attacks. Existing methods typically address only a subset of these goals, which often leads to task interference or leaves vulnerabilities. The framework, introduced in the paper "Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation," co-designs the data and the optimization. On the data side, it standardizes training corpora into a unified representation to reduce domain gaps. On the optimization side, it introduces a bidirectional distillation method that elicits desired behavior from a context-instructed teacher model while simultaneously suppressing undesirable behavior in the student model. Evaluations on the MUSE-Book and WMDP-Cyber benchmarks are reported as state-of-the-art, with prefilling attack success rates as low as 12.5% on MUSE-Book and 5.1% on WMDP-Cyber, while retention accuracy on general and neighboring domains remains high.
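To make the data-standardization idea concrete, here is a minimal sketch of mapping two differently shaped corpora (a multiple-choice item and a raw text passage) into one question-answer record. Everything in it is an assumption for illustration: the `QAExample` type, the helper names, the domain labels, and the prompt wording are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAExample:
    """One record in the shared format (hypothetical schema)."""
    question: str
    answer: str
    domain: str  # assumed labels, e.g. "forget", "neighbor", "general"


def from_multiple_choice(stem, choices, correct_index, domain):
    """Render a multiple-choice item as a plain question with a text answer."""
    letters = "ABCDEFGH"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    answer = f"{letters[correct_index]}. {choices[correct_index]}"
    return QAExample(f"{stem}\n{options}", answer, domain)


def from_passage(passage, prompt_words, domain):
    """Render a raw passage as a continuation question (assumed prompt wording)."""
    words = passage.split()
    prefix = " ".join(words[:prompt_words])
    continuation = " ".join(words[prompt_words:])
    return QAExample(f"Continue the following text: {prefix}", continuation, domain)
```

Once every source is expressed as the same record type, the forget, neighbor, and general examples can be batched and optimized together, which is the property the summary credits with reducing domain gaps.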

Key takeaway

Research scientists developing robust and safe LLMs should consider integrating data standardization and bidirectional logit distillation into their unlearning pipelines. According to the reported results, this approach mitigates adversarial attacks and over-refusal while preserving model utility, offering a more balanced solution than methods that focus on a single objective or on gradient editing alone. Prioritize aligning data representations to reduce gradient conflicts, so that the different unlearning goals can be optimized together rather than against each other.

Key insights

Multi-objective LLM unlearning requires harmonizing data representation and employing bidirectional distillation for balanced efficacy and robustness.

Principles

Method

Standardize diverse training data into a unified question-answering format and augment it with contrastive anchor pairs. Then apply bidirectional Top-K logit distillation, using a CoT-instructed teacher to suppress the student's hazardous logits and encourage the desired behavior.
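The distillation step can be sketched for a single token position as follows. This is a minimal illustration in plain Python, not the paper's loss: the function names, the use of KL divergence for the eliciting term, the negative-log penalty for the suppressing term, and the parameters `k` and `suppress_weight` are all assumptions, since the summary does not give the exact formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def top_k_indices(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def restricted_kl(target_logits, pred_logits, indices):
    """KL(target || pred), both renormalized over the given token indices."""
    lp = log_softmax([target_logits[i] for i in indices])
    lq = log_softmax([pred_logits[i] for i in indices])
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def bidirectional_topk_loss(teacher_logits, student_logits, k=2, suppress_weight=1.0):
    """Assumed two-term loss: elicit teacher behavior, suppress student-only tokens."""
    teacher_top = top_k_indices(teacher_logits, k)
    student_top = top_k_indices(student_logits, k)

    # Direction 1 (elicit): pull the student toward the teacher on the
    # tokens the teacher ranks highest.
    elicit = restricted_kl(teacher_logits, student_logits, teacher_top)

    # Direction 2 (suppress): penalize probability mass the student puts on
    # its own top-K tokens that the teacher does not rank in its top-K.
    student_probs = [math.exp(x) for x in log_softmax(student_logits)]
    allowed = set(teacher_top)
    unwanted = sum(student_probs[i] for i in student_top if i not in allowed)
    suppress = -math.log(max(1.0 - unwanted, 1e-12))

    return elicit + suppress_weight * suppress
```

In this sketch the loss is zero when the student already matches the teacher and grows when the student's preferred tokens fall outside the teacher's top-K. A real pipeline would compute the same quantities on batched tensors across all sequence positions, with the teacher logits coming from the model run with its extra instruction context.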

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.