Detoxifying LLMs via Representation Erasure-Based Preference Optimization
Summary
Representation Erasure-based Preference Optimization (REPO) is a method for detoxifying large language models (LLMs). It targets a weakness of prior defenses: they break down under adversarial prompting and relearning attacks. Existing methods such as DPO and NPO reduce harmful continuations only superficially. REPO instead reformulates detoxification as a token-level preference problem and trains on preference data with an objective that forces the representations of toxic continuations to converge toward those of their benign counterparts. Mechanistic analysis shows that REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Evaluated on GPT-2 Small, GPT-2 Medium, and Gemma 2B, REPO achieves state-of-the-art robustness against relearning attacks and enhanced Greedy Coordinate Gradient (GCG) jailbreaks, outperforming other representation-based and output-based methods.
Key takeaway
For research scientists and engineers developing safer LLMs, REPO offers an approach to detoxification that goes beyond suppressing outputs. Because it erases toxic representations at the token level, a detoxified model resists advanced jailbreaks and relearning attacks more effectively. Consider applying REPO's principle of deep, localized representational edits so that safety interventions are durable rather than easily bypassed.
Key insights
REPO detoxifies LLMs by erasing toxic representations at the token level, which makes the result robust to adaptive attacks.
Principles
- Targeting internal representations yields more durable unlearning.
- Token-level granularity is critical for precise, localized edits.
- Making retain and forget representations adversarially indistinguishable enhances robustness.
Method
REPO combines two token-level components: anchoring to a frozen reference model on benign text, and an adversarial objective that makes the representations of retain and forget tokens indistinguishable.
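To make the two components concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the summary does not give the exact loss, so this assumes a generic domain-adversarial form in which the model minimizes drift from the reference on benign tokens while maximizing a token-level discriminator's error. The function names, the squared-distance anchor, and the weight `lam` are all illustrative assumptions.

```python
import math

def anchor_loss(h_model, h_ref):
    """Mean squared L2 distance between per-token hidden states on benign text.

    h_model: hidden states from the model being edited, one vector per token.
    h_ref:   hidden states from the frozen reference model on the same tokens.
    """
    total = 0.0
    for hm, hr in zip(h_model, h_ref):
        total += sum((a - b) ** 2 for a, b in zip(hm, hr))
    return total / len(h_model)

def discriminator_loss(probs, labels):
    """Per-token binary cross-entropy (assumed label convention: 1 = forget, 0 = retain).

    probs: the discriminator's predicted probability that each token is a forget token.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(probs)

def model_objective(h_model, h_ref, probs, labels, lam=1.0):
    """Assumed combined objective for the model (lower is better).

    The model stays close to the reference on benign tokens and is rewarded
    when the discriminator cannot separate retain from forget tokens.
    `lam` is a hypothetical trade-off weight.
    """
    return anchor_loss(h_model, h_ref) - lam * discriminator_loss(probs, labels)

# Toy values: two benign tokens with 2-dimensional hidden states.
h_ref = [[1.0, 0.0], [0.0, 1.0]]
h_model = [[1.0, 0.0], [0.0, 0.5]]
# A discriminator that outputs 0.5 everywhere cannot tell the classes apart.
probs, labels = [0.5, 0.5], [0, 1]
objective = model_objective(h_model, h_ref, probs, labels)
```

In this toy setup the discriminator is trained separately to minimize `discriminator_loss`, while the model is trained on `model_objective`. The opposing signs are what push the two sets of representations toward each other.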
In practice
- Use pairwise supervision for robust detoxification.
- Apply discriminators at the token level for precise control.
- Anchor to a reference model to preserve benign behavior.
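As an illustration of how pairwise supervision can feed a token-level discriminator, the sketch below expands one preference pair into per-token labels. The `PreferencePair` structure, the label convention, and the whitespace tokenizer are assumptions made for this example, and the toxic continuation is replaced by placeholder tokens.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    retain: str  # benign continuation of the prompt
    forget: str  # toxic continuation of the same prompt (placeholders here)

def token_labels(pair):
    """Return one (token, label) entry per continuation token.

    Assumed convention: 0 = retain token, 1 = forget token.
    Whitespace splitting stands in for a real tokenizer.
    """
    examples = [(tok, 0) for tok in pair.retain.split()]
    examples += [(tok, 1) for tok in pair.forget.split()]
    return examples

pair = PreferencePair(
    prompt="The reviewer said the paper was",
    retain="clear and well argued",
    forget="<toxic_1> <toxic_2>",
)
examples = token_labels(pair)
```

Labeling every token, rather than assigning one label per sequence, is what gives the discriminator token-level granularity.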
Topics
- LLM Detoxification
- Representation Erasure
- Preference Optimization
- AI Safety
- Adversarial Robustness
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.