Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Source: cs.LG updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, extended

Summary

Representation Erasure-based Preference Optimization (REPO) is a method for detoxifying large language models (LLMs) that targets the fragility of prior defenses under adversarial prompting and relearning attacks. Existing methods such as DPO and NPO reduce harmful continuations only superficially; REPO instead reformulates detoxification as a token-level preference problem. Using preference data, its objective forces the representations of toxic continuations to converge toward those of their benign counterparts. Mechanistic analysis shows that REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Evaluated on GPT-2 Small, GPT-2 Medium, and Gemma 2B, REPO achieves state-of-the-art robustness against sophisticated threats, including relearning attacks and enhanced Greedy Coordinate Gradient (GCG) jailbreaks, and outperforms other representation- and output-based methods.

Key takeaway

For research scientists and engineers developing safer LLMs, REPO offers a detoxification approach that goes beyond superficial output suppression. By erasing toxic representations at the token level, models can achieve stronger resistance to advanced jailbreaks and relearning attacks. Consider applying REPO's principle of deep, localized representational edits so that safety interventions are durable rather than easily bypassed.

Key insights

REPO detoxifies LLMs by erasing toxic representations at the token level, which makes the detoxification resistant to adaptive attacks.

Principles

Method

REPO combines two components: token-level anchoring to a frozen reference model on benign text, and a token-granular adversarial objective that makes the representations of retain and forget tokens indistinguishable.
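The two components named above can be illustrated with a toy sketch. This is not the paper's objective: the summary does not give the loss formulas, so the squared-distance anchor, the linear probe standing in for the adversary, the uniform-target confusion loss, and every name here (`anchor_loss`, `confusion_loss`, `repo_style_loss`, `lam`) are assumptions chosen for illustration. Hidden states are plain NumPy arrays of shape (tokens, hidden_dim).

```python
import numpy as np

def anchor_loss(h_retain, h_ref_retain):
    """Mean squared distance, per token, between the tuned model's hidden
    states and the frozen reference's hidden states on benign (retain) text."""
    return float(np.mean(np.sum((h_retain - h_ref_retain) ** 2, axis=-1)))

def probe_probs(h, w, b):
    """Probability, under a linear probe (w, b), that each token state
    comes from the forget (toxic) set. The probe is a hypothetical adversary."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

def confusion_loss(h_retain, h_forget, w, b, eps=1e-8):
    """Cross-entropy between the probe's predictions and a uniform 0.5 target.
    It reaches its minimum, log(2), when the probe outputs 0.5 for every
    token, i.e. when it cannot tell retain tokens from forget tokens."""
    p = probe_probs(np.concatenate([h_retain, h_forget]), w, b)
    return float(-np.mean(0.5 * np.log(p + eps) + 0.5 * np.log(1.0 - p + eps)))

def repo_style_loss(h_retain, h_ref_retain, h_forget, w, b, lam=1.0):
    """Assumed combination: anchor term plus lam times the confusion term."""
    return anchor_loss(h_retain, h_ref_retain) + lam * confusion_loss(
        h_retain, h_forget, w, b
    )
```

In an actual adversarial setup the probe would be retrained to separate the two token sets while the model is updated to defeat it; this sketch only evaluates the model-side loss for a fixed probe.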

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.