Detoxifying LLMs via Representation Erasure-Based Preference Optimization

2026-03-02 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Representation Erasure-based Preference Optimization (REPO) is a method for detoxifying large language models (LLMs) that addresses the fragility of prior defenses against adversarial prompting and relearning attacks. Existing methods such as DPO and NPO reduce harmful continuations only superficially; REPO instead reformulates detoxification as a token-level preference problem. Its objective uses preference data to force the representations of toxic continuations to converge toward those of their benign counterparts. Mechanistic analysis shows that REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Evaluated on GPT-2 Small, GPT-2 Medium, and Gemma 2B, REPO achieves state-of-the-art robustness against sophisticated threats, including relearning attacks and enhanced Greedy Coordinate Gradient (GCG) jailbreaks, outperforming other representation- and output-based methods.

Key takeaway

For research scientists and engineers developing safer LLMs, REPO offers a robust approach to detoxification that moves beyond superficial output suppression. Erasing toxic representations at the token level can give models stronger resistance to advanced jailbreaks and relearning attacks. Consider adopting REPO's principle of deep, localized representational edits so that safety interventions are durable rather than easily bypassed.

Key insights

REPO detoxifies LLMs by erasing toxic representations at the token level, which makes the result resistant to adaptive attacks.

Principles

Targeting internal representations yields more durable unlearning.
Token-level granularity is critical for precise, localized edits.
Making retain and forget representations adversarially indistinguishable enhances robustness.

Method

REPO combines two token-level components: anchoring to a frozen reference model on benign text, and an adversarial objective that makes retain and forget token representations indistinguishable.
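
The two terms can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the squared-error form of the anchor, the linear logistic discriminator, the weighting `lam`, and every function name are hypothetical choices made for this example.

```python
import math

def anchor_loss(policy_hidden, reference_hidden):
    """Mean squared distance between policy and frozen-reference hidden
    states, averaged over benign (retain) tokens. Assumed form."""
    total = 0.0
    for h, r in zip(policy_hidden, reference_hidden):
        total += sum((a - b) ** 2 for a, b in zip(h, r))
    return total / len(policy_hidden)

def discriminator_prob(weights, bias, hidden):
    """Hypothetical linear token discriminator: P(token comes from the forget set)."""
    z = sum(w * x for w, x in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def discriminator_loss(weights, bias, retain_hidden, forget_hidden):
    """Binary cross-entropy over individual tokens (retain = 0, forget = 1)."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for h in retain_hidden:
        loss -= math.log(1.0 - discriminator_prob(weights, bias, h) + eps)
    for h in forget_hidden:
        loss -= math.log(discriminator_prob(weights, bias, h) + eps)
    return loss / (len(retain_hidden) + len(forget_hidden))

def model_objective(policy_retain, reference_retain, policy_forget,
                    weights, bias, lam=1.0):
    """Model-side loss: stay close to the reference on benign tokens while
    pushing the discriminator's error up (the adversarial term)."""
    adversarial = discriminator_loss(weights, bias, policy_retain, policy_forget)
    return anchor_loss(policy_retain, reference_retain) - lam * adversarial

# Separable representations are easy for the discriminator ...
separable = discriminator_loss([1.0], 0.0, [[-3.0]], [[3.0]])
# ... while identical retain/forget representations leave it at chance.
collapsed = discriminator_loss([1.0], 0.0, [[0.0]], [[0.0]])
```

In this toy setting `collapsed` is larger than `separable`, so the model-side objective is lower when the forget token's representation is indistinguishable from the retain token's. A real training loop would presumably alternate discriminator and model updates, as in standard adversarial training; that schedule is also an assumption here.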

In practice

Use pairwise supervision for robust detoxification.
Apply discriminators at the token level for precise control.
Anchor to a reference model to preserve benign behavior.
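
One way to read the first two points together: each preference pair supplies a benign and a toxic continuation of the same prompt, and every continuation token becomes one labelled example for a token-level discriminator. The sketch below only illustrates that data layout. The `PreferencePair` type, the whitespace tokenizer, and the 0/1 label convention are assumptions for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

RETAIN, FORGET = 0, 1  # assumed label convention

@dataclass
class PreferencePair:
    prompt: str
    benign: str  # preferred continuation (retain side)
    toxic: str   # dispreferred continuation (forget side)

def tokenize(text: str) -> List[str]:
    # Stand-in whitespace tokenizer; a real setup would use the model's own tokenizer.
    return text.split()

def token_examples(pair: PreferencePair) -> List[Tuple[int, str, int]]:
    """Return (position, token, label) for every continuation token.
    Prompt tokens are shared by both sides and receive no label."""
    offset = len(tokenize(pair.prompt))
    examples = []
    for label, continuation in ((RETAIN, pair.benign), (FORGET, pair.toxic)):
        for i, tok in enumerate(tokenize(continuation)):
            examples.append((offset + i, tok, label))
    return examples

pair = PreferencePair(
    prompt="The reviewer said the",
    benign="paper was unclear",
    toxic="author is an idiot",
)
examples = token_examples(pair)
```

Labelling per token rather than per sequence is what lets a discriminator act at the granularity the list above recommends; the hidden state at each labelled position would be its input.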

Topics

LLM Detoxification
Representation Erasure
Preference Optimization
AI Safety
Adversarial Robustness

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.