Detoxification for LLM: From Dataset Itself

2025-09-24 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Researchers from the State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, and University of Chinese Academy of Sciences introduce HSPD (Hierarchical Semantic-Preserving Detoxification), a pipeline for detoxifying large language model (LLM) pretraining datasets. Unlike most existing methods, which intervene at post-training or inference time, HSPD tackles toxicity at its source by rewriting the raw corpora. The pipeline uses SoCD (Soft Contrastive Decoding) to guide an LLM in localizing and rewriting toxic spans while preserving their meaning. On GPT2-XL, HSPD reduced Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20, which the authors report as state-of-the-art. Consistent improvements were also observed on LLaMA2-7B, OPT-6.7B, and Falcon-7B, indicating that toxicity can be suppressed with minimal impact on data utility and downstream task performance.
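To make the two metrics concrete, here is a minimal sketch of how TP and EMT are commonly computed in toxicity benchmarks such as RealToxicityPrompts: EMT averages the worst toxicity score per prompt, and TP is the share of prompts with at least one generation at or above a threshold. The 0.5 threshold, the function names, and the toy scores are assumptions for illustration; this summary does not state the paper's exact evaluation settings.

```python
from statistics import mean

def expected_max_toxicity(scores_per_prompt):
    """Mean over prompts of the highest toxicity score among that prompt's generations."""
    return mean(max(scores) for scores in scores_per_prompt)

def toxicity_probability(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one generation at or above the threshold."""
    return mean(1.0 if max(scores) >= threshold else 0.0 for scores in scores_per_prompt)

def relative_reduction(before, after):
    """Relative drop of a metric, e.g. 0.25 means a 25% reduction."""
    return (before - after) / before

# Toy toxicity scores (made up): 4 prompts, 3 generations each.
toy_scores = [
    [0.10, 0.70, 0.20],
    [0.05, 0.10, 0.30],
    [0.60, 0.55, 0.40],
    [0.20, 0.25, 0.15],
]
emt = expected_max_toxicity(toy_scores)
tp = toxicity_probability(toy_scores)

# Relative reductions implied by the GPT2-XL figures quoted in the summary.
tp_drop = relative_reduction(0.42, 0.18)
emt_drop = relative_reduction(0.43, 0.20)
print(f"EMT={emt:.4f} TP={tp:.2f} TP drop={tp_drop:.1%} EMT drop={emt_drop:.1%}")
```

Applied to the reported GPT2-XL numbers, the last helper shows that both metrics fall by more than half in relative terms.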

Key takeaway

For AI engineers and research scientists working on LLM safety, this work argues for moving detoxification upstream, into the pretraining data itself. Applying a pipeline such as HSPD to the corpus reduces the toxicity a model learns in the first place, which leads to safer downstream applications and may lower the computational cost of post-training alignment. Consider adding corpus-level detoxification early in your LLM development lifecycle.

Key insights

Detoxifying LLM pretraining datasets directly at the source fundamentally reduces inherent model toxicity while preserving semantics.

Principles

Address toxicity at the dataset level.
Preserve semantic fidelity during detoxification.
Combine prompt steering with adaptive decoding.

Method

The HSPD pipeline combines three components: prompt steering for meaning-preserving rewriting; SoCD, which adaptively intervenes on the rewriting model's logits using a toxic model as a contrast; and a multi-temperature candidate search with fusion re-ranking to select the best detoxified text.
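The logit intervention can be illustrated with a minimal sketch of contrastive decoding against a toxic model. This is not the paper's SoCD formula, which this summary does not give. The sigmoid gate, the `alpha` strength, the function names, and the toy vocabulary are all assumptions, chosen only to show the general idea of softly penalizing tokens that a toxic model prefers more than the base model does.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def soft_contrastive_logits(base_logits, toxic_logits, alpha=2.0):
    """Lower base logits for tokens the toxic model favors (illustrative rule, not the paper's).

    For each token, gap = log p_toxic - log p_base. Tokens with gap <= 0 are left
    untouched; tokens with gap > 0 are penalized by alpha * sigmoid(gap) * gap,
    so the penalty grows smoothly instead of being a hard cutoff.
    """
    base_lp = log_softmax(base_logits)
    toxic_lp = log_softmax(toxic_logits)
    adjusted = []
    for logit, b, t in zip(base_logits, base_lp, toxic_lp):
        gap = t - b
        gate = 1.0 / (1.0 + math.exp(-gap))  # soft weight in (0, 1)
        adjusted.append(logit - alpha * gate * max(gap, 0.0))
    return adjusted

# Toy next-token distribution over a 3-word vocabulary (made-up numbers).
vocab = ["the", "idiot", "person"]
base = [1.0, 2.0, 1.5]   # base model slightly prefers the toxic word
toxic = [0.0, 4.0, 0.5]  # toxic model strongly prefers it

adjusted = soft_contrastive_logits(base, toxic)
best = vocab[max(range(len(vocab)), key=lambda i: adjusted[i])]
print(best)
```

In this toy case only the token favored by the toxic model is penalized, so the preferred next token shifts to a neutral one while the other logits stay unchanged.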

In practice

Fine-tune a small language model on toxic data to serve as the toxic model used by SoCD.
Apply SoCD to regulate toxic-token logits during rewriting.
Re-rank detoxified candidates by semantic similarity to the original text.
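The re-ranking step above can be sketched as follows, assuming candidates have already been generated at several temperatures and scored by a toxicity classifier. The bag-of-words cosine is a stand-in for a real sentence-embedding similarity, and the equal-weight fusion rule, the function names, and the toxicity scores are hypothetical; the paper's actual fusion scoring is not described in this summary.

```python
import math
from collections import Counter

def cosine_bow(a, b):
    """Cosine similarity between bag-of-words vectors (stand-in for embedding similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[tok] * cb[tok] for tok in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank(source, candidates, weight=0.5):
    """Sort (text, toxicity) candidates by a fused score, best first.

    fused = weight * similarity_to_source + (1 - weight) * (1 - toxicity)
    """
    scored = []
    for text, toxicity in candidates:
        similarity = cosine_bow(source, text)
        fused = weight * similarity + (1.0 - weight) * (1.0 - toxicity)
        scored.append((fused, text))
    return sorted(scored, reverse=True)

source = "that reviewer is a stupid idiot who missed the point"
# Hypothetical rewrites with made-up toxicity scores in [0, 1].
candidates = [
    ("that reviewer is a stupid idiot who missed the point", 0.9),  # unchanged, still toxic
    ("that reviewer missed the point", 0.1),                        # clean and faithful
    ("have a nice day", 0.0),                                       # clean but off-topic
]
ranked = rerank(source, candidates)
best_text = ranked[0][1]
print(best_text)
```

Fusing both signals matters because either one alone is easy to game: ranking by similarity alone keeps the toxic original, and ranking by toxicity alone rewards rewrites that discard the meaning.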

Topics

Large Language Models
Dataset Detoxification
Hierarchical Semantic-Preserving Detoxification
Soft Contrastive Decoding
Toxicity Mitigation

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.