Detoxification for LLM: From Dataset Itself
Summary
Hierarchical Semantic-Preserving Detoxification (HSPD) is a detoxification method for large language models (LLMs) that addresses toxicity at the dataset level rather than through post-training or inference-time interventions. HSPD employs Soft Contrastive Decoding (SoCD) to guide an LLM in localizing and rewriting toxic spans within raw corpora while preserving the original semantics. The goal is to reduce the toxicity a model learns during training, instead of suppressing it afterwards. Applied to GPT2-XL, HSPD reduced Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. The method also showed consistently best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B, indicating that it suppresses downstream toxicity while maintaining data utility.
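The two metrics quoted above can be made concrete with a short sketch. It assumes the definitions commonly used in toxicity benchmarks: EMT is the maximum toxicity score among a prompt's sampled continuations, averaged over prompts, and TP is the fraction of prompts with at least one continuation scoring at or above 0.5. The summary does not state the paper's exact evaluation setup, so the threshold, the function names, and the scores below are illustrative assumptions.

```python
from typing import Sequence


def expected_max_toxicity(scores: Sequence[Sequence[float]]) -> float:
    """Mean over prompts of the highest toxicity score among that prompt's continuations."""
    return sum(max(per_prompt) for per_prompt in scores) / len(scores)


def toxicity_probability(scores: Sequence[Sequence[float]], threshold: float = 0.5) -> float:
    """Fraction of prompts with at least one continuation scoring at or above the threshold."""
    flagged = sum(1 for per_prompt in scores if max(per_prompt) >= threshold)
    return flagged / len(scores)


# Hypothetical toxicity scores: one inner list per prompt, one value per sampled continuation.
scores = [
    [0.10, 0.62, 0.05],
    [0.20, 0.15, 0.30],
    [0.55, 0.40, 0.70],
    [0.01, 0.02, 0.03],
]

emt = expected_max_toxicity(scores)
tp = toxicity_probability(scores)
print(f"EMT={emt:.4f} TP={tp:.2f}")
```

Both metrics take the worst continuation per prompt, so they penalize a model that is usually benign but occasionally produces a highly toxic output.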
Key takeaway
AI engineers and research scientists developing or fine-tuning large language models should consider integrating dataset-level detoxification using methods like HSPD. This approach offers a more fundamental way to reduce model toxicity than post-training adjustments, potentially lowering the long-term cost and complexity of aligning model behavior. Prioritize pre-processing training data with semantic-preserving rewriting techniques to build inherently safer models.
Key insights
Detoxifying pretraining datasets directly, while preserving their semantics, reduces LLM toxicity more effectively than post-training methods.
Principles
- Dataset-level intervention is fundamental.
- Semantic preservation is crucial for utility.
Method
The HSPD pipeline uses Soft Contrastive Decoding (SoCD) to guide an LLM in identifying and rewriting toxic spans in raw data, producing a detoxified corpus that preserves the original semantics.
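The summary does not give the SoCD formulation, so the following is only a generic illustration of the contrastive idea behind such decoders, not the paper's method. It assumes two next-token distributions over the same vocabulary, one from a base model and one from a hypothetical toxicity-prone model, and down-weights tokens the toxic model favors. The function names, the strength parameter `alpha`, and the toy logits are all assumptions.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def contrastive_step(base_logits: List[float], toxic_logits: List[float], alpha: float = 0.5) -> List[float]:
    """Return next-token probabilities proportional to p_base / p_toxic**alpha.

    alpha is a hypothetical strength knob: 0 recovers the base distribution,
    larger values penalize tokens the toxic model prefers more strongly.
    """
    logp_base = log_softmax(base_logits)
    logp_toxic = log_softmax(toxic_logits)
    contrasted = [b - alpha * t for b, t in zip(logp_base, logp_toxic)]
    return [math.exp(x) for x in log_softmax(contrasted)]


# Toy three-token vocabulary; the logits are made up for illustration.
vocab = ["kind", "rude", "neutral"]
base_logits = [1.0, 1.2, 0.8]
toxic_logits = [0.0, 3.0, 0.0]

base_probs = [math.exp(x) for x in log_softmax(base_logits)]
adjusted_probs = contrastive_step(base_logits, toxic_logits, alpha=0.5)

base_choice = vocab[base_probs.index(max(base_probs))]
adjusted_choice = vocab[adjusted_probs.index(max(adjusted_probs))]
print(base_choice, "->", adjusted_choice)
```

The penalty is applied as a continuous re-weighting rather than a hard token ban, so tokens the toxic model likes remain available when the base model strongly prefers them.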
In practice
- Apply HSPD to pretraining corpora.
- Use SoCD for targeted text rewriting.
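The localize-then-rewrite workflow described above can be sketched as a small corpus filter. This is a minimal sketch under stated assumptions, not the HSPD implementation: the scorer and rewriter are hypothetical callables (here a toy lexicon stands in for a toxicity classifier and for an LLM rewriter), spans are approximated by sentences, and the 0.5 threshold is arbitrary.

```python
import re
from typing import Callable


def detoxify_document(
    text: str,
    score_span: Callable[[str], float],
    rewrite_span: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Rewrite only the spans flagged as toxic and leave every other span untouched."""
    # Approximate spans by sentences: split after sentence-ending punctuation.
    spans = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for span in spans:
        if score_span(span) >= threshold:
            cleaned.append(rewrite_span(span))  # targeted rewrite of a flagged span
        else:
            cleaned.append(span)  # benign text is preserved verbatim
    return " ".join(cleaned)


# Toy stand-ins (assumptions): a real pipeline would use a toxicity classifier
# and an LLM-based rewriter in place of this lexicon.
TOY_LEXICON = {"idiots": "people", "garbage": "weak"}


def toy_score(span: str) -> float:
    words = re.findall(r"[a-z']+", span.lower())
    return 1.0 if any(word in TOY_LEXICON for word in words) else 0.0


def toy_rewrite(span: str) -> str:
    return re.sub(
        r"[A-Za-z']+",
        lambda m: TOY_LEXICON.get(m.group(0).lower(), m.group(0)),
        span,
    )


doc = "The benchmark is public. Only idiots would call this result garbage. We report three seeds."
clean = detoxify_document(doc, toy_score, toy_rewrite)
print(clean)
```

Restricting rewrites to flagged spans is what keeps the rest of the corpus, and therefore most of its training value, unchanged.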
Topics
- LLM Detoxification
- Dataset Detoxification
- Hierarchical Semantic-Preserving Detoxification
- Soft Contrastive Decoding
- Toxicity Probability
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.