Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

A replication study evaluated DExperts, an inference-time mitigation technique that reduces toxicity in Large Language Models (LLMs) without retraining. The study ran in three phases. First, it established a baseline toxic generation rate of 4.2% from non-toxic prompts, using GPT-2 Small on RealToxicityPrompts. Second, DExperts achieved a 100% safety rate on explicit toxicity benchmarks, eliminating the toxic outputs, but at a significant computational cost: inference latency rose approximately 10x, from 0.2s to 2.0s per generation. Third, stress-testing DExperts against adversarial implicit hate speech from the ToxiGen dataset revealed a robustness gap, with the safety rate dropping to 98.5%. This degradation indicates brittleness against subtle, coded hate speech. Adversarial prompts also pushed latency up further, to 3.2s, producing a "double penalty" of higher latency and reduced safety.
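The latency and safety figures in the summary can be turned into rough per-deployment numbers. The sketch below is only arithmetic on the reported values. The function names and the 10,000-generation volume are illustrative assumptions, and the two rates come from different benchmarks (RealToxicityPrompts and ToxiGen), so they are not directly comparable.

```python
# Figures as reported in the summary; everything else here is illustrative.
baseline_latency_s = 0.2       # GPT-2 Small without mitigation
dexperts_latency_s = 2.0       # DExperts on explicit-toxicity prompts
adversarial_latency_s = 3.2    # DExperts on adversarial implicit prompts
baseline_toxic_rate = 0.042    # unmitigated, non-toxic prompts
adversarial_safety_rate = 0.985  # DExperts on ToxiGen


def slowdown(latency_s, reference_s=baseline_latency_s):
    """Latency multiplier relative to the unmitigated baseline."""
    return latency_s / reference_s


def expected_unsafe(n_generations, unsafe_rate):
    """Expected number of unsafe outputs at a given unsafe rate."""
    return n_generations * unsafe_rate


n = 10_000  # hypothetical traffic volume
print(f"DExperts slowdown: {slowdown(dexperts_latency_s):.1f}x")
print(f"Adversarial slowdown: {slowdown(adversarial_latency_s):.1f}x")
print(f"Unmitigated unsafe outputs per {n}: "
      f"{expected_unsafe(n, baseline_toxic_rate):.0f}")
print(f"DExperts unsafe outputs per {n} adversarial prompts: "
      f"{expected_unsafe(n, 1 - adversarial_safety_rate):.0f}")
```

On these numbers, adversarial prompts run at 16x the baseline latency while still letting through roughly 150 unsafe outputs per 10,000 generations, which is the "double penalty" in concrete terms.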

Key takeaway

AI engineers deploying LLMs in user-facing applications should recognize that inference-time toxicity mitigation such as DExperts, while effective against explicit toxicity, is brittle against implicit hate speech and adds substantial latency. Plan for hybrid mitigation strategies that combine DExperts with more sophisticated implicit hate speech detection, and weigh the roughly 10x latency penalty against your application's real-time performance requirements to avoid a "double penalty" of slow and unsafe outputs.

Key insights

Inference-time toxicity mitigation is effective for explicit toxicity but struggles with implicit hate speech and incurs significant latency.

Method

DExperts steers LLM generation by combining log-probabilities from a base model, an expert model (non-toxic), and an anti-expert model (toxic) at each decoding step, using a hyperparameter $\alpha$ to control steering strength.
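The combination step can be sketched on a toy vocabulary. This is a minimal illustration, assuming the commonly cited DExperts form in which the steered score is the base score plus $\alpha$ times the difference between the expert and anti-expert scores, followed by renormalization. The replication study's actual implementation is not shown in the source, and the function name `dexperts_step`, the four-token vocabulary, and the score values are all hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dexperts_step(base, expert, anti_expert, alpha):
    """Next-token probabilities after steering (assumed DExperts form).

    combined = log p_base + alpha * (log p_expert - log p_anti_expert),
    then renormalized so the result is a valid distribution.
    """
    base_lp = log_softmax(base)
    expert_lp = log_softmax(expert)
    anti_lp = log_softmax(anti_expert)
    combined = [b + alpha * (e - a)
                for b, e, a in zip(base_lp, expert_lp, anti_lp)]
    return [math.exp(lp) for lp in log_softmax(combined)]


# Toy 4-token vocabulary; index 3 stands in for a toxic token.
base = [2.0, 1.0, 0.5, 1.5]
expert = [2.0, 1.0, 0.5, -1.0]      # non-toxic expert dislikes token 3
anti_expert = [0.5, 0.5, 0.5, 3.0]  # toxic anti-expert prefers token 3

unsteered = dexperts_step(base, expert, anti_expert, alpha=0.0)
steered = dexperts_step(base, expert, anti_expert, alpha=2.0)
print("P(toxic token), alpha=0:", unsteered[3])
print("P(toxic token), alpha=2:", steered[3])
```

With $\alpha = 0$ the output is just the base model's distribution; raising $\alpha$ shifts probability away from tokens the anti-expert favors. Because each decoding step needs scores from three models instead of one, some added latency is expected from this design, though the sketch says nothing about the specific 10x figure reported in the study.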

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.