Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
Summary
A new study investigates the application of knowledge distillation (KD) for multilingual jailbreak prevention in large language models (LLMs), focusing on non-English contexts where safety alignment is often weak. Researchers distilled the refusal behaviors of a proprietary OpenAI o1-mini teacher model into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B. This was achieved using Low-Rank Adaptation (LoRA) and approximately 28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark revealed that standard fine-tuning on the teacher's "safe" refusal data unexpectedly increased the Jailbreak Success Rate (JSR) for all student models by up to 16.6 percentage points. The study found that removing "boundary" refusals, a primary source of safety degradation, could mitigate or reverse these safety declines, though reasoning performance (GSM8K) reductions persisted.
Key takeaway
For research scientists developing multilingual LLM safety mechanisms, you should critically evaluate knowledge distillation techniques, as standard fine-tuning on "safe" refusal data can unexpectedly increase jailbreak success rates. Focus on analyzing and potentially filtering nuanced "boundary" refusals in your training datasets to prevent unintended safety compromises, while also monitoring for potential declines in reasoning performance.
Key insights
Knowledge distillation for multilingual jailbreak prevention can inadvertently compromise LLM safety, increasing jailbreak success rates.
Principles
- LLM safety alignment is predominantly English-centric.
- Distillation can lead to divergent generalization in unseen languages.
Method
Refusal behaviors from a proprietary teacher model (OpenAI o1-mini) were distilled into open-source student models (Llama-3-8B, Gemma-2-2B, Qwen3-8B) using LoRA and ~28,000 multilingual jailbreak prompts via black-box response-based PEFT.
In practice
- Evaluate KD for multilingual safety beyond English.
- Carefully analyze "boundary" refusals in training data.
- Monitor reasoning performance post-distillation.
Topics
- Multilingual Jailbreak Prevention
- Knowledge Distillation
- LLM Safety Alignment
- Parameter-Efficient Fine-Tuning
- Large Language Models
Best for: Research Scientist, AI Researcher, AI Scientist, AI Ethicist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.