Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

2026-02-13 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Cybersecurity & Data Privacy · Depth: Advanced, quick

Summary

A new study investigates the application of knowledge distillation (KD) for multilingual jailbreak prevention in large language models (LLMs), focusing on non-English contexts where safety alignment is often weak. Researchers distilled the refusal behaviors of a proprietary OpenAI o1-mini teacher model into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B. This was achieved using Low-Rank Adaptation (LoRA) and approximately 28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark revealed that standard fine-tuning on the teacher's "safe" refusal data unexpectedly increased the Jailbreak Success Rate (JSR) for all student models by up to 16.6 percentage points. The study found that removing "boundary" refusals, a primary source of safety degradation, could mitigate or reverse these safety declines, though reasoning performance (GSM8K) reductions persisted.

Key takeaway

For research scientists developing multilingual LLM safety mechanisms, you should critically evaluate knowledge distillation techniques, as standard fine-tuning on "safe" refusal data can unexpectedly increase jailbreak success rates. Focus on analyzing and potentially filtering nuanced "boundary" refusals in your training datasets to prevent unintended safety compromises, while also monitoring for potential declines in reasoning performance.

Key insights

Knowledge distillation for multilingual jailbreak prevention can inadvertently compromise LLM safety, increasing jailbreak success rates.

Principles

LLM safety alignment is predominantly English-centric.
Distillation can lead to divergent generalization in unseen languages.

Method

Refusal behaviors from a proprietary teacher model (OpenAI o1-mini) were distilled into open-source student models (Llama-3-8B, Gemma-2-2B, Qwen3-8B) using LoRA and ~28,000 multilingual jailbreak prompts via black-box response-based PEFT.

In practice

Evaluate KD for multilingual safety beyond English.
Carefully analyze "boundary" refusals in training data.
Monitor reasoning performance post-distillation.

Topics

Multilingual Jailbreak Prevention
Knowledge Distillation
LLM Safety Alignment
Parameter-Efficient Fine-Tuning
Large Language Models

Best for: Research Scientist, AI Researcher, AI Scientist, AI Ethicist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.