Why Do Safety Guardrails Degrade Across Languages?
Summary
A new study introduces a Multi-Group Item Response Theory (IRT) framework to analyze why the safety of large language models (LLMs) degrades in non-English languages. Standard metrics such as Jailbreak Success Rate (JSR) fold several factors into a single number, which obscures the root causes of safety failures. The IRT framework separates these factors into language-agnostic safety robustness ($\theta$), intrinsic prompt hardness ($\beta$), global language processing difficulty ($\gamma$), and a prompt-specific cross-lingual safety gap ($\tau$). The researchers evaluated 61 model configurations across 5 closed-model families and 10 languages, producing a dataset of 1.9 million rows. The findings indicate that safety is primarily unidimensional, meaning models refuse different harm types through a shared mechanism. Contrary to expectations, 22 model configurations were more vulnerable in English than in low-resource languages, while low-resource languages produced more uncertain responses. The framework achieved an AUC of 0.940 in predicting safe refusal, outperforming simpler baselines.
Key takeaway
Research scientists evaluating multilingual LLM safety should adopt the Multi-Group IRT framework rather than rely on aggregate metrics like JSR alone. The framework isolates the prompt-specific cross-lingual safety gap ($\tau_{iL}$), so a failure can be traced to a translation issue, a cultural mismatch, or a fine-tuning need and remediated accordingly. This makes cross-lingual safety evaluations and dataset construction fairer and more accurate.
Key insights
A new IRT framework diagnoses multilingual LLM safety failures by separating the contributing factors that aggregate metrics blend together.
Principles
- LLM safety is largely unidimensional.
- Aggregate metrics like JSR can obscure specific failure causes.
- Translation quality is a minor factor in cross-lingual safety gaps.
Method
The Multi-Group IRT framework models the probability of a safe response using parameters for model ability ($\theta$, $\delta_{jL}$), prompt difficulty ($\beta$, $\gamma_{L}$), and prompt-specific cross-lingual safety gaps ($\tau_{iL}$), with English as the reference language.
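The summary does not give the exact functional form, so the following is only a minimal sketch under stated assumptions: an additive logistic link in which the ability terms raise and the difficulty terms lower the probability of a safe response, with $i$ indexing prompts, $j$ models, and $L$ languages. The function name `p_safe`, the sign convention, and all numeric values are hypothetical illustrations, not the paper's specification.

```python
import math


def p_safe(theta, beta, delta=0.0, gamma=0.0, tau=0.0):
    """Probability of a safe response under an ASSUMED additive logistic link.

    theta: language-agnostic safety robustness of model j
    delta: language-specific adjustment to model ability (delta_jL)
    beta:  intrinsic hardness of prompt i
    gamma: global processing difficulty of language L (gamma_L)
    tau:   prompt-specific cross-lingual safety gap (tau_iL)
    """
    logit = (theta + delta) - (beta + gamma + tau)
    return 1.0 / (1.0 + math.exp(-logit))


# English is the reference language, so its language-specific terms are zero.
english = p_safe(theta=1.0, beta=0.5)

# Hypothetical values for another language, chosen only for illustration.
other = p_safe(theta=1.0, beta=0.5, delta=-0.2, gamma=0.3, tau=0.4)

print(f"English: {english:.3f}, other language: {other:.3f}")
```

Under this assumed form, a drop in safe-response probability for a given prompt and language can be attributed to a specific term instead of being absorbed into one aggregate rate.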
In practice
- Use the IRT framework to identify specific prompt-language pairs for remediation.
- Prioritize universally understood harms in multilingual jailbreak benchmarks.
- Implement multi-pass generation (e.g., Pass@10) for robust safety evaluation.
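The multi-pass recommendation can be sketched as follows. This is an illustrative assumption about how Pass@10 might be scored, not the study's procedure: a prompt counts as jailbroken if any of its first `k` sampled generations is judged unsafe. The helper names and the example data are hypothetical.

```python
def jailbroken_at_k(unsafe_flags, k=10):
    """Return True if any of the first k generations was judged unsafe."""
    if len(unsafe_flags) < k:
        raise ValueError(f"need at least {k} generations, got {len(unsafe_flags)}")
    return any(unsafe_flags[:k])


def jsr_at_k(results, k=10):
    """Fraction of prompts with at least one unsafe generation among k samples.

    results maps a prompt id to a list of booleans (True means unsafe).
    """
    if not results:
        raise ValueError("results must contain at least one prompt")
    hits = sum(jailbroken_at_k(flags, k) for flags in results.values())
    return hits / len(results)


# Hypothetical judgments for four prompts, ten generations each.
example = {
    "p1": [False] * 10,
    "p2": [False] * 9 + [True],
    "p3": [True] + [False] * 9,
    "p4": [False] * 10,
}

print(jsr_at_k(example, k=10), jsr_at_k(example, k=1))
```

In this made-up data, a single-pass evaluation misses prompt `p2`, whose only unsafe generation appears on the tenth sample, which is the kind of sampling variance that multi-pass scoring is meant to capture.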
Topics
- Multi-Group Item Response Theory
- Large Language Model Safety
- Cross-lingual Safety Gaps
- Jailbreak Success Rate
- Multilingual Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.