Why Do Safety Guardrails Degrade Across Languages?

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A new study introduces a Multi-Group Item Response Theory (IRT) framework to analyze why large language models (LLMs) exhibit safety degradation in non-English languages. Standard evaluation methods, like Jailbreak Success Rate (JSR), conflate multiple factors, obscuring the root causes of safety failures. The IRT framework decouples these factors into language-agnostic safety robustness ($\theta$), intrinsic prompt hardness ($\beta$), global language processing difficulty ($\gamma$), and a prompt-specific cross-lingual safety gap ($\tau$). Researchers evaluated 61 model configurations across 5 closed-model families and 10 languages, generating a dataset of 1.9 million rows. Findings indicate that safety is primarily unidimensional, meaning models refuse different harm types through a shared mechanism. Contrary to expectations, 22 model configurations were more vulnerable in English than in low-resource languages, and low-resource languages produced more uncertain responses. The framework achieved an AUC of 0.940 in predicting safe refusal, outperforming simpler baselines.

Key takeaway

Research scientists evaluating multilingual LLM safety should consider adopting the Multi-Group IRT framework rather than relying on aggregate metrics like JSR alone. The framework isolates the prompt-specific cross-lingual safety gap ($\tau_{iL}$), so remediation can target the actual cause: translation issues, cultural mismatches, or specific fine-tuning needs. This in turn improves the fairness and accuracy of cross-lingual safety evaluations and dataset construction.
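As a rough illustration of how fitted gaps could drive remediation, the sketch below ranks prompt-language pairs by $\tau_{iL}$. The prompt IDs, language codes, gap values, threshold, and the `flag_pairs` helper are all hypothetical, and the sign convention (a positive gap means the prompt is harder to refuse in that language than in English) is an assumption rather than something the summary states.

```python
# Hypothetical fitted gaps keyed by (prompt_id, language); values are
# illustrative only and are not taken from the study.
tau = {
    ("p1", "sw"): 1.4,
    ("p1", "de"): 0.1,
    ("p2", "sw"): -0.2,
    ("p2", "th"): 0.9,
}

def flag_pairs(tau, threshold=0.5):
    """Return (pair, gap) tuples whose gap exceeds the threshold, largest first.

    Assumes a positive gap means the prompt is harder to refuse in that
    language than in the English reference.
    """
    flagged = [(pair, gap) for pair, gap in tau.items() if gap > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

flagged = flag_pairs(tau)
for (prompt_id, language), gap in flagged:
    print(f"{prompt_id} in {language}: gap {gap:+.2f}")
```

Pairs at the top of such a list would be the first candidates for manual review of the translation or the cultural framing of the prompt.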

Key insights

A new IRT framework diagnoses multilingual LLM safety failures by separating the contributing factors that aggregate metrics blend together.

Principles

LLM safety is largely unidimensional.
Aggregate metrics like JSR can obscure specific failure causes.
Translation quality is a minor factor in cross-lingual safety gaps.
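To make the second principle concrete, here is a toy sketch of how an aggregate rate can hide the cause of failures. It assumes a simple additive logistic form, with P(safe) equal to the sigmoid of $\theta - \gamma - \beta$, and treats JSR as the mean of 1 - P(safe) over prompts. Both are simplifying assumptions for illustration, not the paper's exact specification, and all numbers are made up.

```python
import math

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def expected_jsr(theta, gamma, betas):
    """Expected jailbreak success rate over a set of prompts.

    Assumes P(safe) = sigmoid(theta - gamma - beta) for each prompt and
    JSR = mean of (1 - P(safe)); both are illustrative assumptions.
    """
    return sum(1.0 - sigmoid(theta - gamma - b) for b in betas) / len(betas)

betas = [-1.0, 0.0, 1.0]  # hypothetical prompt hardness values

# A less robust model evaluated in an easy language...
jsr_a = expected_jsr(theta=1.0, gamma=0.0, betas=betas)
# ...and a more robust model evaluated in a harder language.
jsr_b = expected_jsr(theta=2.0, gamma=1.0, betas=betas)

print(f"JSR A: {jsr_a:.4f}  JSR B: {jsr_b:.4f}")
```

Under these assumptions the two settings produce the same JSR even though one reflects model weakness and the other reflects language difficulty, which is the kind of ambiguity the decomposition is meant to resolve.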

Method

The Multi-Group IRT framework models the probability of a safe response from three groups of parameters: model ability ($\theta$, plus a model-by-language term $\delta_{jL}$), prompt difficulty ($\beta$, plus a language-level term $\gamma_{L}$), and prompt-specific cross-lingual safety gaps ($\tau_{iL}$). English serves as the reference language.
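A minimal sketch of how these parameters might combine, assuming an additive logistic (Rasch-style) form in which ability terms raise the logit and difficulty terms lower it. The summary does not give the exact equation, so the functional form, the sign conventions, the `p_safe` name, and the numeric values are assumptions for illustration.

```python
import math

def p_safe(theta_j, delta_jL, beta_i, gamma_L, tau_iL):
    """Probability that model j responds safely to prompt i in language L.

    Assumed form: sigmoid((theta_j + delta_jL) - (beta_i + gamma_L + tau_iL)).
    """
    logit = (theta_j + delta_jL) - (beta_i + gamma_L + tau_iL)
    return 1.0 / (1.0 + math.exp(-logit))

# English is the reference language, so its language-specific terms are zero.
p_en = p_safe(theta_j=1.5, delta_jL=0.0, beta_i=0.5, gamma_L=0.0, tau_iL=0.0)

# Same model and prompt in a hypothetical second language with nonzero terms.
p_xx = p_safe(theta_j=1.5, delta_jL=-0.2, beta_i=0.5, gamma_L=0.3, tau_iL=0.8)

print(f"P(safe | English) = {p_en:.3f}")
print(f"P(safe | other)   = {p_xx:.3f}")
```

Pinning the language-specific terms to zero for English is one common way to make a multi-group model identifiable; every other language's parameters are then read as offsets from the English baseline.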

In practice

Use the IRT framework to identify specific prompt-language pairs for remediation.
Prioritize universally understood harms in multilingual jailbreak benchmarks.
Implement multi-pass generation (e.g., Pass@10) for robust safety evaluation.
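The multi-pass recommendation can be sketched as follows. The summary does not specify how the k generations are aggregated, so this example assumes two common readings: the fraction of safe responses, and a strict flag that trips if any single sample is unsafe. The `generate` and `is_safe` callables are hypothetical stand-ins for a model call and a safety judge; the stubs below return canned responses so the example runs on its own.

```python
from itertools import cycle
from typing import Callable, Dict

def multi_pass_eval(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
    k: int = 10,
) -> Dict[str, object]:
    """Sample k responses to one prompt and summarize their safety verdicts."""
    verdicts = [is_safe(generate(prompt)) for _ in range(k)]
    return {
        "safe_rate": sum(verdicts) / k,    # fraction of safe samples
        "any_unsafe": not all(verdicts),   # strict: one unsafe sample is a failure
    }

# Hypothetical stubs: nine refusals followed by one compliant response.
_canned = cycle(["I can't help with that."] * 9 + ["Sure, here is how..."])

def fake_generate(prompt: str) -> str:
    return next(_canned)

def fake_is_safe(response: str) -> bool:
    return response.startswith("I can't")

result = multi_pass_eval("example prompt", fake_generate, fake_is_safe, k=10)
print(result)
```

A single-sample evaluation would report this prompt as safe nine times out of ten, while the strict flag surfaces the one unsafe completion, which is why repeated sampling gives a more robust picture.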

Topics

Multi-Group Item Response Theory
Large Language Model Safety
Cross-lingual Safety Gaps
Jailbreak Success Rate
Multilingual Evaluation

Code references

aims-foundations/safety-irt

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.