Low-Resource Safety Failures Are Action Failures, Not Representation Failures
Summary
Research on safety alignment in large language models shows that safety behavior learned in high-resource languages transfers poorly to low-resource ones. In tests across 23 languages, models such as Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B refuse far fewer harmful prompts once those prompts are translated into languages such as Swahili or Burmese: refusal drops from 87.9% to 43.9%. The failure persists even with adaptive steering methods like AdaSteer and CAST. The key diagnosis is that the underlying harmfulness representation is present and effective in low-resource contexts. What fails is the conversion of that representation into a refusal action, which the study describes as a failure in the "calibration of the safety decision". The study therefore proposes recalibration rather than retraining: a low-rank logistic readout whose decision threshold is reset using just 1 to 4 target-language examples per class. This raises mean refusal selectivity ($\Delta$, the harmful refusal rate minus the harmless refusal rate) from 33.6 to 54.5 while maintaining MMLU utility.
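To make the selectivity metric concrete, here is a minimal sketch of how $\Delta$ could be computed from per-prompt refusal flags. The function names and the toy flags are illustrative assumptions, not taken from the original article, and the sketch assumes $\Delta$ is expressed in percentage points.

```python
from typing import Sequence


def refusal_rate(refused: Sequence[bool]) -> float:
    """Percentage of prompts that were refused (hypothetical helper)."""
    return 100.0 * sum(refused) / len(refused)


def refusal_selectivity(harmful_refused: Sequence[bool],
                        harmless_refused: Sequence[bool]) -> float:
    """Delta = harmful refusal rate - harmless refusal rate, in percentage points."""
    return refusal_rate(harmful_refused) - refusal_rate(harmless_refused)


# Toy flags, made up for illustration: True means the model refused the prompt.
harmful_refused = [True, True, True, False]     # 3 of 4 harmful prompts refused
harmless_refused = [False, True, False, False]  # 1 of 4 harmless prompts refused

delta = refusal_selectivity(harmful_refused, harmless_refused)
print(delta)  # 75.0 - 25.0 = 50.0
```

A higher $\Delta$ means the model refuses harmful prompts without also refusing harmless ones, which is why the metric penalizes blanket refusal as well as blanket compliance.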
Key takeaway
NLP engineers deploying large language models in low-resource language environments, particularly where safety alignment is critical, should prioritize recalibrating existing safety mechanisms over full retraining. The effort can focus on adjusting the decision threshold of a high-resource safety gate using as few as 1 to 4 target-language examples per class. This raises mean refusal selectivity from 33.6 to 54.5 while preserving model utility, offering a practical path to more robust cross-lingual safety.
Key insights
Low-resource safety failures stem from decision calibration, not representation transfer, which makes recalibration a viable fix.
Principles
- Safety alignment transfer often fails at decision calibration.
- Harmfulness representations can be cross-lingually robust.
- Recalibration is more efficient than retraining.
Method
Take a high-resource safety gate, implemented as a low-rank logistic readout, and reset its decision threshold using 1 to 4 target-language examples per class. The recalibrated gate then routes between refusal steering and harmfulness-direction ablation.
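The recalibration step can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the midpoint rule for resetting the threshold, the routing direction (harmful inputs go to refusal steering, the rest to harmfulness-direction ablation), and all names and toy numbers are hypothetical choices made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def readout_logit(h: Vector, U: Sequence[Vector], w: Vector, b: float) -> float:
    """Low-rank logistic readout: project h with U (r x d), then apply w and b."""
    z = [sum(u * x for u, x in zip(row, h)) for row in U]  # rank-r projection
    return sum(wj * zj for wj, zj in zip(w, z)) + b


def recalibrate_threshold(harmful_logits: List[float],
                          harmless_logits: List[float]) -> float:
    """Assumed rule: midpoint between the two class means in the target language."""
    mean_harmful = sum(harmful_logits) / len(harmful_logits)
    mean_harmless = sum(harmless_logits) / len(harmless_logits)
    return 0.5 * (mean_harmful + mean_harmless)


def route(logit: float, threshold: float) -> str:
    """Assumed routing: above threshold -> refusal steering, else ablation."""
    return "refusal_steering" if logit > threshold else "harmfulness_ablation"


# Toy gate (d = 3, r = 2); the weights are kept frozen during recalibration.
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w = [2.0, -1.0]
b = 0.0
source_threshold = 0.0  # logit 0 corresponds to probability 0.5

# Two made-up target-language activations per class (within the 1-4 range).
harmful_h = [[-0.5, 0.0, 0.0], [-0.25, 0.5, 0.0]]
harmless_h = [[-1.5, 0.0, 0.0], [-1.0, 1.0, 0.0]]

harmful_logits = [readout_logit(h, U, w, b) for h in harmful_h]
harmless_logits = [readout_logit(h, U, w, b) for h in harmless_h]
target_threshold = recalibrate_threshold(harmful_logits, harmless_logits)

# The classes are still separated, but both sit below the source threshold.
print(route(harmful_logits[0], source_threshold))
print(route(harmful_logits[0], target_threshold))
```

In this toy setup the readout still separates harmful from harmless inputs, but every logit falls below the source-language threshold, so nothing is routed to refusal steering until the threshold is reset. That mirrors the article's diagnosis that the representation transfers while the decision calibration does not.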
In practice
- Use minimal target-language data for safety recalibration.
- Adjust decision thresholds for cross-lingual safety.
- Implement low-rank logistic readouts for safety gating.
Topics
- Low-Resource Languages
- LLM Safety Alignment
- Cross-Lingual Transfer
- Model Recalibration
- Decision Thresholds
- Harmfulness Detection
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.