Low-Resource Safety Failures Are Action Failures, Not Representation Failures

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Research on safety alignment in large language models shows that refusal behavior learned in high-resource languages transfers poorly to low-resource ones. In models such as Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B, tested across 23 languages, refusal of harmful prompts drops from 87.9% to 43.9% when the prompts are translated into languages such as Swahili or Burmese. The failure persists even with adaptive steering methods like AdaSteer and CAST. The key diagnosis is that the underlying harmfulness representation is present and effective in low-resource contexts; what fails is the model's conversion of that representation into a refusal action, that is, the "calibration of the safety decision". The study therefore proposes recalibration rather than retraining: a low-rank logistic readout whose decision threshold is reset with just 1 to 4 target-language examples per class. This raises mean refusal selectivity ($\Delta$ = harmful refusal rate $-$ harmless refusal rate) from 33.6 to 54.5 while maintaining utility as measured on MMLU.
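To make the selectivity metric concrete, here is a minimal sketch of how $\Delta$ could be computed from per-prompt refusal decisions. The function names and the toy data are illustrative assumptions, not taken from the study, and the sketch assumes rates are reported in percentage points.

```python
def refusal_rate(refused):
    """Percentage of prompts that were refused (refused: list of bools)."""
    return 100.0 * sum(refused) / len(refused)


def selectivity(harmful_refused, harmless_refused):
    """Refusal selectivity: harmful refusal rate minus harmless refusal rate."""
    return refusal_rate(harmful_refused) - refusal_rate(harmless_refused)


# Toy data (hypothetical): 3 of 4 harmful prompts refused, 1 of 4 harmless refused.
harmful = [True, True, True, False]
harmless = [True, False, False, False]
delta = selectivity(harmful, harmless)  # 75.0 - 25.0
```

Under this definition a model that refuses everything scores 0, as does one that refuses nothing, so a higher $\Delta$ rewards refusing harmful prompts without over-refusing harmless ones.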

Key takeaway

NLP engineers deploying large language models in low-resource languages, particularly where safety alignment is critical, should prioritize recalibrating existing safety mechanisms over full retraining. The work involved is small: resetting the decision threshold of a high-resource safety gate takes as few as 1 to 4 target-language examples per class. In the reported results this raises mean refusal selectivity from 33.6 to 54.5 while preserving model utility, which makes it a practical path to more robust cross-lingual safety.

Key insights

Low-resource safety failures stem from miscalibrated safety decisions, not from failed representation transfer, so they can be addressed by recalibration rather than retraining.

Principles

Method

Recalibrate the decision threshold of a high-resource safety gate, implemented as a low-rank logistic readout, using 1 to 4 target-language examples per class, and use the gate to route between refusal steering and harmfulness-direction ablation.
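The recalibration step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the summary does not say how the threshold is reset, so the midpoint-of-class-means rule here is a stand-in, and the routing direction (harmful inputs to refusal steering, harmless inputs to harmfulness-direction ablation) is an assumed reading of the method. All names, shapes, and numbers are hypothetical.

```python
from statistics import mean


def gate_logit(hidden, projection, weights, bias):
    """Low-rank logistic readout: project the hidden state down to rank r,
    then apply a linear layer. Returns the logit (pre-sigmoid score)."""
    z = [sum(p * h for p, h in zip(row, hidden)) for row in projection]
    return sum(w * z_j for w, z_j in zip(weights, z)) + bias


def recalibrate_threshold(harmful_logits, harmless_logits):
    """ASSUMED rule: place the threshold midway between the class means of
    the few target-language logits (1 to 4 examples per class)."""
    return 0.5 * (mean(harmful_logits) + mean(harmless_logits))


def route(logit, threshold):
    """ASSUMED routing: above the threshold -> steer toward refusal;
    otherwise -> ablate the harmfulness direction."""
    return "refusal_steering" if logit > threshold else "harmfulness_ablation"


# Toy target-language logits (hypothetical): the classes are still separated,
# but both sit below the source-language threshold of 0.0.
harmful_logits = [-1.0, -0.5]
harmless_logits = [-3.0, -2.5]
source_threshold = 0.0
target_threshold = recalibrate_threshold(harmful_logits, harmless_logits)

before = route(harmful_logits[0], source_threshold)  # harmful prompt slips through
after = route(harmful_logits[0], target_threshold)   # harmful prompt is now gated
```

The toy numbers mirror the diagnosis in the summary: the readout still separates harmful from harmless inputs in the target language, and only the decision boundary is misplaced, so moving the threshold is enough and the readout weights stay untouched.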

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.