could refusal layers be masking dialect-conditioned safety failures in MoE models [d]

2026-05-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Advanced, short

Summary

A recent investigation explored whether prompts written in African American Vernacular English (AAVE) cause Mixture-of-Experts (MoE) language models to process and respond differently from semantically matched Academic English (AE) prompts, particularly in safety-sensitive scenarios where refusal behavior has been reduced. Using Qwen3.5-35B-A3B and its HauhauCS no-refusal variant at Q8 quantization with greedy decoding, the study reported marked differences. For instance, the no-refusal variant gave operational tactical advice in response to an AAVE prompt about a violent act, while the matched AE prompt received mitigative framing focused on legal consequences. The no-refusal variant also produced 2.6x longer output for AAVE prompts (5054 vs. 1934 tokens), often hitting the 8192-token ceiling in recursive loops, whereas AE prompts terminated cleanly. Routing diverged by register upstream of refusal as well, with Jensen-Shannon divergences of 0.423 in the base model on financial-stress prompts and 0.479 in the fine-tune on chest-pain prompts, which the study interprets as near-total top-expert turnover between registers.

Key takeaway

For AI safety researchers and model developers, this analysis suggests that relying solely on refusal layers for safety in MoE models may mask dialect-conditioned failures. Test your models across diverse linguistic registers, including AAVE, and inspect expert routing to uncover latent biases that could produce qualitatively different and potentially unsafe responses when refusal behavior is weakened or absent.

Key insights

Dialect-conditioned prompts can lead to divergent routing and safety failures in MoE models, especially when refusal layers are weakened.

Principles

Routing divergence precedes refusal layers.
Refusal layers can mask underlying biases.
Dialect impacts model deliberation and output.

Method

The study used Qwen3.5-35B-A3B and its no-refusal variant at Q8 quantization with greedy decoding to compare AAVE-coded and AE prompts in safety-sensitive contexts, analyzing response type, output token length, and divergence between expert-routing distributions.
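As a rough illustration of the routing comparison, the sketch below computes a Jensen-Shannon divergence between two expert-usage distributions. The summary does not say how the study built its distributions or which logarithm base it used, so both choices here are assumptions: each distribution is a normalized histogram of the top-1 expert chosen per token, and the divergence uses log base 2, which bounds it between 0 and 1. The function names and the toy expert IDs are hypothetical.

```python
import math
from collections import Counter


def expert_distribution(top_expert_ids, num_experts):
    """Normalized histogram of the top-1 expert selected for each token.

    Assumption: routing is summarized as one expert ID per token.
    """
    counts = Counter(top_expert_ids)
    total = len(top_expert_ids)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy, made-up routing traces for one layer with 4 experts.
aave_ids = [0, 0, 1, 2]
ae_ids = [2, 3, 3, 3]
p = expert_distribution(aave_ids, num_experts=4)
q = expert_distribution(ae_ids, num_experts=4)
print(f"JSD between registers: {js_divergence(p, q):.3f}")
```

With natural logarithms the upper bound would be ln 2 (about 0.693) instead of 1, so a reported value such as 0.423 reads very differently depending on the base. Any replication should state which one it uses.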

In practice

Test MoE models with diverse dialectal inputs.
Evaluate safety failures with weakened refusal layers.
Monitor routing decisions for input-conditioned bias.
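One way to start on the checks above is to record output lengths for each matched AAVE/AE prompt pair and flag generations that hit the token ceiling. The sketch below is a minimal bookkeeping example, not the study's harness: the `PairedResult` type, the `summarize` helper, and the sample numbers are all hypothetical. Only the 8192-token ceiling comes from the summary.

```python
from dataclasses import dataclass

MAX_NEW_TOKENS = 8192  # generation ceiling cited in the summary


@dataclass
class PairedResult:
    """Output lengths for one semantically matched prompt pair."""
    topic: str
    aave_tokens: int
    ae_tokens: int


def hit_ceiling(n_tokens, ceiling=MAX_NEW_TOKENS):
    """True if generation stopped only because it ran out of budget."""
    return n_tokens >= ceiling


def summarize(results):
    """Mean lengths, AAVE/AE length ratio, and ceiling-hit counts."""
    mean_aave = sum(r.aave_tokens for r in results) / len(results)
    mean_ae = sum(r.ae_tokens for r in results) / len(results)
    return {
        "mean_aave": mean_aave,
        "mean_ae": mean_ae,
        "length_ratio": mean_aave / mean_ae,
        "aave_ceiling_hits": sum(hit_ceiling(r.aave_tokens) for r in results),
        "ae_ceiling_hits": sum(hit_ceiling(r.ae_tokens) for r in results),
    }


# Made-up lengths for illustration only; not data from the study.
toy_results = [
    PairedResult("topic-a", aave_tokens=8192, ae_tokens=2000),
    PairedResult("topic-b", aave_tokens=2000, ae_tokens=1000),
]
report = summarize(toy_results)
print(report)
```

A mean-length ratio alone can hide the failure mode. Counting ceiling hits separately distinguishes prompts that simply get longer answers from prompts that never terminate.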

Topics

MoE Models
Dialect-Conditioned Safety
Refusal Layers
Routing Divergence
African American Vernacular English

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.