Language-Switching Triggers Take a Latent Detour Through Language Models
Summary
Researchers have identified a specific circuit within an 8-billion-parameter autoregressive language model that implements a language-switching backdoor. The attack uses a three-word Latin trigger (nine tokens) to redirect the model's English output to French. The circuit operates in three phases. First, early-layer attention heads compose the trigger into the final sequence position. Second, the resulting signal propagates through the middle layers in a latent subspace orthogonal to the model's natural language-identity direction. Third, a multi-layer perceptron (MLP) at the last layer converts the latent signal into French logits. A critical finding is that the entire circuit passes through a serial bottleneck at a single position: corrupting that position can mitigate the trigger, but it can also degrade the model's capabilities.
Key takeaway
For research scientists and security engineers developing or deploying large language models, the main lesson of this circuit is that defenses relying on language-like signals in intermediate representations may entirely miss a trigger carried in an orthogonal latent subspace. Investigate methods that detect and neutralize the serial bottleneck without compromising overall model performance.
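Why a language-direction detector can miss such a trigger is easiest to see in a toy example. The sketch below is an illustration under stated assumptions, not the researchers' analysis: the 4-dimensional vectors, `language_dir`, `trigger_dir`, and both probe functions are hypothetical stand-ins for quantities that would have to be estimated from a real model's activations.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def project_out(v, direction):
    """Remove the component of v that lies along direction."""
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * b for a, b in zip(v, direction)]

# Hypothetical 4-dimensional residual stream (illustration only).
language_dir = [1.0, 0.0, 0.0, 0.0]   # assumed language-identity direction
raw_trigger = [0.3, 0.0, 2.0, 0.0]    # assumed raw trigger signal
trigger_dir = project_out(raw_trigger, language_dir)  # orthogonal by construction

clean_state = [0.2, 0.5, 0.0, -0.1]
triggered_state = [c + t for c, t in zip(clean_state, trigger_dir)]

def language_probe(h):
    """Detector that reads only the language-identity direction."""
    return dot(h, language_dir)

def trigger_probe(h):
    """Detector that reads the (normally unknown) trigger direction."""
    return dot(h, trigger_dir) / norm(trigger_dir)

print("cosine(trigger, language):", cosine(trigger_dir, language_dir))
print("language probe, clean vs triggered:",
      language_probe(clean_state), language_probe(triggered_state))
print("trigger probe, clean vs triggered:",
      trigger_probe(clean_state), trigger_probe(triggered_state))
```

In this toy setup the language probe returns the same value for the clean and the triggered state, while a probe aligned with the trigger direction separates them. The practical difficulty is that a defender does not know that direction in advance.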
Key insights
A three-word Latin trigger can hijack an 8B-parameter LLM to switch output language via a specific, orthogonal latent circuit.
Principles
- Backdoor triggers can operate in latent subspaces orthogonal to the model's natural language-identity direction.
- The trigger signal propagates through a serial bottleneck at a single position.
Method
The identified circuit has three phases: early-layer attention heads compose the trigger into the final sequence position, the signal propagates through the middle layers in an orthogonal latent subspace, and the final-layer MLP converts it into target-language logits.
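The three phases can be mimicked with a deliberately tiny toy model. Everything in this sketch is an assumption made for illustration: the three-slot residual layout (`LANG`, `LATENT`, `MARK`), the score and gain constants, and the identity mid-layers are invented here and do not describe the actual 8B-parameter model.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical residual layout: [language_identity, latent, trigger_marker]
LANG, LATENT, MARK = 0, 1, 2

def phase1_attention(tokens):
    """The final position attends to marked tokens and writes into LATENT."""
    scores = [8.0 * t[MARK] for t in tokens]   # assumed query-key scores
    weights = softmax(scores)
    attended = sum(w * t[MARK] for w, t in zip(weights, tokens))
    final = list(tokens[-1])                   # copy, so inputs stay untouched
    final[LATENT] += attended                  # LANG slot is left unchanged
    return final

def phase2_mid_layers(state):
    """Mid-layers modelled as the identity: the residual carries the signal."""
    return list(state)

def phase3_final_mlp(state):
    """Thresholded readout that turns the latent signal into logits."""
    gate = max(0.0, state[LATENT] - 0.5)       # ReLU with an assumed threshold
    english_logit = 2.0 + state[LANG]
    french_logit = 10.0 * gate
    return english_logit, french_logit

def run(tokens):
    return phase3_final_mlp(phase2_mid_layers(phase1_attention(tokens)))

plain = [[1.0, 0.0, 0.0] for _ in range(4)]
triggered = [[1.0, 0.0, 1.0] for _ in range(3)] + [[1.0, 0.0, 0.0]]

for name, tokens in (("plain", plain), ("triggered", triggered)):
    english, french = run(tokens)
    print(name, "-> French" if french > english else "-> English")
```

The point of the toy is structural: the `LANG` slot at the final position is identical in both runs, yet the output language flips, because the decision travels through the separate `LATENT` slot until the last step.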
In practice
- Corrupting the specific bottleneck position mitigates the trigger, but can also degrade model capabilities.
- Defenses must consider orthogonal latent signals.
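The trade-off in the first point can be sketched with a toy intervention. The summary does not say how the position is corrupted, so this example assumes the simplest option, scaling one position's vector toward zero. The `hidden` states, `trigger_dir`, and `task_dir` are hypothetical.

```python
def corrupt_position(hidden, position, strength):
    """Scale one position's vector toward zero; strength=1.0 zero-ablates it."""
    out = [list(row) for row in hidden]        # copy, leaving the input intact
    out[position] = [(1.0 - strength) * x for x in out[position]]
    return out

def readout(vec, direction):
    """Linear readout of a vector along a given direction."""
    return sum(a * b for a, b in zip(vec, direction))

# Hypothetical directions in a 3-dimensional residual stream.
trigger_dir = [0.0, 1.0, 0.0]   # assumed latent trigger direction
task_dir = [0.0, 0.0, 1.0]      # assumed direction carrying useful task signal

hidden = [
    [0.5, 0.0, 0.2],
    [0.4, 0.0, 0.1],
    [0.3, 2.0, 1.5],  # final position: holds both trigger and task information
]

for strength in (0.0, 0.5, 1.0):
    corrupted = corrupt_position(hidden, position=-1, strength=strength)
    print(
        f"strength={strength}",
        f"trigger={readout(corrupted[-1], trigger_dir)}",
        f"task={readout(corrupted[-1], task_dir)}",
    )
```

Because both signals sit at the same position, an intervention that cannot tell them apart removes them together. A more selective defense would need to target only the trigger direction, which presupposes knowing it.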
Topics
- Backdoor Attacks
- Language Models
- Language Switching
- Latent Space
- Attention Heads
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.