Language-Switching Triggers Take a Latent Detour Through Language Models

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy) · Depth: Expert, quick

Summary

Researchers have identified the circuit that implements a language-switching backdoor in an 8-billion-parameter autoregressive language model. The backdoor uses a three-word Latin trigger (nine tokens) to redirect the model's output from English to French. The circuit operates in three phases. First, early-layer attention heads compose the trigger into the final sequence position. Second, the resulting signal propagates through the middle layers in a latent subspace that is orthogonal to the model's natural language-identity direction. Third, a multi-layer perceptron (MLP) at the last layer converts the latent signal into French logits. A critical finding is that the entire circuit passes through a serial bottleneck at a single position: corrupting that position can mitigate the trigger, but it also degrades the model's capabilities.
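
The bottleneck trade-off can be illustrated with a toy sketch. This is not the paper's procedure: the array sizes, the `trigger_dir` vector, the signal strength, and the mean-ablation choice are all assumptions made for illustration, and the random array merely stands in for real hidden states. It shows that overwriting the single position carrying the trigger removes the trigger signal and, at the same time, discards whatever else that position encoded.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model = 12, 32  # hypothetical toy sizes, not the model's real dimensions

# Hypothetical unit vector standing in for the latent trigger direction.
trigger_dir = rng.normal(size=d_model)
trigger_dir /= np.linalg.norm(trigger_dir)

# Toy hidden states: random "benign" content at every position, plus a
# strong trigger signal composed into the final sequence position.
hidden = rng.normal(size=(seq_len, d_model))
hidden[-1] += 10.0 * trigger_dir

def ablate_position(states, pos, replacement):
    """Return a copy of `states` with the vector at `pos` overwritten."""
    out = states.copy()
    out[pos] = replacement
    return out

# Mean ablation: replace the bottleneck position with the average of the others.
mean_state = hidden[:-1].mean(axis=0)
ablated = ablate_position(hidden, -1, mean_state)

trigger_before = float(hidden[-1] @ trigger_dir)
trigger_after = float(ablated[-1] @ trigger_dir)

# Collateral change: how much of the position's non-trigger content was altered.
delta = ablated[-1] - hidden[-1]
benign_delta = delta - (delta @ trigger_dir) * trigger_dir
benign_change = float(np.linalg.norm(benign_delta))

print(f"trigger readout before: {trigger_before:.2f}, after: {trigger_after:.2f}")
print(f"collateral change to non-trigger content: {benign_change:.2f}")
```

In this toy setting the trigger readout shrinks sharply after ablation while the collateral change is nonzero, which mirrors the trade-off described above: the intervention that removes the trigger also disturbs unrelated information stored at the same position.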

Key takeaway

Research scientists and security engineers who develop or deploy large language models should understand this backdoor circuit. Defenses that rely on detecting language-like signals in intermediate representations may miss the trigger entirely, because its latent signal is orthogonal to the model's language-identity direction. Investigate methods that detect and neutralize the single-position serial bottleneck without compromising overall model performance.
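
A minimal sketch of why a probe along the language-identity direction can miss an orthogonal trigger. Everything here is a toy assumption: `lang_dir`, `trigger_dir`, the hidden size, and the signal strength are hypothetical stand-ins, not values or code from the paper. The point is purely geometric: adding a signal along a direction orthogonal to the probe leaves the probe's score unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size for the toy example

# Hypothetical unit vector standing in for the language-identity direction.
lang_dir = rng.normal(size=d_model)
lang_dir /= np.linalg.norm(lang_dir)

# Hypothetical trigger direction, made orthogonal to lang_dir
# by removing its component along lang_dir (one Gram-Schmidt step).
raw = rng.normal(size=d_model)
trigger_dir = raw - (raw @ lang_dir) * lang_dir
trigger_dir /= np.linalg.norm(trigger_dir)

def language_probe_score(hidden_state):
    """Projection of a hidden state onto the language-identity direction."""
    return float(hidden_state @ lang_dir)

clean = rng.normal(size=d_model)        # toy stand-in for a clean hidden state
triggered = clean + 5.0 * trigger_dir   # inject a strong orthogonal signal

# The language probe sees essentially no change...
score_shift = language_probe_score(triggered) - language_probe_score(clean)
# ...while a readout along the trigger direction recovers the full signal.
trigger_readout = float((triggered - clean) @ trigger_dir)

print(f"shift in language-probe score: {score_shift:.2e}")
print(f"readout along trigger direction: {trigger_readout:.2f}")
```

A detector would need to look in the trigger's own subspace, which a defender does not know in advance; that is the practical difficulty the takeaway points to.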

Key insights

A three-word Latin trigger can hijack an 8B-parameter LLM into switching its output from English to French through a specific latent circuit that is orthogonal to the model's language-identity direction.

Method

The identified circuit has three phases: early-layer attention heads compose the trigger into the final sequence position, the signal propagates through the middle layers in an orthogonal latent subspace, and a last-layer MLP converts it into target-language logits.

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.