From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, long

Summary

Research on Gemma-2-2B suggests that jailbreak vulnerabilities are localized to feature subgroups in its mid-to-late layers (16-25), so targeted feature-level interventions may offer a more principled path to adversarial robustness than current prompt-level defenses. The study ran a three-stage pipeline on Gemma-2-2B with the BeaverTails dataset: (1) extract concept-aligned tokens from adversarial responses; (2) group Sparse Autoencoder (SAE) features across the model's 26 layers using cluster, hierarchical-linkage, and single-token-driven strategies; (3) steer the model by amplifying the top features. All three grouping strategies agreed that features in layers 16-25 were the most vulnerable to steering: amplifying them increased harmfulness scores. Grok-4-1-fast-non-reasoning was used for negative sentiment extraction and as the LLM judge.

Key takeaway

For AI safety engineers developing robust LLMs, this research indicates that prompt-level defenses alone are unlikely to be sufficient. Investigate and intervene at the feature level, particularly within layers 16-25 of models like Gemma-2-2B, where the study found jailbreak vulnerabilities to be localized. Consider targeted feature steering or suppression to improve adversarial robustness, rather than relying on heuristic approaches.
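To make "feature suppression" concrete, here is a minimal NumPy sketch of one way it could be done: subtract the selected features' decoder contributions from a layer's hidden states. This is an illustrative assumption, not the paper's procedure. The function name `suppress_features`, the array shapes, and the toy data are all hypothetical.

```python
import numpy as np

def suppress_features(hidden, feature_acts, decoder, feature_ids):
    """Remove the selected features' contributions from the hidden states.

    hidden:       (n_tokens, d_model) activations at one layer.
    feature_acts: (n_tokens, n_features) SAE activations for those positions.
    decoder:      (n_features, d_model) SAE decoder; row i is feature i's direction.
    All names and shapes here are hypothetical, chosen for illustration.
    """
    # Each selected feature contributes activation * decoder direction.
    contribution = feature_acts[:, feature_ids] @ decoder[feature_ids]
    return hidden - contribution

# Toy data: hidden states that the SAE reconstructs exactly (an assumption).
rng = np.random.default_rng(0)
decoder = rng.standard_normal((5, 3))
feature_acts = np.abs(rng.standard_normal((2, 5)))
hidden = feature_acts @ decoder

cleaned = suppress_features(hidden, feature_acts, decoder, [1, 4])
```

In a real model the SAE reconstruction is only approximate, so the subtraction removes the features' estimated contribution rather than an exact one.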

Key insights

In Gemma-2-2B, jailbreak vulnerabilities are localized to feature subgroups in the mid-to-late layers, which makes targeted interventions possible.

Principles

LLM safety alignment can be bypassed by amplifying vulnerable internal features.
Mechanistic interpretability can localize these vulnerabilities to specific SAE feature subgroups.
In Gemma-2-2B, mid-to-late layers (16-25) are more susceptible to adversarial steering.

Method

A three-stage pipeline: extract concept-aligned tokens, group SAE features using cluster, hierarchical-linkage, or single-token-driven strategies, then steer the model by amplifying top features.
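The selection and steering stages can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: features are ranked by mean activation over concept-aligned token positions (a stand-in for the cluster, hierarchical-linkage, and single-token-driven groupings), and steering adds a scaled sum of decoder directions to the hidden states. The function names, the scale `alpha`, and all array shapes are hypothetical.

```python
import numpy as np

def top_features(feature_acts, token_mask, k):
    """Rank SAE features by mean activation over concept-aligned positions.

    feature_acts: (n_tokens, n_features) SAE activations for one layer.
    token_mask:   (n_tokens,) boolean, True where the token is concept-aligned.
    Hypothetical ranking rule; the paper's grouping strategies are richer.
    """
    mean_act = feature_acts[token_mask].mean(axis=0)
    return np.argsort(mean_act)[::-1][:k]

def steer(hidden, decoder, feature_ids, alpha):
    """Add alpha times the summed decoder directions to every position.

    hidden:  (n_tokens, d_model) activations at one layer.
    decoder: (n_features, d_model) SAE decoder; rows are feature directions.
    """
    direction = decoder[feature_ids].sum(axis=0)
    return hidden + alpha * direction

# Toy data only; no real model or dataset is involved.
rng = np.random.default_rng(0)
n_tokens, n_features, d_model = 6, 8, 4
feature_acts = rng.random((n_tokens, n_features))
feature_acts[:, 3] += 5.0  # make feature 3 clearly dominant in the toy data
token_mask = np.array([True, False, True, True, False, False])
decoder = rng.standard_normal((n_features, d_model))
hidden = rng.standard_normal((n_tokens, d_model))

ids = top_features(feature_acts, token_mask, k=2)
steered = steer(hidden, decoder, ids, alpha=4.0)
```

In an actual experiment, `hidden` would come from a forward hook on one of the model's layers, and the steered output would be scored by a judge model.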

In practice

Analyze layers 16-25 for adversarial feature subgroups.
Use SAEs to decompose latent representations into interpretable features.
Focus on single-token-driven steering for stronger signals.
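As an illustration of the SAE decomposition step, the sketch below implements a plain ReLU sparse autoencoder with random, untrained weights. It is an assumption for illustration: the SAEs used in the study may use a different architecture or activation function, and the class name `ToySAE` and its dimensions are hypothetical.

```python
import numpy as np

class ToySAE:
    """Minimal ReLU sparse autoencoder with random (untrained) weights."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.standard_normal((d_model, n_features)) * 0.1
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.standard_normal((n_features, d_model)) * 0.1
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # Non-negative feature activations for each input vector.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        # Reconstruct the input as a weighted sum of decoder directions.
        return f @ self.W_dec + self.b_dec

sae = ToySAE(d_model=4, n_features=16)
x = np.random.default_rng(1).standard_normal((3, 4))  # stand-in hidden states
features = sae.encode(x)
recon = sae.decode(features)
```

With a trained SAE, most entries of `features` would be zero for any given token, which is what makes individual features candidates for interpretation and steering.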

Topics

Mechanistic Interpretability
LLM Safety
Adversarial Attacks
Sparse Autoencoders
Feature Steering
Gemma-2-2B

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.