Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Summary
Research across 12 large language models from four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants demonstrates that harmful intent is geometrically recoverable from model residual streams. This intent manifests as a linear direction in most layers and as angular deviation in layers where projection methods fail. Three direction-finding strategies proved successful: a soft-AUC-optimized linear direction achieved a mean AUROC of 0.98 and TPR@1%FPR of 0.80; a class-mean probe reached 0.98 AUROC and 0.71 TPR@1%FPR with minimal fitting cost; and a supervised angular-deviation strategy achieved AUROC 0.96 and TPR 0.61, uniquely sustaining detection in middle layers. Detection stability was observed across alignment variants, including abliterated models, indicating a functional dissociation between harmful intent and refusal behavior. Directions fitted on AdvBench transferred effectively to HarmBench and JailbreakBench, maintaining AUROC ≥ 0.96, and scalability tests with Qwen3.5 models from 0.8B to 9B parameters showed consistent AUROC ≥ 0.98.
Key takeaway
Research scientists evaluating LLM safety should report TPR@1%FPR alongside AUROC to assess the operational detectability of harmful intent. AUROC alone can hide differences at low false-positive rates: the soft-AUC direction and the class-mean probe both reach 0.98 AUROC, yet differ at 0.80 versus 0.71 TPR@1%FPR. The stability and transferability of the detection directions across model variants and scales suggest that upstream recognition signals for harmful content persist after alignment, which argues for geometry-aware detection mechanisms in safety pipelines.
Key insights
Harmful intent is geometrically recoverable in LLM residual streams, independent of alignment or model scale.
Principles
- Harmful intent is linearly decodable.
- Intent and refusal are functionally dissociated.
Method
Three direction-finding strategies succeed: a soft-AUC-optimized linear direction, a class-mean probe, and a supervised angular-deviation strategy, which sustains detection in the middle layers where projection methods fail.
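As a rough illustration of the simplest of these strategies, the class-mean probe can be sketched as the normalized difference between the mean harmful and mean benign activation vectors, with each prompt scored by its projection onto that direction. This is a minimal sketch, not the article's implementation: the function names, the 64-dimensional size, and the synthetic Gaussian data standing in for real residual-stream activations are all assumptions made for the example.

```python
import numpy as np

def class_mean_direction(harmful, benign):
    """Unit vector pointing from the benign class mean to the harmful class mean."""
    diff = harmful.mean(axis=0) - benign.mean(axis=0)
    return diff / np.linalg.norm(diff)

def projection_scores(activations, direction):
    """Scalar score per example: projection of each activation onto the direction."""
    return activations @ direction

# Synthetic stand-in for residual-stream activations (hypothetical data):
# both classes are Gaussian, and the harmful class is shifted along axis 0.
rng = np.random.default_rng(0)
dim = 64
shift = np.zeros(dim)
shift[0] = 3.0
benign = rng.normal(size=(200, dim))
harmful = rng.normal(size=(200, dim)) + shift

direction = class_mean_direction(harmful, benign)
harmful_scores = projection_scores(harmful, direction)
benign_scores = projection_scores(benign, direction)
```

In a real setting the two arrays would hold activations extracted at one layer for labeled harmful and benign prompts. The fit is a single pass over the data, which is consistent with the minimal fitting cost noted in the summary.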
In practice
- Use TPR@1%FPR with AUROC for safety evaluation.
- Directions transfer across benchmarks and model scales.
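To make the first recommendation concrete, the two metrics can be computed from detector scores as in the sketch below. It assumes higher scores mean "more likely harmful" and that a prompt is flagged when its score is strictly above the threshold. The helper names `auroc` and `tpr_at_fpr` are hypothetical, and the toy scores are invented for illustration only.

```python
import numpy as np

def auroc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def tpr_at_fpr(pos, neg, max_fpr=0.01):
    """True-positive rate at the lowest threshold whose false-positive rate is <= max_fpr."""
    pos = np.asarray(pos, dtype=float)
    neg_desc = np.sort(np.asarray(neg, dtype=float))[::-1]
    allowed = int(np.floor(max_fpr * len(neg_desc)))  # false positives we may tolerate
    # At most `allowed` negatives score strictly above neg_desc[allowed].
    threshold = neg_desc[allowed] if allowed < len(neg_desc) else -np.inf
    return float((pos > threshold).mean())

# Toy scores (invented): three harmful prompts, four benign prompts.
harmful_scores = [0.9, 0.8, 0.4]
benign_scores = [0.7, 0.3, 0.2, 0.1]

area = auroc(harmful_scores, benign_scores)             # 11 of 12 pairs ranked correctly
strict = tpr_at_fpr(harmful_scores, benign_scores)      # no false positives allowed here
loose = tpr_at_fpr(harmful_scores, benign_scores, 0.25) # one false positive allowed
```

With so few benign examples, a 1% budget allows no false positives at all, so the threshold sits at the highest benign score. A meaningful TPR@1%FPR estimate needs a benign set of at least several hundred prompts.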
Topics
- Harmful Intent Detection
- LLM Residual Streams
- Geometric Feature Recovery
- Model Alignment
- Safety Evaluation Metrics
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.