Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

2026-04-20 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Research Methodology & Innovation · Depth: Expert, quick

Summary

A study of 12 large language models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants finds that harmful intent is geometrically recoverable from the residual stream. The signal appears as a linear direction in most layers and as angular deviation in layers where projection methods fail. Three direction-finding strategies succeeded: a soft-AUC-optimized linear direction reached a mean AUROC of 0.98 and a TPR@1%FPR of 0.80; a class-mean probe reached an AUROC of 0.98 and a TPR@1%FPR of 0.71 at minimal fitting cost; and a supervised angular-deviation strategy reached an AUROC of 0.96 and a TPR of 0.61 and was the only one to sustain detection in middle layers. Detection remained stable across alignment variants, including abliterated models, which indicates a functional dissociation between harmful intent and refusal behavior. Directions fitted on AdvBench transferred to HarmBench and JailbreakBench with AUROC ≥ 0.96, and scaling tests on Qwen3.5 models from 0.8B to 9B parameters showed consistent AUROC ≥ 0.98.
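To make the class-mean probe concrete, here is a minimal sketch under stated assumptions. It is not the paper's implementation: the function names, the 16-dimensional synthetic vectors, and the shift along the first coordinate are all hypothetical stand-ins for real residual-stream activations. The idea is simply to take the difference between the harmful and benign class means as a direction and score each activation by its projection onto that direction.

```python
import random


def class_mean_direction(harmful, benign):
    """Unit vector pointing from the benign class mean to the harmful class mean."""
    dim = len(harmful[0])
    mean_h = [sum(v[i] for v in harmful) / len(harmful) for i in range(dim)]
    mean_b = [sum(v[i] for v in benign) / len(benign) for i in range(dim)]
    diff = [h - b for h, b in zip(mean_h, mean_b)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def project(vec, direction):
    """Scalar score: projection of one activation vector onto the direction."""
    return sum(x * d for x, d in zip(vec, direction))


# Hypothetical synthetic data standing in for residual-stream activations.
# Harmful examples are shifted along coordinate 0; all else is unit Gaussian noise.
rng = random.Random(0)
DIM = 16


def sample(shift, n):
    return [
        [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(DIM)]
        for _ in range(n)
    ]


# "Fitting" is just two means and a subtraction, hence the low cost.
direction = class_mean_direction(sample(3.0, 200), sample(0.0, 200))

# Score held-out examples; higher scores should indicate harmful intent.
harmful_scores = [project(v, direction) for v in sample(3.0, 100)]
benign_scores = [project(v, direction) for v in sample(0.0, 100)]
```

In a real setting the vectors would be activations captured at a chosen layer and token position, and the scores would be thresholded or passed to an evaluation metric.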

Key takeaway

If you evaluate LLM safety, report TPR@1%FPR alongside AUROC to assess how detectable harmful intent is at an operational false-positive rate. The stability and transferability of these detection methods across model variants and scales suggest that upstream signals recognizing harmful content persist across alignment variants, so safety pipelines should include robust, geometry-aware detection mechanisms.

Key insights

Harmful intent is geometrically recoverable from LLM residual streams across the alignment variants and model scales tested.

Principles

Harmful intent is linearly decodable in most layers; angular deviation recovers it where projection fails.
Intent and refusal are functionally dissociated.

Method

Three strategies succeed: a soft-AUC-optimized linear direction, a class-mean probe, and a supervised angular-deviation strategy.
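As an illustration of the first strategy, the sketch below fits a unit direction by gradient ascent on a soft AUC, here assumed to be the mean of sigmoid((s_pos - s_neg) / tau) over all harmful/benign score pairs. The source does not give the exact objective or optimizer, so this surrogate, the hyperparameters (steps, lr, tau), and the synthetic 8-dimensional data are assumptions chosen for illustration only.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def soft_auc(w, pos, neg, tau=1.0):
    """Smooth AUC surrogate: mean sigmoid of pairwise score margins."""
    sp = [dot(w, x) for x in pos]
    sn = [dot(w, x) for x in neg]
    total = sum(1.0 / (1.0 + math.exp(-(p - n) / tau)) for p in sp for n in sn)
    return total / (len(sp) * len(sn))


def fit_soft_auc_direction(pos, neg, steps=100, lr=0.5, tau=1.0, seed=0):
    """Projected gradient ascent on soft_auc, keeping w at unit norm."""
    rng = random.Random(seed)
    dim = len(pos[0])
    w = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for _ in range(steps):
        sp = [dot(w, x) for x in pos]
        sn = [dot(w, x) for x in neg]
        grad = [0.0] * dim
        for xp, p in zip(pos, sp):
            for xn, n in zip(neg, sn):
                s = 1.0 / (1.0 + math.exp(-(p - n) / tau))
                coef = s * (1.0 - s) / tau  # derivative of the sigmoid margin term
                for i in range(dim):
                    grad[i] += coef * (xp[i] - xn[i])
        scale = lr / (len(pos) * len(neg))
        w = normalize([wi + scale * g for wi, g in zip(w, grad)])
    return w


# Hypothetical synthetic activations: harmful examples shifted along coordinate 0.
rng = random.Random(1)


def sample(shift, n, dim=8):
    return [
        [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


pos, neg = sample(3.0, 30), sample(0.0, 30)
w = fit_soft_auc_direction(pos, neg)
```

Unlike a class-mean probe, this approach optimizes a ranking objective directly, at the cost of an iterative fit over all pairs. On real activations one would typically use an autodiff framework and mini-batched pairs rather than this plain-Python loop.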

In practice

Use TPR@1%FPR with AUROC for safety evaluation.
Directions transfer across benchmarks and model scales.
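The two metrics recommended above can be computed from raw detector scores as follows. This is a small dependency-free sketch; the function names and the toy scores are illustrative, not taken from the source. The toy detector is constructed so that half of the harmful examples score just below the highest benign scores, which is the situation where AUROC and TPR@1%FPR diverge.

```python
def auroc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(pos, neg, max_fpr=0.01):
    """True-positive rate at the lowest threshold whose false-positive rate is <= max_fpr.

    An example is flagged when its score is strictly greater than the threshold.
    """
    neg_sorted = sorted(neg, reverse=True)
    # Number of benign examples we may flag; the epsilon guards against float rounding.
    allowed = int(max_fpr * len(neg) + 1e-9)
    threshold = neg_sorted[allowed]
    return sum(p > threshold for p in pos) / len(pos)


# Toy scores (hypothetical): 100 benign scores spread over 0.00..0.99,
# 50 harmful scores buried in the benign tail and 50 clearly separated.
neg = [i / 100 for i in range(100)]
pos = [0.975] * 50 + [2.0] * 50

overall = auroc(pos, neg)          # ranking quality averaged over all thresholds
operational = tpr_at_fpr(pos, neg)  # recall when only 1% of benign prompts may be flagged
```

On these toy scores the AUROC is 0.99 while the TPR@1%FPR is only 0.5, which is why a high AUROC by itself says little about how a detector behaves at a strict false-positive budget.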

Topics

Harmful Intent Detection
LLM Residual Streams
Geometric Feature Recovery
Model Alignment
Safety Evaluation Metrics

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.