Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

2026-04-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

A new streaming probing objective improves detection of harmful intent in Large Language Models (LLMs) under adaptive jailbreaking, with a focus on high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Existing techniques generate false alarms because they rely on isolated high-scoring tokens, especially when sensitive CBRN terms appear in benign contexts. The new objective instead requires multiple evidence tokens to consistently support a prediction, aggregating their signals for more robust detection. At a 1% false-positive rate, it improves the true-positive rate by 35.55% relative to strong streaming baselines and shows substantial gains in AUROC, even from a near-saturated baseline of 97.40%. The research also indicates that probing Attention or MLP activations is more effective than probing residual-stream features, and that probes for base LLMs can detect harmful intent in character-level ciphers from adversarial fine-tuning, achieving an AUROC over 98.85%.
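The operating point quoted above, true-positive rate at a fixed 1% false-positive rate, can be made concrete with a short sketch. This is an illustration of the metric only, not the paper's evaluation code: the function name `tpr_at_fpr`, the tie-handling rule, and the toy scores are all assumptions made for this example.

```python
from typing import Sequence


def tpr_at_fpr(
    benign_scores: Sequence[float],
    harmful_scores: Sequence[float],
    max_fpr: float = 0.01,
) -> float:
    """Fraction of harmful prompts flagged when at most `max_fpr`
    of benign prompts are allowed to be flagged (hypothetical helper)."""
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign prompts we may flag without exceeding the budget.
    allowed = int(max_fpr * len(ranked) + 1e-9)
    # Flag only scores strictly above the (allowed + 1)-th highest benign
    # score, so ties can never push the false-positive rate over budget.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(score > threshold for score in harmful_scores)
    return flagged / len(harmful_scores)


# Toy data (made up): 100 benign scores and 5 harmful scores.
benign = [i / 1000 for i in range(100)]
harmful = [0.05, 0.09, 0.2, 0.5, 0.9]
tpr = tpr_at_fpr(benign, harmful, max_fpr=0.01)
print(f"TPR at 1% FPR: {tpr:.2f}")
```

Fixing the false-positive budget first and then reading off the true-positive rate is what makes low-FPR comparisons meaningful when AUROC is already near saturation.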

Key takeaway

For research scientists developing LLM safety mechanisms, this work highlights the need to move beyond single-token anomaly detection. Prioritize methods that aggregate consistent evidence across multiple tokens to reduce false positives, especially in sensitive domains like CBRN. Consider probing Attention or MLP activations rather than residual-stream features, and note that probes for base LLMs can still detect harmful intent hidden in character-level ciphers from adversarial fine-tuning.

Key insights

Robust harmful intent detection in LLMs requires aggregated evidence, not isolated token spikes, to prevent false positives.

Principles

Multiple evidence tokens enhance detection.
Attention/MLP activations are more effective probe targets than residual-stream features.
Base LLM probes generalize to obfuscated attacks.

Method

A streaming probing objective aggregates multiple consistent evidence tokens to support a prediction, moving beyond single-token cues for robust harmful intent detection in LLMs.
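The difference between single-token cues and aggregated evidence can be sketched as follows. This is a minimal illustration, assuming per-token probe scores in [0, 1] and a sliding-window mean as the aggregation rule. The paper's actual objective is not specified in this summary, so the function names, the window size, and the threshold are hypothetical choices.

```python
from collections import deque
from typing import Iterable, Iterator


def max_token_score(token_scores: Iterable[float]) -> float:
    """Single-token rule: the sequence score is the highest token score."""
    return max(token_scores)


def windowed_scores(token_scores: Iterable[float], window: int = 4) -> Iterator[float]:
    """Streaming rule: after each token, emit the mean of the last `window` scores.

    Dividing by `window` even before the buffer is full treats missing
    tokens as zero evidence, so one early spike cannot trigger on its own.
    """
    recent = deque(maxlen=window)
    for score in token_scores:
        recent.append(score)
        yield sum(recent) / window


def segment_score(token_scores: Iterable[float], window: int = 4) -> float:
    """Sequence score under the windowed rule: the best-supported segment."""
    return max(windowed_scores(token_scores, window))


THRESHOLD = 0.5  # hypothetical decision threshold

# Made-up probe scores: one isolated spike versus a sustained run.
benign_spike = [0.05, 0.04, 0.95, 0.06, 0.05, 0.03]
harmful_run = [0.10, 0.70, 0.80, 0.75, 0.85, 0.20]

for name, scores in [("benign_spike", benign_spike), ("harmful_run", harmful_run)]:
    single = max_token_score(scores) > THRESHOLD
    segment = segment_score(scores) > THRESHOLD
    print(f"{name}: single-token flag={single}, segment flag={segment}")
# The single-token rule flags both sequences; the windowed rule flags
# only the sequence with sustained evidence.
```

Because `windowed_scores` is a generator over a fixed-size buffer, it can run token by token during generation, which matches the streaming setting described above.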

In practice

Implement multi-token evidence for intent detection.
Focus probing on Attention or MLP activations.
Apply probes for base LLMs to detect harmful intent in character-level ciphers.

Topics

LLM Jailbreaking
CBRN Domains
Segment-Level Coherence
Streaming Probing
Attention/MLP Activations

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.