Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

2026-05-28 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Temporal Logit Observability (TLO) is a training-free diagnostic that exposes how LLM safety failures unfold, rather than reducing each attack to the single yes/no label behind Attack Success Rate (ASR). TLO tracks a compliance-refusal margin during decoding and maps each model-attack condition onto a calibrated 2D plane. It is most informative where ASR offers little insight: separating attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR values show clearly different temporal patterns. The geometry derived from TLO largely matches refusal-direction probes computed from hidden states, though one model exposed a limitation of TLO's fixed-lexicon approach. A simple early-stop rule based on TLO cuts successful jailbreaks by more than 50% without raising false alarms on benign queries, which underscores the need to report when and how failures unfold, not only whether they occur.

Key takeaway

For AI Security Engineers evaluating LLM safety, Attack Success Rate (ASR) alone is not enough. Add Temporal Logit Observability (TLO) to your evaluation to see the temporal dynamics of jailbreaks: how failures unfold, not just whether they occur. That view supports more targeted defenses, such as TLO-derived early-stop rules, which in the reported experiments cut successful jailbreaks by more than 50% without raising false alarms on benign queries.

Key insights

Temporal Logit Observability (TLO) reveals the temporal dynamics of LLM safety failures, going beyond simple attack success rates.

Principles

Attack success rate alone is insufficient.
Failure paths are observable from logits.
Different attacks have distinct temporal patterns.

Method

TLO observes a compliance-refusal margin during decoding, mapping model-attack conditions onto a calibrated 2D plane to reveal temporal failure patterns.
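The summary does not give the exact margin definition or the calibration of the 2D plane, so the following is only a minimal sketch under stated assumptions. It assumes the margin at each decoding step is the log-sum-exp of the logits over a fixed compliance lexicon minus the same quantity over a fixed refusal lexicon. The function names, the lexicon token ids, and the two plane coordinates (early mean and late-minus-early drift) are all hypothetical placeholders, not the original article's method.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)


def margin_trajectory(logits, comply_ids, refuse_ids):
    """Per-step compliance-refusal margin (assumed definition).

    logits:     array of shape [T, V], one row of vocabulary logits per decoding step.
    comply_ids: token ids of a hypothetical compliance lexicon.
    refuse_ids: token ids of a hypothetical refusal lexicon.
    Returns an array of shape [T]; positive values lean toward compliance.
    """
    logits = np.asarray(logits, dtype=float)
    comply = logsumexp(logits[:, comply_ids], axis=-1)
    refuse = logsumexp(logits[:, refuse_ids], axis=-1)
    return comply - refuse


def plane_coordinates(margins, early_steps=4):
    """Collapse a margin trajectory into two numbers (placeholder axes).

    Assumption: x is the mean margin over the first `early_steps` steps,
    y is how far the later mean drifts from that early mean.
    """
    margins = np.asarray(margins, dtype=float)
    early = margins[:early_steps].mean()
    late = margins[early_steps:].mean() if margins.size > early_steps else early
    return float(early), float(late - early)
```

In use, `logits` would come from whatever decoding loop already exposes per-step scores, and each model-attack condition would contribute one point from `plane_coordinates`. A fixed lexicon is the simplest choice here, and it is also the kind of choice the summary flags as a limitation on one model.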

In practice

Implement TLO for deeper jailbreak analysis.
Derive early-stop rules from TLO for safety.
Improve LLM safety evaluation beyond ASR.
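The summary does not describe the early-stop rule itself, so the sketch below shows only one plausible shape for it: halt decoding once a TLO-derived per-step score stays above a threshold for several consecutive steps. The function name `early_stop_step`, the `threshold` and `patience` parameters, and the choice of score are assumptions for illustration and would need calibration against benign traffic.

```python
from typing import Optional, Sequence


def early_stop_step(
    scores: Sequence[float],
    threshold: float = 1.0,
    patience: int = 3,
) -> Optional[int]:
    """Return the step index at which decoding would be halted, or None.

    scores:    hypothetical per-step risk signal derived from the TLO margin.
    threshold: assumed cut-off; a score must strictly exceed it to count.
    patience:  number of consecutive over-threshold steps required to stop.
    """
    run = 0  # length of the current streak of over-threshold steps
    for t, score in enumerate(scores):
        run = run + 1 if score > threshold else 0
        if run >= patience:
            return t
    return None
```

Requiring a streak rather than a single spike is a design choice meant to keep false alarms low on benign queries. Whether it matches the rule behind the reported reduction of more than 50% is not something this summary establishes.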

Topics

LLM Safety
Jailbreak Detection
Temporal Logit Observability
Attack Success Rate
Decoding Strategies
AI Security

Best for: MLOps Engineer, Research Scientist, CTO, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.