Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
Summary
Temporal Logit Observability (TLO) is a training-free diagnostic designed to reveal how LLM safety failures unfold, moving beyond the single yes/no label provided by Attack Success Rate (ASR). TLO tracks a compliance-refusal margin during decoding and maps each model-attack condition onto a calibrated 2D plane. It is most informative where ASR offers little insight: distinguishing attacks that succeed for different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR values show clearly different temporal patterns under TLO. The geometry derived from TLO largely matches refusal-direction probes computed from hidden states, although on one model the comparison exposed a limitation of TLO's fixed-lexicon approach. A simple TLO-based early-stop rule cuts successful jailbreaks by over 50% without generating false alarms on benign queries. The authors conclude that safety evaluations should report when and how failures unfold, not only whether they occur.
Key takeaway
For AI Security Engineers evaluating LLM safety, Attack Success Rate (ASR) alone is insufficient: it records whether a jailbreak succeeded, not how. Adding Temporal Logit Observability (TLO) to an evaluation exposes the temporal dynamics of each failure and supports more targeted defenses. One example is a TLO-derived early-stop rule, which in the reported experiments cut successful jailbreaks by over 50% without generating false alarms on benign queries.
Key insights
Temporal Logit Observability (TLO) reveals the temporal dynamics of LLM safety failures, going beyond simple attack success rates.
Principles
- Attack success rate alone is insufficient.
- Failure paths are observable from logits.
- Different attacks have distinct temporal patterns.
Method
TLO observes a compliance-refusal margin during decoding, mapping model-attack conditions onto a calibrated 2D plane to reveal temporal failure patterns.
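As a rough illustration of what observing a compliance-refusal margin during decoding could look like, the sketch below computes one margin value per decoding step from raw next-token logits. The summary does not give the exact definition, so everything specific here is an assumption: the margin is taken as the log-sum-exp of logits over a fixed set of compliance tokens minus the same quantity over a fixed set of refusal tokens, and the names `compliance_ids`, `refusal_ids`, and `margin_trajectory` are hypothetical. The calibration step and the 2D mapping are not shown.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def compliance_refusal_margin(logits, compliance_ids, refusal_ids):
    """Margin for one decoding step (assumed definition, not the paper's).

    logits: list of next-token logits indexed by token id.
    compliance_ids / refusal_ids: hypothetical fixed lexicons of token ids
    that typically open a compliant or a refusing response.
    Positive values mean the step leans toward compliance.
    """
    comply = logsumexp([logits[i] for i in compliance_ids])
    refuse = logsumexp([logits[i] for i in refusal_ids])
    return comply - refuse


def margin_trajectory(step_logits, compliance_ids, refusal_ids):
    """One margin per decoding step, in generation order."""
    return [
        compliance_refusal_margin(logits, compliance_ids, refusal_ids)
        for logits in step_logits
    ]


# Toy vocabulary of four tokens: id 1 is a compliance token, id 3 a refusal token.
toy_steps = [
    [0.0, 1.0, 0.0, 3.0],  # step 0: refusal token favoured
    [0.0, 2.0, 0.0, 2.0],  # step 1: balanced
    [0.0, 4.0, 0.0, 1.0],  # step 2: compliance token favoured
]
trajectory = margin_trajectory(toy_steps, compliance_ids=[1], refusal_ids=[3])
```

Because the lexicons are fixed lists of token ids, a model that signals refusal with tokens outside the list would be measured poorly, which is one way a fixed-lexicon limitation like the one noted in the summary could arise.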
In practice
- Implement TLO for deeper jailbreak analysis.
- Derive early-stop rules from TLO for safety.
- Improve LLM safety evaluation beyond ASR.
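The early-stop idea in the list above can be sketched as a check over a sequence of per-step compliance-refusal margins. The summary does not state the rule's form, its threshold, or how it avoids flagging benign queries, so this is only one plausible shape: halt generation once the margin stays above a threshold for several consecutive steps. The function name `early_stop_step` and the parameters `threshold` and `patience` are hypothetical, and any real values would need calibration against both attack and benign traffic.

```python
def early_stop_step(margins, threshold=0.0, patience=3):
    """Return the step index at which decoding would be halted, or None.

    margins: per-step compliance-refusal margins (floats), in decoding order.
    threshold: hypothetical margin level treated as a sustained drift toward
        compliance.
    patience: hypothetical number of consecutive steps above the threshold
        required before stopping.
    """
    run = 0  # length of the current streak of steps above the threshold
    for step, margin in enumerate(margins):
        run = run + 1 if margin > threshold else 0
        if run >= patience:
            return step
    return None


# A trajectory that drifts into sustained compliance triggers a stop at step 3.
drifting = [-1.0, 0.5, 0.6, 0.7, 0.2]
stop_at = early_stop_step(drifting)

# A trajectory with only a brief spike above the threshold is left alone.
brief_spike = [-1.0, -2.0, 0.5, -1.0]
no_stop = early_stop_step(brief_spike)
```

Requiring a streak rather than a single crossing is a design choice made here to reduce sensitivity to one-step noise; whether the reported rule does the same is not stated in the summary.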
Topics
- LLM Safety
- Jailbreak Detection
- Temporal Logit Observability
- Attack Success Rate
- Decoding Strategies
- AI Security
Best for: MLOps Engineer, Research Scientist, CTO, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.