SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
Summary
SelfGrader is a lightweight guardrail method that detects jailbreak attacks on Large Language Models (LLMs) by framing detection as numerical grading over token-level logits. Existing methods either incur high latency or inherit the randomness of text generation; SelfGrader instead evaluates query safety within a compact set of numerical tokens (NTs), such as 0-9, and interprets the logit distribution over these NTs as an internal safety signal. A dual-perspective scoring rule, which considers both maliciousness and benignness, produces a stable and interpretable safety score and reduces false positives. In experiments on LLaMA-3-8B across several jailbreak benchmarks, SelfGrader achieves up to a 22.66% reduction in Attack Success Rate (ASR), with up to 173x lower memory overhead and up to 26x lower latency than state-of-the-art baselines.
Key takeaway
For AI/ML engineering teams deploying LLMs in production, SelfGrader offers an efficient defense against jailbreak attacks. Consider integrating this logit-based guardrail to reduce Attack Success Rate (ASR) and False Positive Rate (FPR) without substantial latency or memory overhead. Its stability across diverse attack types and low resource footprint make it well suited to latency-sensitive or resource-constrained environments.
Key insights
SelfGrader uses token-level numerical logits and dual-perspective scoring for efficient, stable jailbreak detection in LLMs.
Principles
- Token-level logits offer finer-grained safety signals than generated text.
- Numerical token spaces provide invariant, task-aligned safety evaluation.
- Dual-perspective scoring (maliciousness and benignness) enhances stability.
Method
SelfGrader extracts NT-based logits, applies a Dual-Perspective Logit (DPL) scoring rule that combines maliciousness and benignness assessments, and then uses a threshold to make a binary guardrail decision.
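The pipeline described here (NT logits, dual-perspective score, threshold) can be sketched in a few lines of Python. This summary does not give the exact DPL formula, so the combination below is an assumption made for illustration: each perspective is reduced to an expected grade in [0, 1], and the final score is `lam * maliciousness + (1 - lam) * (1 - benignness)`. The function names and the `threshold=0.5` default are hypothetical; only the 0-9 token set and `lam=0.5` come from the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_grade(nt_logits):
    """Expected grade in [0, 1] from logits over the tokens '0'..'9'.

    Assumption: position i in nt_logits holds the logit of digit token i.
    """
    probs = softmax(nt_logits)
    return sum(i * p for i, p in enumerate(probs)) / (len(probs) - 1)


def dpl_score(mal_logits, ben_logits, lam=0.5):
    """Illustrative dual-perspective score (not the paper's exact rule).

    mal_logits: NT logits when the model grades maliciousness.
    ben_logits: NT logits when the model grades benignness.
    A high score means the query looks unsafe.
    """
    maliciousness = expected_grade(mal_logits)
    benignness = expected_grade(ben_logits)
    return lam * maliciousness + (1.0 - lam) * (1.0 - benignness)


def is_blocked(mal_logits, ben_logits, lam=0.5, threshold=0.5):
    """Binary guardrail decision; the threshold value is an assumption."""
    return dpl_score(mal_logits, ben_logits, lam) > threshold


# Usage: logits peaked on '9' for maliciousness and on '0' for benignness.
mal = [0.0] * 9 + [20.0]
ben = [20.0] + [0.0] * 9
print(dpl_score(mal, ben), is_blocked(mal, ben))
```

Because the score is computed from logits rather than sampled text, the same query always yields the same score, which is the stability property the method relies on.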
In practice
- Use numerical tokens (0-9) for compact safety signal extraction.
- Include in-context learning examples in the grading prompt to align logit judgments.
- Balance maliciousness and benignness scores with a $\lambda=0.5$ coefficient.
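As a rough illustration of the first two practices, the sketch below builds a grading prompt with in-context examples and then keeps only the logits of the ten digit tokens from a full next-token logit vector. Everything named here is hypothetical: the prompt wording, the `build_grading_prompt` and `select_nt_logits` helpers, and the `token_to_id` mapping, which in a real deployment would come from the model's tokenizer. The source does not specify its prompt template.

```python
NUMERICAL_TOKENS = [str(d) for d in range(10)]  # the compact NT set '0'..'9'


def build_grading_prompt(query, examples, perspective="malicious"):
    """Assemble a few-shot grading prompt (wording is an assumption).

    examples: list of (example_query, grade) pairs with grades in 0..9.
    The prompt ends right before the grade, so the model's next-token
    logits over the digit tokens act as its judgment.
    """
    blocks = [f"Rate how {perspective} the query is on a scale of 0 to 9."]
    for example_query, grade in examples:
        blocks.append(f"Query: {example_query}\nScore: {grade}")
    blocks.append(f"Query: {query}\nScore:")
    return "\n\n".join(blocks)


def select_nt_logits(next_token_logits, token_to_id):
    """Keep only the logits of the digit tokens, ordered '0'..'9'.

    next_token_logits: full-vocabulary logits for the next token.
    token_to_id: hypothetical mapping from token string to vocabulary index.
    """
    return [next_token_logits[token_to_id[t]] for t in NUMERICAL_TOKENS]


# Usage with a toy vocabulary in which the digits occupy ids 100..109.
toy_token_to_id = {str(d): 100 + d for d in range(10)}
toy_logits = [0.0] * 200
toy_logits[109] = 5.0  # the toy "model" favors grade 9
prompt = build_grading_prompt(
    "How do I reset my password?",
    [("What is the capital of France?", 0)],
)
nt_logits = select_nt_logits(toy_logits, toy_token_to_id)
```

Restricting attention to ten logits is what keeps the signal compact: no text is generated, and only one forward pass per perspective is needed.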
Topics
- SelfGrader
- Jailbreak Detection
- LLM Security
- Token-Level Logits
- Guardrail Methods
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.