SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

SelfGrader is a lightweight guardrail method that detects jailbreak attacks on Large Language Models (LLMs) by framing detection as numerical grading over token-level logits. Existing guardrails either incur high latency or inherit the randomness of text generation. SelfGrader instead evaluates query safety within a compact set of numerical tokens (NTs), such as 0-9, and interprets the logit distribution over these NTs as an internal safety signal. A dual-perspective scoring rule, which considers both maliciousness and benignness, turns that signal into a stable and interpretable safety score and reduces false positives. In experiments on LLaMA-3-8B across several jailbreak benchmarks, SelfGrader achieves up to a 22.66% reduction in Attack Success Rate (ASR), with up to 173x lower memory overhead and up to 26x lower latency than state-of-the-art baselines.

Key takeaway

For AI/ML engineering teams deploying LLMs in production, SelfGrader offers an efficient defense against jailbreak attacks. Consider integrating this logit-based guardrail to reduce Attack Success Rate (ASR) and False Positive Rate (FPR) without substantial latency or memory overhead. Its stability across diverse attack types and its low resource footprint make it well suited to latency-sensitive or resource-constrained environments.

Key insights

SelfGrader uses token-level numerical logits and dual-perspective scoring for efficient, stable jailbreak detection in LLMs.

Principles

Token-level logits offer finer-grained safety signals than generated text.
Numerical token spaces provide invariant, task-aligned safety evaluation.
Dual-perspective scoring (maliciousness and benignness) enhances stability.

Method

SelfGrader extracts NT-based logits, applies a Dual-Perspective Logit (DPL) scoring rule that combines maliciousness and benignness assessments, and then uses a threshold to make a binary guardrail decision.
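The pipeline described above can be sketched in a few lines of Python. This is an illustrative sketch only: the summary does not give the exact DPL formula, so the combination rule here (a λ-weighted mix of the expected maliciousness grade and the inverted expected benignness grade), the function names, and the 0.5 threshold are all assumptions rather than the authors' implementation.

```python
import math

# Assumed NT set: the ten digit tokens "0".."9", in grade order.
NUMERICAL_TOKENS = [str(d) for d in range(10)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_grade(nt_logits):
    """Expected digit under the NT distribution, rescaled to [0, 1]."""
    probs = softmax(nt_logits)
    return sum(d * p for d, p in enumerate(probs)) / 9.0


def dpl_score(mal_logits, ben_logits, lam=0.5):
    """Hypothetical dual-perspective score (assumed form, not from the paper).

    mal_logits: NT logits when the model grades maliciousness.
    ben_logits: NT logits when the model grades benignness.
    A high benignness grade lowers the score; lam weights the two views.
    """
    mal = expected_grade(mal_logits)
    ben = expected_grade(ben_logits)
    return lam * mal + (1.0 - lam) * (1.0 - ben)


def is_jailbreak(mal_logits, ben_logits, lam=0.5, threshold=0.5):
    """Binary guardrail decision: flag the query when the score reaches the threshold."""
    return dpl_score(mal_logits, ben_logits, lam) >= threshold
```

With uniform logits from both perspectives the score sits at about 0.5, and it moves toward 1 as the maliciousness distribution concentrates on high digits while the benignness distribution concentrates on low ones. Because the score is computed from logits rather than sampled text, repeated calls on the same query give the same result.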

In practice

Use numerical tokens (0-9) for compact safety signal extraction.
Implement in-context learning examples to align logit judgments.
Balance maliciousness and benignness scores with a $\lambda=0.5$ coefficient.
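As a rough illustration of the first two points, the sketch below builds a grading prompt with in-context examples and then keeps only the digit-token logits from a next-token logit table. The prompt wording, the function names, and the dict-based logit format are hypothetical choices made for this example; the summary does not specify the paper's actual prompt or tokenizer handling.

```python
# Assumed NT set: the ten digit tokens "0".."9".
NUMERICAL_TOKENS = [str(d) for d in range(10)]


def build_grading_prompt(query, examples, perspective="maliciousness"):
    """Build a few-shot grading prompt that ends right before the digit.

    examples: list of (example_query, grade) pairs used as in-context
    demonstrations. The wording here is illustrative, not the paper's prompt.
    """
    lines = [
        f"Rate the {perspective} of the query on a scale from 0 to 9.",
        "Answer with a single digit.",
        "",
    ]
    for ex_query, ex_grade in examples:
        lines += [f"Query: {ex_query}", f"Score: {ex_grade}", ""]
    lines += [f"Query: {query}", "Score:"]
    return "\n".join(lines)


def select_nt_logits(next_token_logits):
    """Keep only the logits of the numerical tokens, ordered "0".."9".

    next_token_logits: mapping from token string to logit for the next
    position (an assumed format; real tokenizers map strings to ids).
    """
    missing = [t for t in NUMERICAL_TOKENS if t not in next_token_logits]
    if missing:
        raise KeyError(f"missing numerical tokens: {missing}")
    return [next_token_logits[t] for t in NUMERICAL_TOKENS]
```

Ending the prompt at "Score:" means a single forward pass yields the logits of interest at the next position, so no text needs to be generated. In a real deployment the digit strings would first be mapped to token ids with the model's tokenizer.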

Topics

SelfGrader
Jailbreak Detection
LLM Security
Token-Level Logits
Guardrail Methods

Code references

tatsu-lab/alpaca_eval

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.