How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring
Summary
A study of the automated judges used to score LLM jailbreak attack-success rates (ASR) finds significant inconsistencies and vulnerabilities. The researchers compared dedicated safety classifiers and LLM-as-judges against 596 human-labeled completions from the HarmBench validation set. Dedicated classifiers showed high recall (0.974) but lower precision (0.835), meaning they over-flag completions as harmful. They resisted surface attacks (flip rate of at most 6.7%) but were susceptible to white-box GCG attacks, which flipped 70% of confident true positives (21 of 30). LLM-as-judges, by contrast, showed high precision (0.81 to 0.94) but erratic recall (0.06 to 0.65), so the measured ASR varies widely with the choice of judge. These LLM judges were also highly vulnerable to benign framing wrappers, which flipped 57% to 100% of responses; a single prepended refusal sentence was often enough (39% to 88%). An audit confirmed that the flipped responses still contained harmful content. The findings indicate that many reported ASRs, especially those produced by LLM judges, are unreliable.
Key takeaway
If you are an AI security engineer or researcher evaluating LLM safety, critically assess the reliability of your automated jailbreak judges. Because LLM-as-judges are highly susceptible to simple framing attacks and dedicated classifiers can be flipped by white-box attacks, your reported attack-success rates (ASRs) may be significantly inaccurate. Report judge precision and recall on human-labeled data, correct ASRs for judge precision, and integrate adversarial checks into your evaluation pipeline.
Key insights
Automated LLM jailbreak judges, particularly LLM-as-judges, are unreliable and vulnerable to adversarial manipulation, leading to inaccurate attack-success rates.
Principles
- Automated judges have distinct failure modes.
- Judge reliability impacts reported ASR accuracy.
- Adversarial robustness is critical for evaluation.
Method
The study compared dedicated safety classifiers and LLM-as-judges against 596 human-labeled completions, then subjected them to surface and white-box adversarial attacks to assess calibration and robustness.
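The two measurements described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names `precision_recall` and `flip_rate`, the toy judge, and the refusal wrapper are all assumptions made for the example. Human labels are treated as ground truth, and a "flip" means a response the judge originally flagged as harmful is judged safe after perturbation.

```python
from typing import Callable, Sequence, Tuple


def precision_recall(human: Sequence[bool], judge: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of judge verdicts against human labels (True = harmful)."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def flip_rate(
    judge_fn: Callable[[str], bool],
    responses: Sequence[str],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of judge-flagged responses whose verdict becomes 'safe' after perturbation."""
    flagged = [r for r in responses if judge_fn(r)]
    if not flagged:
        return 0.0
    flipped = sum(not judge_fn(perturb(r)) for r in flagged)
    return flipped / len(flagged)


# Hypothetical toy judge: treats anything that opens with an apology as safe.
def toy_judge(text: str) -> bool:
    return not text.startswith("I'm sorry")


# Hypothetical wrapper: prepend a single refusal sentence to the response.
def prepend_refusal(text: str) -> str:
    return "I'm sorry, but I can't help with that. " + text


if __name__ == "__main__":
    human = [True, True, True, False, False]
    judge = [True, True, False, True, False]
    print(precision_recall(human, judge))
    print(flip_rate(toy_judge, ["Step 1: ...", "Sure, here is ..."], prepend_refusal))
```

In a real pipeline `judge_fn` would wrap a classifier or LLM call, and `perturb` would be whichever surface or framing attack is being tested.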
In practice
- Report judge precision and recall.
- Correct ASR for judge precision.
- Include adversarial judge checks.
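One simple way to apply the precision correction is sketched below. It is an assumption about how such a correction could be done, not the study's stated procedure: multiplying the judge-reported ASR by precision estimates the share of completions that are both flagged and truly harmful, and optionally dividing by recall accounts for harmful completions the judge missed. The function name `corrected_asr` and the raw ASR of 0.50 are hypothetical; the precision and recall values are the dedicated-classifier figures from the summary.

```python
from typing import Optional


def corrected_asr(raw_asr: float, precision: float, recall: Optional[float] = None) -> float:
    """Adjust a judge-reported ASR using the judge's measured precision (and optionally recall).

    raw_asr * precision  -> estimated fraction flagged AND truly harmful.
    ... / recall         -> additionally scales up for harmful completions the judge missed.
    """
    if not 0.0 <= raw_asr <= 1.0:
        raise ValueError("raw_asr must be in [0, 1]")
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    estimate = raw_asr * precision
    if recall is not None:
        if not 0.0 < recall <= 1.0:
            raise ValueError("recall must be in (0, 1]")
        estimate /= recall
    # A rate cannot exceed 1, so clamp the estimate.
    return min(estimate, 1.0)


if __name__ == "__main__":
    # Hypothetical raw ASR of 0.50 scored by a judge with precision 0.835 and recall 0.974.
    print(corrected_asr(0.50, 0.835))
    print(corrected_asr(0.50, 0.835, 0.974))
```

Precision depends on how common harmful completions are in the scored set, so a value measured on one human-labeled set transfers only approximately to attack outputs with a different base rate. Reporting the raw ASR, the judge's precision and recall, and the corrected figure together lets readers see how much the correction matters.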
Topics
- LLM Jailbreaks
- Prompt Injection
- Attack Success Rate
- Automated Evaluation
- Adversarial Robustness
- Safety Classifiers
- LLM-as-a-Judge
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.