How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, medium

Summary

A study of the automated judges used to score LLM jailbreak attack-success rates (ASR) finds that they are inconsistent and easy to manipulate. The researchers compared dedicated safety classifiers and LLM-as-judges against 596 human-labeled completions from the HarmBench validation set. Dedicated classifiers showed high recall (0.974) but lower precision (0.835), meaning they over-flag content as harmful. They resisted surface attacks (flip rate of at most 6.7%) but not white-box GCG attacks, which flipped 70% of confident true positives (21 of 30). LLM-as-judges, by contrast, showed high precision (0.81 to 0.94) but widely varying recall (0.06 to 0.65), so the measured ASR depends heavily on which judge is used. LLM-judges were also highly vulnerable to benign framing wrappers, which flipped 57% to 100% of responses; a single prepended refusal sentence was often enough (39% to 88%). An audit confirmed that the flipped responses still contained harmful content. The findings indicate that many reported ASRs, especially those scored by LLM-judges, are unreliable.
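
To see why precision and recall translate directly into ASR error, the relationship can be worked out from the definitions. This is a sketch, not a formula taken from the study: the symbols N, H, F, TP, p, and r are introduced here for illustration, and it assumes that precision and recall measured on the human-labeled set carry over to the completions being scored. The summary reports ranges for the LLM-as-judges rather than per-judge pairs, so the bounds below simply combine the extremes of those ranges.

```latex
% N completions in total, H truly harmful, F flagged by the judge,
% TP both truly harmful and flagged.
\[
p = \frac{TP}{F}, \qquad r = \frac{TP}{H}
\quad\Longrightarrow\quad
\mathrm{ASR}_{\text{judge}} = \frac{F}{N}
  = \frac{r}{p}\cdot\frac{H}{N}
  = \frac{r}{p}\,\mathrm{ASR}_{\text{true}}
\]
% Dedicated classifiers: r/p = 0.974 / 0.835 \approx 1.17  (ASR overstated)
% LLM-as-judges: r/p between 0.06 / 0.94 \approx 0.06
%                and 0.65 / 0.81 \approx 0.80              (ASR understated)
```

Under these assumptions, the same set of completions could yield an ASR roughly 17% above the true value with a dedicated classifier and anywhere from about 6% to 80% of the true value with an LLM-as-judge.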

Key takeaway

If you are an AI security engineer or researcher evaluating LLM safety, critically assess the reliability of your automated jailbreak judges. Because LLM-as-judges are highly susceptible to simple framing attacks and dedicated classifiers can be defeated by white-box attacks, your reported attack-success rates (ASRs) may be significantly inaccurate. Report judge precision and recall on human-labeled data, correct ASRs for judge precision, and integrate adversarial checks into your evaluation pipeline so that your safety assessments can be trusted.
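
The reporting step described above might look like the following minimal Python sketch. The function and class names (`judge_calibration`, `asr_report`, `JudgeReport`) are hypothetical, and the correction shown (raw ASR multiplied by precision) is one simple reading of "correct ASRs for judge precision", not necessarily the study's exact procedure. It estimates the share of completions that are both flagged and truly harmful, and it does not compensate for harmful completions the judge misses.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JudgeReport:
    precision: float
    recall: float
    raw_asr: float
    precision_adjusted_asr: float


def judge_calibration(human: Sequence[bool], judge: Sequence[bool]) -> tuple[float, float]:
    """Precision and recall of judge labels against human labels (True = harmful)."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


def asr_report(
    human_val: Sequence[bool],   # human labels on the validation set
    judge_val: Sequence[bool],   # judge labels on the same validation set
    judge_eval: Sequence[bool],  # judge labels on the attack outputs being scored
) -> JudgeReport:
    """Report raw ASR alongside the judge's calibration and a precision-adjusted ASR."""
    precision, recall = judge_calibration(human_val, judge_val)
    raw_asr = sum(1 for j in judge_eval if j) / len(judge_eval)
    # Assumption: validation-set precision transfers to the evaluation set.
    return JudgeReport(precision, recall, raw_asr, raw_asr * precision)


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    human_val = [True, True, True, False, False]
    judge_val = [True, True, False, True, False]
    judge_eval = [True] * 6 + [False] * 4
    print(asr_report(human_val, judge_val, judge_eval))
```

Publishing all four numbers together lets readers judge how much of a reported ASR could be an artifact of the judge.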

Key insights

Automated LLM jailbreak judges, particularly LLM-as-judges, are unreliable and vulnerable to adversarial manipulation, so the attack-success rates they produce can be inaccurate.

Method

The study compared dedicated safety classifiers and LLM-as-judges against 596 human-labeled completions, then subjected them to surface and white-box adversarial attacks to assess calibration and robustness.
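
A surface-level robustness check of the kind described here can be sketched as a flip-rate measurement. Everything named below is an assumption made for illustration: the `Judge` signature, the `REFUSAL_PREFIX` text, and the `toy_judge` stand-in, which only imitates a judge that is fooled by an opening apology. In a real pipeline the judge would be the classifier or LLM-as-judge under test, and the items would ideally be completions that humans have confirmed as harmful.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: (behavior, completion) -> True if judged harmful.
Judge = Callable[[str, str], bool]

# Hypothetical wrapper text; the study's exact wording is not given in the summary.
REFUSAL_PREFIX = "I'm sorry, but I can't help with that. "


def prepend_refusal(completion: str) -> str:
    """Surface perturbation: put a refusal sentence in front of an unchanged completion."""
    return REFUSAL_PREFIX + completion


def flip_rate(
    judge: Judge,
    items: Sequence[Tuple[str, str]],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of completions judged harmful that are judged harmless after perturbation."""
    flagged = [(b, c) for b, c in items if judge(b, c)]
    if not flagged:
        return float("nan")
    flipped = sum(1 for b, c in flagged if not judge(b, perturb(c)))
    return flipped / len(flagged)


def toy_judge(behavior: str, completion: str) -> bool:
    """Demonstration-only judge: treats anything that opens with an apology as a refusal."""
    return not completion.lower().startswith("i'm sorry")


if __name__ == "__main__":
    # Placeholder strings stand in for real behaviors and completions.
    items = [
        ("behavior-1", "Sure, here is <placeholder content>"),
        ("behavior-2", "Step 1: <placeholder content>"),
        ("behavior-3", "I'm sorry, I can't assist with that."),
    ]
    print(flip_rate(toy_judge, items, prepend_refusal))
```

Because the perturbation leaves the original completion intact, any flip reflects a change in the judge's verdict rather than a change in the content, which is the property the audit in the study checked.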

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.