EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures
Summary
EvalSafetyGap introduces a hybrid survey and conceptual framework for analyzing Large Language Model (LLM) evaluation and AI safety failures, treating both as shared proxy-measurement problems under optimization pressure. The framework synthesizes eight evidence streams, including benchmark validity, dynamic evaluation, LLM-as-judge reliability, and jailbreak robustness, covering research from 2018 to 2026. It proposes EvalSafetyGap as an organizing hypothesis that draws on Goodhart's Law, an Instability Decomposition, and an Alignment Trilemma. An exploratory 10-model audit finds that the association between capability and sustained adversarial robustness is statistically indeterminate (Pearson r=+0.232, p=0.520), and that the apparent open–closed safety gap is modest and driven primarily by governance and disclosure rather than by behavioral robustness. The work provides a shared vocabulary and evidence map for dynamic evaluation and auditable alignment.
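As a sanity check on the reported correlation, the sketch below recomputes a two-sided p-value from r alone. It assumes the figure comes from a standard t-test on Pearson's r with all 10 audited models included (8 degrees of freedom); the summary does not state the test used, and the function names are illustrative.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def pearson_p_value(r: float, n: int, steps: int = 2000) -> float:
    """Two-sided p-value for Pearson's r under the usual t-test.

    Assumption: n paired observations, null hypothesis of zero correlation.
    """
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    df = n - 2
    t_stat = abs(r) * math.sqrt(df / (1 - r * r))

    # Simpson's rule for the probability mass between 0 and t_stat.
    h = t_stat / steps
    total = t_pdf(0.0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_central_mass = (h / 3) * total

    # Everything outside [-t_stat, t_stat] counts toward the p-value.
    return 1 - 2 * half_central_mass


p = pearson_p_value(0.232, 10)
print(f"p = {p:.3f}")
```

Under these assumptions the result is roughly 0.52, in line with the reported value; the small difference in the third decimal is what one would expect from r being rounded to three places. With only 10 models, even a moderate correlation would be hard to distinguish from zero, which is why the result is described as indeterminate rather than as evidence of no relationship.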
Key takeaway
AI scientists and ML engineers evaluating LLM safety should critically assess benchmark validity and alignment robustness, distinguishing behavioral safety from governance and auditability. Do not rely solely on single-attempt attack success rates; instead, report results at multi-attempt budgets and use dynamic evaluation to counter proxy-target divergence. Make evaluation protocols transparent, detailing judges, threat models, and versioning, so that cross-model comparisons are reproducible and meaningful.
Key insights
LLM evaluation and safety share a measurement problem: proxy metrics often diverge from true latent properties.
Principles
- Goodhart's Law explains proxy-target divergence under optimization.
- Safety evaluation results depend on the protocol used; there is no universal standard.
- Governance and behavioral safety are distinct measurement dimensions.
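One simple mechanism behind the first principle is selection on a noisy proxy. The toy simulation below is not taken from the paper: it assumes each candidate has a true quality and a benchmark-like proxy score equal to that quality plus independent noise, then measures how far the proxy overstates true quality as selection gets stricter. All names and parameters are illustrative.

```python
import random
from statistics import mean


def proxy_target_gap(top_fraction: float, n: int = 20000, seed: int = 0) -> float:
    """Mean proxy score minus mean true quality among the top candidates.

    Assumption: proxy = true quality + independent unit-variance noise.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        true_quality = rng.gauss(0.0, 1.0)
        proxy_score = true_quality + rng.gauss(0.0, 1.0)
        candidates.append((proxy_score, true_quality))

    # Optimization pressure: keep only the candidates with the best proxy scores.
    candidates.sort(reverse=True)
    keep = max(1, int(n * top_fraction))
    selected = candidates[:keep]

    return mean(p for p, _ in selected) - mean(t for _, t in selected)


for fraction in (0.5, 0.1, 0.01):
    print(f"top {fraction:.0%}: gap = {proxy_target_gap(fraction):.2f}")
```

In this setup the gap widens as the selected fraction shrinks: the harder the proxy is optimized, the more of the apparent gain is noise rather than true quality.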
Method
A hybrid survey combines systematic search, narrative synthesis, and a structured 10-model audit, organized by the EvalSafetyGap conceptual framework.
In practice
- Report safety outcomes at pre-specified attempt budgets.
- Supplement static benchmarks with dynamic, adversarial evaluation.
- Separate behavioral safety from governance disclosure.
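The first recommendation can be sketched as a small reporting helper. This is a minimal illustration, assuming per-prompt attack logs are available as ordered lists of success flags; the data, function names, and budgets are hypothetical and not drawn from the paper's audit.

```python
from typing import Dict, Sequence


def asr_at_budget(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of prompts broken by at least one of the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must contain at least one prompt")
    broken = sum(1 for attempts in outcomes if any(attempts[:k]))
    return broken / len(outcomes)


def asr_report(outcomes: Sequence[Sequence[bool]],
               budgets: Sequence[int]) -> Dict[int, float]:
    """Attack success rate at each pre-specified attempt budget."""
    return {k: asr_at_budget(outcomes, k) for k in budgets}


# Made-up logs: 4 prompts, 5 ordered attack attempts each (True = attack succeeded).
toy_outcomes = [
    [False, False, True, False, False],
    [True, False, False, False, False],
    [False, False, False, False, False],
    [False, False, False, False, True],
]

report = asr_report(toy_outcomes, budgets=(1, 3, 5))
print(report)  # {1: 0.25, 3: 0.5, 5: 0.75}
```

On this toy data the single-attempt rate is 25% while the five-attempt rate is 75%, which shows why a single-attempt figure alone can understate exposure. Fixing the budgets before running the evaluation keeps the comparison across models from being tuned after the fact.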
Topics
- Large Language Models
- AI Safety
- LLM Evaluation
- Benchmark Saturation
- Goodhart's Law
- Adversarial Robustness
- Governance
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.