EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures

2025-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, extended

Summary

EvalSafetyGap is a hybrid survey and conceptual framework that analyzes Large Language Model (LLM) evaluation failures and AI safety failures as instances of one problem: proxy measurement under optimization pressure. It synthesizes eight evidence streams, including benchmark validity, dynamic evaluation, LLM-as-judge reliability, and jailbreak robustness, covering 2018–2026 research. It proposes EvalSafetyGap as an organizing hypothesis built on Goodhart's Law, an Instability Decomposition, and an Alignment Trilemma. In an exploratory 10-model audit, the association between capability and sustained adversarial robustness is statistically indeterminate (Pearson r = +0.232, p = 0.520), and the apparent open–closed safety gap is modest and driven primarily by governance and disclosure rather than behavioral robustness. The work provides a shared vocabulary and evidence map for dynamic evaluation and auditable alignment.
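
To see why a correlation of this size over so few models is called indeterminate, a rough check is to compute a confidence interval around r. The sketch below uses the standard Fisher z-transform and assumes the correlation was computed over all 10 audited models (n = 10); the summary does not state the sample size explicitly, so treat that as an assumption. The function names are illustrative, not taken from the paper.

```python
import math

def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a Pearson correlation."""
    z = math.atanh(r)             # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)   # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Assumption: n = 10, one data point per audited model.
low, high = fisher_ci(0.232, 10)
print(f"95% CI: [{low:.2f}, {high:.2f}]")    # 95% CI: [-0.47, 0.75]
print(f"t = {t_statistic(0.232, 10):.2f}")   # t = 0.67
```

Under that assumption the interval spans moderately negative to strongly positive correlations, so the data cannot distinguish "capability helps robustness" from "capability hurts robustness".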

Key takeaway

AI scientists and ML engineers evaluating LLM safety should assess benchmark validity and alignment robustness critically, and should distinguish behavioral safety from governance and auditability. Do not rely solely on single-attempt attack success rates; instead, report results at multi-attempt budgets and use dynamic evaluation to counter proxy-target divergence. Make evaluation protocols transparent by documenting judges, threat models, and versioning so that cross-model comparisons are reproducible and meaningful.

Key insights

LLM evaluation and safety share a measurement problem: proxy metrics often diverge from true latent properties.

Principles

Goodhart's Law explains proxy-target divergence under optimization.
Safety evaluation results are protocol-dependent; no universal standard exists.
Governance and behavioral safety are distinct measurement dimensions.
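
The first principle can be illustrated with a toy simulation. This is a minimal sketch, not the paper's Instability Decomposition: it assumes a hypothetical setup in which each candidate has a latent target quality and a proxy score equal to the target plus independent Gaussian noise, and "optimization" means keeping the candidates with the highest proxy scores. All names and parameters are illustrative.

```python
import random
import statistics

def goodhart_gap(n_candidates: int = 10_000, top_k: int = 100, seed: int = 0) -> tuple[float, float]:
    """Select top_k candidates by a noisy proxy; return (mean proxy, mean target) of the selection."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_candidates):
        target = rng.gauss(0.0, 1.0)           # latent property we actually care about
        proxy = target + rng.gauss(0.0, 1.0)   # measured score = target + noise
        candidates.append((proxy, target))
    selected = sorted(candidates, reverse=True)[:top_k]  # optimize on the proxy only
    mean_proxy = statistics.fmean(p for p, _ in selected)
    mean_target = statistics.fmean(t for _, t in selected)
    return mean_proxy, mean_target

mean_proxy, mean_target = goodhart_gap()
print(f"selected mean proxy:  {mean_proxy:.2f}")
print(f"selected mean target: {mean_target:.2f}")
```

In this toy model the selected candidates score noticeably higher on the proxy than on the target, because selecting on the proxy also selects for favorable noise. Stronger selection pressure (smaller top_k) widens the gap.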

Method

A hybrid survey combines systematic search, narrative synthesis, and a structured 10-model audit, organized by the EvalSafetyGap conceptual framework.

In practice

Report safety outcomes at pre-specified attempt budgets.
Supplement static benchmarks with dynamic, adversarial evaluation.
Separate behavioral safety from governance disclosure.
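
The first recommendation can be sketched as an attack success rate computed at fixed attempt budgets. This is a minimal illustration under stated assumptions: the data layout (one ordered list of judged attempt outcomes per prompt), the function name asr_at_budget, and the sample data are all hypothetical, and the paper may define its metric differently.

```python
from typing import Mapping, Sequence

def asr_at_budget(outcomes: Mapping[str, Sequence[bool]], k: int) -> float:
    """Fraction of prompts attacked successfully at least once within the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    broken = sum(1 for attempts in outcomes.values() if any(attempts[:k]))
    return broken / len(outcomes)

# Hypothetical data: True marks an attempt the judge scored as a successful attack.
outcomes = {
    "prompt_a": [False, False, True, False],
    "prompt_b": [False, False, False, False],
    "prompt_c": [True, False, False, False],
    "prompt_d": [False, False, False, True],
}

for k in (1, 3, 4):  # budgets fixed before looking at the results
    print(f"ASR@{k} = {asr_at_budget(outcomes, k):.2f}")
# ASR@1 = 0.25
# ASR@3 = 0.50
# ASR@4 = 0.75
```

The same model looks three times less robust at a budget of four attempts than at one, which is why a single-attempt number alone can understate sustained adversarial risk.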

Topics

Large Language Models
AI Safety
LLM Evaluation
Benchmark Saturation
Goodhart's Law
Adversarial Robustness
Governance

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.