EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures

Source: cs.SE updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Data Science & Analytics · Depth: Expert, extended

Summary

EvalSafetyGap is a hybrid survey and conceptual framework for analyzing failures in Large Language Model (LLM) evaluation and AI safety. It treats both as the same kind of problem: proxy measurements that drift away from their targets under optimization pressure. The framework synthesizes eight evidence streams, including benchmark validity, dynamic evaluation, LLM-as-judge reliability, and jailbreak robustness, drawn from research dated 2018–2026. It proposes EvalSafetyGap as an organizing hypothesis built on Goodhart's Law, an Instability Decomposition, and an Alignment Trilemma. An exploratory 10-model audit finds that the association between capability and sustained adversarial robustness is statistically indeterminate (Pearson r=+0.232, p=0.520). It also finds that the apparent safety gap between open and closed models is modest and stems mainly from differences in governance and disclosure rather than in behavioral robustness. The work offers a shared vocabulary and evidence map for dynamic evaluation and auditable alignment.
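To make "statistically indeterminate" concrete, the following minimal sketch converts the reported coefficient into a t-statistic and an approximate 95% confidence interval using the Fisher z-transform. It assumes the correlation was computed over n = 10 model pairs (inferred from the audit size, not stated explicitly here) and uses a normal approximation that is rough at this sample size. The function name `correlation_uncertainty` is illustrative and not from the paper.

```python
import math


def correlation_uncertainty(r: float, n: int, z_crit: float = 1.96):
    """Return (t_statistic, ci_low, ci_high) for a Pearson r from n pairs.

    The interval uses the Fisher z-transform, a normal approximation
    that is only approximate for small n.
    """
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    # t-statistic for the null hypothesis rho = 0 (n - 2 degrees of freedom).
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    # atanh(r) is approximately normal with standard error 1 / sqrt(n - 3).
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return t_stat, math.tanh(z - half_width), math.tanh(z + half_width)


# Assumption: n = 10, matching the size of the audit.
t_stat, ci_low, ci_high = correlation_uncertainty(r=0.232, n=10)
print(f"t = {t_stat:.3f}, approx. 95% CI for r = [{ci_low:.2f}, {ci_high:.2f}]")
```

Under these assumptions the interval runs from roughly -0.47 to 0.75, so the data are compatible with a moderate negative association, no association, or a strong positive one. That is the sense in which a 10-model sample cannot settle the question.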

Key takeaway

AI scientists and ML engineers evaluating LLM safety should assess benchmark validity and alignment robustness critically, and should distinguish behavioral safety from governance and auditability. Do not rely solely on single-attempt attack success rates; report results across multi-attempt budgets and use dynamic evaluation to counter proxy-target divergence. Document evaluation protocols transparently, including judges, threat models, and versioning, so that cross-model comparisons are reproducible and meaningful.
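The difference between single-attempt and multi-attempt reporting can be sketched as follows. This is an illustrative example, not the paper's protocol: the name `asr_at_budget`, the "ASR@k" label, the data layout (one list of per-attempt judge verdicts per prompt), and the toy verdicts are all assumptions made for this sketch.

```python
from typing import Dict, List


def asr_at_budget(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of prompts broken at least once within the first k attempts.

    outcomes maps a prompt id to an ordered list of judge verdicts,
    where True means the attack attempt succeeded.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("no prompts to score")
    broken = sum(any(attempts[:k]) for attempts in outcomes.values())
    return broken / len(outcomes)


# Hypothetical verdicts for four prompts, four attempts each (invented data).
toy_outcomes = {
    "prompt_a": [False, False, True, False],
    "prompt_b": [False, False, False, False],
    "prompt_c": [True, False, False, False],
    "prompt_d": [False, True, False, False],
}

for k in (1, 2, 4):
    print(f"ASR@{k} = {asr_at_budget(toy_outcomes, k):.2f}")
```

On this toy data the success rate rises from 0.25 at one attempt to 0.50 at two and 0.75 at four. A single-attempt figure alone would understate how exposed the model is to a persistent attacker, which is why the attempt budget belongs in the report alongside the judge and threat model.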

Key insights

LLM evaluation and safety share a measurement problem: proxy metrics often diverge from true latent properties.

Method

The hybrid survey combines a systematic search, narrative synthesis, and a structured 10-model audit, all organized by the EvalSafetyGap conceptual framework.

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.