Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

2026-06-04 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

This study analyzes N=62,808 observations across six frontier language models and four deployment configurations and finds that evaluation conditions substantially shape measured AI safety. Map-reduce scaffolding degrades measured safety, with a Number Needed to Harm (NNH) of 14: on average, one additional safety failure for every fourteen queries routed through the scaffold. Most of this degradation, however, stems from evaluation format conversion rather than alignment failure. Switching identical items from multiple-choice (MC) to open-ended (OE) format shifts safety scores by 5–20 percentage points, a larger effect than any scaffold architecture produces, while within-format scaffold comparisons yield negligible effects (<2 pp). Sycophancy, the property with the lowest baseline safety rate (31.0%), shows the largest and least predictable model-scaffold interactions, spanning 35 percentage points. Overall, scaffold architecture explains only 0.4% of outcome variance, whereas benchmark choice explains 45 times more, which leaves composite safety indices unreliable (generalizability coefficient G=0.000).
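
To make the NNH figure concrete: NNH is the reciprocal of the absolute increase in failure rate, so an NNH of 14 corresponds to roughly 7.1 additional failures per 100 queries (1/14). The sketch below shows that arithmetic. The function name and the 10% baseline rate are illustrative assumptions, not values reported by the study.

```python
import math


def number_needed_to_harm(failure_rate_scaffolded: float,
                          failure_rate_baseline: float) -> float:
    """Return NNH = 1 / absolute risk increase.

    Returns math.inf when the scaffold does not increase the failure rate,
    since no number of queries would then produce an extra failure.
    """
    absolute_risk_increase = failure_rate_scaffolded - failure_rate_baseline
    if absolute_risk_increase <= 0:
        return math.inf
    return 1.0 / absolute_risk_increase


# Illustrative rates only (assumed, not from the study):
# a 10% baseline failure rate plus an absolute increase of 1/14.
baseline = 0.10
scaffolded = baseline + 1 / 14
print(round(number_needed_to_harm(scaffolded, baseline)))  # 14
```

A lower NNH means the configuration produces extra failures more often, so it can be compared directly across deployment configurations.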

Key takeaway

MLOps engineers deploying LLMs in agentic systems should not rely on direct-API benchmark scores alone. Evaluation protocols should mandate format-paired testing (MC and open-ended versions of the same items), include structure-destroying scaffolds such as map-reduce, and verify that safety-critical instructions propagate to all sub-calls. This separates genuine alignment issues from measurement artifacts and yields actionable NNH metrics (e.g., NNH=14 for naive map-reduce) for assessing and mitigating deployment risk.
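
Format-paired testing can be sketched as follows: score each item in both formats, keep only items present in both, and report the shift in safety rate. The record layout (`item_id`, `format`, `safe`) and the function name are hypothetical, chosen for illustration. This is not the study's pipeline.

```python
from statistics import mean


def format_shift_pp(results):
    """Compare safety rates for the same items in MC and OE format.

    `results` is a list of dicts with keys `item_id`, `format` ("MC" or "OE")
    and `safe` (bool). Items scored in only one format are dropped so the
    comparison stays paired. Returns (mc_rate, oe_rate, shift), where shift
    is OE minus MC in percentage points.
    """
    by_item = {}
    for record in results:
        by_item.setdefault(record["item_id"], {})[record["format"]] = record["safe"]

    paired = [v for v in by_item.values() if "MC" in v and "OE" in v]
    if not paired:
        raise ValueError("no items were scored in both formats")

    mc_rate = mean(1.0 if p["MC"] else 0.0 for p in paired)
    oe_rate = mean(1.0 if p["OE"] else 0.0 for p in paired)
    return mc_rate, oe_rate, (oe_rate - mc_rate) * 100.0
```

A shift in the 5–20 pp range on identical items, as the summary reports, would point to a format effect rather than a scaffold effect.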

Key insights

Evaluation format, not scaffold architecture, is the primary driver of measured safety shifts in LLMs.

Principles

Scaffold effects are benchmark-specific, not generic.
Safety measurement is highly format-contingent.
Composite safety indices lack reliability due to interaction effects.

Method

The study employed pre-registration, assessor blinding, equivalence testing, and specification curve analysis across 384 analytic specifications to quantify scaffold and format effects on safety.

In practice

Propagate MC options to sub-calls in map-reduce.
Test sycophancy per-model, per-configuration.
Use NNH as an operational deployment-risk metric.
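
The first item above, propagating MC options to sub-calls, can be sketched as a map-reduce wrapper that repeats the question, the answer options, and the safety instructions in every sub-call instead of passing only a text chunk. The function name, the prompt wording, and the `call_model` callable are all hypothetical. No real model API is assumed.

```python
from typing import Callable, List


def map_reduce_with_propagation(
    chunks: List[str],
    question: str,
    options: List[str],
    safety_instructions: str,
    call_model: Callable[[str], str],
) -> str:
    """Run a map-reduce query while keeping task structure in every sub-call.

    `call_model` is a placeholder for whatever client sends a prompt and
    returns the model's text response.
    """
    # Label options A, B, C, ... and build a header shared by all calls.
    option_block = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    header = (
        f"{safety_instructions}\n\n"
        f"Question: {question}\nOptions:\n{option_block}\n"
    )

    # Map step: each chunk is analyzed with the full header attached.
    partials = [
        call_model(
            f"{header}\nContext chunk {i + 1} of {len(chunks)}:\n{chunk}\n\n"
            "Note anything in this chunk relevant to choosing an option."
        )
        for i, chunk in enumerate(chunks)
    ]

    # Reduce step: combine partial analyses, again with the full header.
    combined = "\n".join(f"- {p}" for p in partials)
    return call_model(
        f"{header}\nPartial analyses:\n{combined}\n\n"
        "Answer with a single option letter."
    )
```

A naive variant would send only the chunk text in the map step, which is the structure-destroying pattern the summary associates with degraded measured safety.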

Topics

LLM Safety Evaluation
Agentic AI Systems
Evaluation Benchmarks
Scaffolding Architectures
Format Dependence
Sycophancy Resistance
Generalizability Theory

Best for: Research Scientist, AI Scientist, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.