An Empirical Study of Security Calibration in Large Language Models for Code

2026-02-03 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cybersecurity & Data Privacy · Depth: Advanced, extended

Summary

This empirical study examines security calibration in large language models (LLMs) for code, evaluating GPT-4o-mini, Gemini-2.0-Flash, and Qwen3-Coder-Next across several temperature settings. It uses self-contained security tasks from the SALLM benchmark and multi-language, repository-level contexts from AICGSecEval. Overconfidence was widespread among the evaluated LLMs, and functional calibration was consistently worse than security calibration. For instance, GPT-4o-mini showed an Expected Calibration Error (ECE) between 0.46 and 0.48, while Gemini-2.0-Flash ranged from 0.25 to 0.26. Calibration degraded significantly in repository-level settings, with GPT-4o-mini's ECE rising from 0.411 to 0.697. Execution gating was the most effective mitigation, reducing ECE by 31-39%, whereas calibration-guided repair and prompt-level techniques yielded limited improvements and often introduced functional regressions.

Key takeaway

AI security engineers and developers evaluating LLM-generated code should not treat a model's self-reported confidence as a security guarantee. Confidence-based review gates will likely pass vulnerable code because of systematic overconfidence, especially in multi-file projects. Prioritize human review for all high-confidence outputs. Always run functional tests before any security self-assessment, since functional calibration is even worse than security calibration. Elicit functional and security confidence separately for clearer signals.

Key insights

LLMs for code are systematically overconfident in security, especially in complex repository contexts, making their self-assessments unreliable.

Principles

LLM confidence often misaligns with empirical accuracy.
Functional calibration is worse than security calibration.
Repository context degrades calibration significantly.

Method

The study ran a multi-stage empirical evaluation combining the SALLM (self-contained Python) and AICGSecEval (repository-level, multi-language) benchmarks. It measured ECE, Brier score, overconfidence gap, and False Trust rate, with confidence estimated from verbalized confidence, token probabilities, sampling, and self-consistency.
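
The metrics named in this section can be sketched as follows. This is a minimal illustration, not the study's evaluation code. ECE uses the common equal-width binning formulation and the Brier score is the mean squared difference between confidence and outcome. The definitions used here for the overconfidence gap (mean confidence minus accuracy) and the False Trust rate (share of high-confidence outputs that are actually incorrect or vulnerable) are assumptions, as are the function names, the 10-bin default, and the 0.8 threshold.

```python
from typing import Sequence


def _check(confidences: Sequence[float], outcomes: Sequence[int]) -> None:
    # Both inputs must describe the same, non-empty set of generated samples.
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("need equally sized, non-empty inputs")


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width binned ECE: weighted mean of |avg confidence - accuracy| per bin."""
    _check(confidences, outcomes)
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, o))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between confidence and the 0/1 outcome."""
    _check(confidences, outcomes)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


def overconfidence_gap(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """ASSUMED definition: mean confidence minus observed accuracy."""
    _check(confidences, outcomes)
    return sum(confidences) / len(confidences) - sum(outcomes) / len(outcomes)


def false_trust_rate(
    confidences: Sequence[float], outcomes: Sequence[int], threshold: float = 0.8
) -> float:
    """ASSUMED definition: among samples with confidence >= threshold,
    the fraction whose outcome is 0 (incorrect or vulnerable)."""
    _check(confidences, outcomes)
    trusted = [o for c, o in zip(confidences, outcomes) if c >= threshold]
    if not trusted:
        return 0.0
    return sum(1 for o in trusted if o == 0) / len(trusted)
```

Here an outcome of 1 means the generated code passed the relevant check (functional or security) and 0 means it did not. As a sanity check, four outputs each reported at 0.9 confidence with only one correct give an ECE of 0.65, since the single occupied bin has average confidence 0.9 and accuracy 0.25.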

In practice

Prioritize human review for high-confidence LLM code.
Conduct functional testing before security self-assessment.
Elicit functional and security confidence separately.
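
One way to combine these recommendations is sketched below: run functional tests first, discard the model's confidence when they fail, and keep functional and security confidence as separate fields. This is an illustrative sketch under assumptions, not the study's execution-gating implementation. `Candidate`, `gated_confidence`, `triage`, and the caller-supplied `run_tests` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    code: str
    # Elicited separately, so one signal cannot mask the other.
    functional_confidence: float  # model's self-reported P(code works)
    security_confidence: float    # model's self-reported P(code is secure)


def gated_confidence(reported: float, passed_tests: bool) -> float:
    """Assumed gating rule: a failed test run overrides self-reported confidence."""
    return reported if passed_tests else 0.0


def triage(candidate: Candidate, run_tests: Callable[[str], bool]) -> str:
    """Route a candidate; run_tests is a hypothetical callback that executes
    the project's functional test suite against the generated code."""
    if not run_tests(candidate.code):
        # Functional failure: security confidence is not consulted at all.
        return "reject"
    # High security confidence is not treated as a guarantee, so passing
    # code still goes to a human reviewer rather than being auto-approved.
    return "human_security_review"
```

The design choice is that confidence never approves code on its own: it can only be lowered by a failed execution check, and a passing candidate is still reviewed.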

Topics

LLM Security
Code Generation
Model Calibration
False Trust
Vulnerability Remediation
Software Development

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.