DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

2026-06-16 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

DualGauge is the first fully automated benchmarking framework for jointly evaluating the security and functional correctness of code generated by large language models (LLMs) and autonomous coding agents. Existing benchmarks often evaluate the two properties separately; to close that gap, DualGauge introduces DualGauge-Bench, a curated suite of 154 diverse coding tasks, each with manually validated, coverage-enforced test suites for both security and functionality. The framework pairs an agentic program executor, which runs generated code in a sandbox, with an LLM-based evaluator, which assesses outcomes against expected behaviors. In validation, the executor achieved 95.08% precision and 84.67% recall, and the evaluator reached 90.54% precision and 77.91% recall. Benchmarking 21 leading LLMs revealed significant gaps between security and functionality, non-linear scaling of security with model size, and unexpected effects of quantization.
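For readers who want the arithmetic behind the precision and recall figures, here is a minimal Python sketch of the standard definitions. The summary does not say what counts as a positive for the executor or the evaluator, and the counts below are hypothetical values chosen only to illustrate the calculation; they are not data from the paper.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) computed from confusion counts."""
    predicted_pos = true_pos + false_pos   # everything the component flagged as positive
    actual_pos = true_pos + false_neg      # everything that really was positive
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical counts (an assumption, not taken from the paper).
p, r = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(f"precision={p:.2%} recall={r:.2%}")
```

High precision with lower recall, the pattern reported for both components, means that positive verdicts are usually right but some true positives are missed.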

Key takeaway

For Machine Learning Engineers and AI Security Engineers selecting or deploying LLMs for code generation: functional correctness does not imply security. Benchmark candidate models explicitly for joint security and functionality, because larger models and standard instruction tuning do not guarantee improved security and can even degrade it. Prioritize models like GPT-5 with medium reasoning for critical applications, and investigate specific quantization methods such as FP8, which unexpectedly improved security in some cases.

Key insights

DualGauge is the first automated framework for joint security-functionality benchmarking of LLM-generated code.

Principles

Joint evaluation of LLM-generated code must simultaneously assess functional correctness and security.
LLM security capabilities do not scale linearly with model size; larger models do not reliably produce more secure code, and security can even degrade.
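One way to make the joint-evaluation principle concrete is to count a task as solved only when it passes both its functional and its security test suites. The sketch below is an illustration under that assumption; the `TaskResult` type, the `pass_rates` function, and the sample results are hypothetical, and the summary does not state the exact metric definitions DualGauge uses.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    functional_pass: bool  # all functional tests passed
    security_pass: bool    # all security tests passed


def pass_rates(results: list[TaskResult]) -> dict[str, float]:
    """Compute functional, security, and joint pass rates over a set of tasks."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    functional = sum(r.functional_pass for r in results) / n
    secure = sum(r.security_pass for r in results) / n
    # A task counts toward the joint rate only if it passes both suites.
    joint = sum(r.functional_pass and r.security_pass for r in results) / n
    return {"functional": functional, "secure": secure, "joint": joint}


# Hypothetical results (an assumption, not data from the paper).
results = [
    TaskResult("t1", True, True),
    TaskResult("t2", True, False),
    TaskResult("t3", True, False),
    TaskResult("t4", False, True),
]
print(pass_rates(results))
```

In this made-up sample the functional rate is 0.75 but the joint rate is only 0.25, which is the kind of gap that separate functional and security benchmarks would not reveal.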

Method

DualGauge employs an agentic executor for sandboxed code execution and dependency resolution, followed by an LLM-based evaluator for semantic assessment of functional and security test outcomes.
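The two-stage shape described above (execute, then evaluate) can be sketched in a few lines of Python. This is a simplified illustration, not DualGauge's implementation: the function names `run_generated_code` and `evaluate` are hypothetical, the "sandbox" is only a separate process in a temporary directory with a timeout (no real isolation and no dependency resolution), and a plain string comparison stands in for the LLM-based evaluator.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, stdin_text: str, timeout_s: float = 5.0) -> dict:
    """Stage 1: run generated code in a child process inside a throwaway directory.

    Assumption: this is process separation only, not a hardened sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "solution.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"stdout": "", "stderr": "timeout", "exit_code": None}
    return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}


def evaluate(outcome: dict, expected_stdout: str) -> bool:
    """Stage 2: stand-in for the LLM-based evaluator (exact match on trimmed stdout)."""
    return outcome["exit_code"] == 0 and outcome["stdout"].strip() == expected_stdout.strip()


# Hypothetical task: read an integer and print its double.
generated = "print(int(input()) * 2)"
outcome = run_generated_code(generated, "21\n")
print(evaluate(outcome, "42"))
```

An exact-match check like this is brittle when output formats vary, which is presumably one reason the framework uses an LLM to assess outcomes semantically.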

In practice

Evaluate FP8 quantization for your deployment: it unexpectedly improved security in some cases while reducing inference costs.
Prioritize models like GPT-5 with medium reasoning for high-security applications, and select on measured security results rather than model size, since larger models do not guarantee better security.

Topics

LLM Code Generation
Automated Benchmarking
Code Security
Functional Correctness
Model Quantization
AI Software Engineering

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.