DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Cybersecurity & Data Privacy) · Depth: Expert, extended

Summary

DualGauge is the first fully automated benchmarking framework designed to jointly evaluate the security and functional correctness of code generated by large language models (LLMs) and autonomous coding agents. Existing benchmarks often evaluate these two properties separately; to close that gap, DualGauge introduces DualGauge-Bench, a curated suite of 154 diverse coding tasks. Each task includes manually validated, coverage-enforced test suites for both security and functionality. The framework pairs an agentic program executor, which runs generated code in a sandbox, with an LLM-based evaluator, which compares observed outcomes against expected behaviors. In validation, the executor achieved 95.08% precision and 84.67% recall, and the evaluator reached 90.54% precision and 77.91% recall. Benchmarking 21 leading LLMs revealed significant gaps between security and functionality, non-linear scaling of security with model size, and unexpected quantization effects.

Key takeaway

Machine learning engineers and AI security engineers who select or deploy LLMs for code generation should recognize that functional correctness does not imply security. Benchmark candidate models explicitly for joint security and functionality: larger models and standard instruction tuning do not guarantee better security and can even degrade it. For critical applications, prioritize models like GPT-5 with medium reasoning, and investigate specific quantization methods such as FP8, which unexpectedly improved security in some cases.
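To make the gap between functional correctness and security concrete, the sketch below computes a functional pass rate, a security pass rate, and a joint pass rate over per-task results. It is an illustration only: the `TaskResult` type, the `summarize` helper, the "gap" definition (functional rate minus joint rate), and the toy results are assumptions made here, not DualGauge's actual metrics or data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task outcome for one model (not a DualGauge type)."""
    functional_pass: bool  # all functional tests for the task passed
    security_pass: bool    # all security tests for the task passed


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Return functional, security, and joint pass rates plus their gap."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    functional = sum(r.functional_pass for r in results) / n
    secure = sum(r.security_pass for r in results) / n
    # A task counts toward the joint rate only if it passes both suites.
    joint = sum(r.functional_pass and r.security_pass for r in results) / n
    return {
        "functional": functional,
        "secure": secure,
        "joint": joint,
        "gap": functional - joint,  # assumed definition of the gap
    }


# Toy, made-up results for illustration; these are not figures from the paper.
toy = [
    TaskResult(True, True),
    TaskResult(True, False),
    TaskResult(True, False),
    TaskResult(False, True),
]
summary = summarize(toy)
print(summary)
```

In this toy data the model passes functional tests on three of four tasks but passes both suites on only one, which is the kind of discrepancy a functionality-only benchmark would hide.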

Key insights

DualGauge is the first automated framework for jointly benchmarking the security and functionality of LLM-generated code.

Method

DualGauge employs an agentic executor for sandboxed code execution and dependency resolution, followed by an LLM-based evaluator for semantic assessment of functional and security test outcomes.
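The two-stage flow described above (execute, then evaluate) can be sketched roughly as follows. This is a minimal stand-in under stated assumptions, not DualGauge's implementation: `run_candidate` and `evaluate` are hypothetical names, a plain subprocess with a timeout is not a real sandbox and performs no dependency resolution, and the exact-match check only marks the place where an LLM-based evaluator would make a semantic judgment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, stdin_text: str, timeout_s: float = 5.0) -> dict:
    """Run candidate code in a separate Python process and capture the outcome.

    Assumption: this only isolates the run in a temporary directory with a
    timeout. A real sandbox would also restrict network, filesystem, and
    resource access.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"exit_code": None, "stdout": "", "stderr": "timeout"}
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def evaluate(outcome: dict, expected_stdout: str) -> bool:
    """Placeholder evaluator: clean exit and exact match on trimmed stdout.

    An LLM-based evaluator would instead judge whether the observed behavior
    matches the expected behavior semantically.
    """
    return (
        outcome["exit_code"] == 0
        and outcome["stdout"].strip() == expected_stdout.strip()
    )


# Hypothetical functional test: the program should double its integer input.
candidate = "print(int(input()) * 2)"
outcome = run_candidate(candidate, "21\n")
passed = evaluate(outcome, "42")
print(passed)
```

Keeping execution and evaluation as separate steps means each stage can be validated on its own, which matches the summary's separate precision and recall figures for the executor and the evaluator.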

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.