DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents
Summary
DualGauge is the first fully automated benchmarking framework designed to jointly evaluate the security and functional correctness of code generated by large language models (LLMs) and autonomous coding agents. Existing benchmarks often evaluate these two properties separately; to close that gap, DualGauge introduces DualGauge-Bench, a curated suite of 154 diverse coding tasks. Each task includes manually validated, coverage-enforced test suites for both security and functionality. The framework pairs an agentic program executor, which runs generated code in a sandbox, with an LLM-based evaluator, which assesses outcomes against expected behaviors. In validation, the executor achieved 95.08% precision and 84.67% recall, and the evaluator reached 90.54% precision and 77.91% recall. Benchmarking 21 leading LLMs revealed significant gaps between security and functionality, non-linear scaling of security with model size, and unexpected quantization effects.
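For readers less familiar with the validation metrics quoted above, the following is a minimal sketch of how precision and recall are derived from confusion counts. The counts are hypothetical and chosen only for illustration; they are not the figures behind the percentages reported for DualGauge.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts. An empty denominator yields 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, NOT taken from the paper.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(f"precision={p:.2%} recall={r:.2%}")
```

High precision with lower recall, as in the reported numbers, means that a positive verdict is usually correct but some true positives are missed.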
Key takeaway
Machine Learning Engineers and AI Security Engineers selecting or deploying LLMs for code generation should recognize that functional correctness does not imply security. Benchmark models explicitly for joint security and functionality: larger models and standard instruction tuning do not guarantee better security and can even degrade it. For critical applications, prioritize models such as GPT-5 with medium reasoning, and investigate quantization methods such as FP8, which unexpectedly improved security in some cases, to balance performance and security.
Key insights
DualGauge offers the first automated framework for joint security-functionality benchmarking of LLM-generated code.
Principles
- Joint evaluation of LLM-generated code must simultaneously assess functional correctness and security.
- LLM security capabilities do not scale linearly with model size; a larger model does not guarantee more secure code.
Method
DualGauge employs an agentic executor for sandboxed code execution and dependency resolution, followed by an LLM-based evaluator for semantic assessment of functional and security test outcomes.
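To illustrate why joint evaluation matters, here is a minimal sketch of aggregating per-task verdicts into functional, security, and joint pass rates. The `TaskResult` type, the `joint_rates` function, and the sample data are hypothetical; this is not DualGauge's actual data model or scoring formula, only one plausible way to combine the two verdicts the evaluator produces.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task verdicts from a functional and a security test suite."""
    task_id: str
    functional_pass: bool
    security_pass: bool


def joint_rates(results: list[TaskResult]) -> dict[str, float]:
    """Compute the fraction of tasks passing each suite, and both suites together."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to aggregate")
    functional = sum(r.functional_pass for r in results) / n
    security = sum(r.security_pass for r in results) / n
    # A task counts toward the joint rate only if it passes BOTH suites.
    joint = sum(r.functional_pass and r.security_pass for r in results) / n
    return {"functional": functional, "security": security, "joint": joint}


# Illustrative data only, not benchmark results.
sample = [
    TaskResult("t1", True, True),
    TaskResult("t2", True, False),
    TaskResult("t3", True, False),
    TaskResult("t4", False, True),
]
print(joint_rates(sample))
```

In this toy example the functional rate is 0.75 and the security rate is 0.5, yet only 0.25 of tasks pass both, which is the kind of gap that separate benchmarks would not surface.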
In practice
- Consider FP8 quantization for LLMs to potentially improve security while reducing inference costs.
- Prioritize models such as GPT-5 with medium reasoning for high-security applications, since larger models do not guarantee better security.
Topics
- LLM Code Generation
- Automated Benchmarking
- Code Security
- Functional Correctness
- Model Quantization
- AI Software Engineering
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.