How to Compare the Security of Code Written by Humans to LLM-generated Code

2025-04-14 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cybersecurity & Data Privacy · Depth: Advanced, extended

Summary

The paper proposes an automated, open-source framework for empirically comparing the security of code generated by Large Language Models (LLMs) with human-written code and with code produced by hybrid approaches. It addresses the lack of standardized methods for "species-fair" evaluations by automating prompt logging, timing, and experimental settings. Outcomes are measured through multi-dimensional static and dynamic quality analysis, executed within isolated Podman containers to ensure reproducibility and environmental symmetry. A feasibility study validated the framework on 13 Python security and algorithmic challenges, comparing five OpenAI LLM variants (gpt-4.1, gpt-4o-mini, gpt-5.1, gpt-5-mini, gpt-5-nano) against human reference solutions. The study found that exercise selection influenced correctness more than model choice, and it identified common failure modes such as improper input handling and algorithmic errors.

Key takeaway

If you are an AI security engineer or research scientist evaluating LLM-generated code, adopt a standardized, "species-fair" framework so that your security comparisons are reproducible and unbiased. Prioritize frameworks that use containerization and multi-dimensional analysis, and plan for significant experimental attrition caused by LLM non-determinism. This approach helps you separate true security differences from artifacts of the experimental design.

Key insights

A "species-fair" framework enables reproducible, automated comparison of human and LLM-generated code security.

Principles

Ensure environmental symmetry for fair code evaluation.
Maintain instruction parity between human tasks and LLM prompts.
Map model capabilities to commensurate human experience levels.
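
One way to picture instruction parity is to render the human task sheet and the LLM prompt from a single task specification, then check that both contain the same instruction body. The sketch below is illustrative only: `TaskSpec`, the render functions, and the parity check are hypothetical names, not part of the framework described in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Single source of truth for one exercise (hypothetical structure)."""
    title: str
    description: str
    constraints: tuple[str, ...]


def render_body(spec: TaskSpec) -> str:
    """Render the instruction text that both audiences must receive verbatim."""
    lines = [spec.title, "", spec.description, "", "Constraints:"]
    lines += [f"- {c}" for c in spec.constraints]
    return "\n".join(lines)


def render_for_human(spec: TaskSpec) -> str:
    # Humans get a short heading, then the shared body unchanged.
    return "Participant task sheet\n\n" + render_body(spec)


def render_for_llm(spec: TaskSpec) -> str:
    # The LLM prompt is the shared body with no extra hints or examples.
    return render_body(spec)


def has_instruction_parity(spec: TaskSpec, human_text: str, llm_text: str) -> bool:
    """True when both renderings contain the identical instruction body."""
    body = render_body(spec)
    return body in human_text and body in llm_text
```

Generating both renderings from one specification makes it harder for extra hints to slip into only one side of the comparison.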

Method

The framework automates prompt logging, timing, and settings, then executes human and LLM code in isolated Podman containers. It measures outcomes via multi-dimensional static (Ruff linter) and dynamic quality analysis.
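
As a rough sketch of what environmental symmetry can look like in practice, the helper below builds the same `podman run` invocation for every submission, so that only the mounted source directory differs between human and LLM code. The image name, resource limits, directory layout, and function names are assumptions made for illustration; the paper's actual container configuration is not reproduced here.

```python
import subprocess
from pathlib import Path


def build_podman_command(image: str, solution_dir: Path,
                         entrypoint: str = "solution.py") -> list[str]:
    """Build a `podman run` command with identical limits for every submission."""
    return [
        "podman", "run",
        "--rm",                # discard the container after the run
        "--network", "none",   # untrusted code gets no network access
        "--memory", "512m",    # assumed memory ceiling, same for all runs
        "--cpus", "1",         # assumed CPU budget, same for all runs
        "--read-only",         # container root filesystem is immutable
        "-v", f"{solution_dir.resolve()}:/work:ro",  # mount the code read-only
        image,
        "python", f"/work/{entrypoint}",
    ]


def build_ruff_command(solution_dir: Path) -> list[str]:
    """Static analysis step: lint the submission and emit JSON diagnostics."""
    return ["ruff", "check", "--output-format", "json", str(solution_dir)]


def run_isolated(cmd: list[str], timeout_s: int = 60) -> subprocess.CompletedProcess:
    """Execute a prepared command, capturing output and enforcing a time limit."""
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
```

A call such as `run_isolated(build_podman_command("docker.io/library/python:3.12-slim", Path("llm/ex01")))` would then run one submission, assuming Podman is installed and the directory exists.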

In practice

Use Podman containers for isolated, reproducible code execution.
Enforce functional unit tests before security analysis.
Over-provision samples to account for LLM non-determinism.
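
The last two practices can be sketched as a small planning helper and a gate. The attrition rate is a value you would have to estimate yourself, for example from a pilot run; the formula, the `Sample` record, and the function names below are illustrative assumptions and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One submitted solution (hypothetical record layout)."""
    source: str              # "human" or an LLM identifier such as "gpt-4.1"
    exercise: str
    passed_unit_tests: bool


def samples_to_request(target_valid: int, expected_attrition: float) -> int:
    """Number of samples to generate so that roughly `target_valid` survive.

    `expected_attrition` is the assumed fraction of samples that will be
    discarded (for example, because they fail the functional unit tests).
    """
    if not 0.0 <= expected_attrition < 1.0:
        raise ValueError("expected_attrition must be in [0, 1)")
    return math.ceil(target_valid / (1.0 - expected_attrition))


def eligible_for_security_analysis(samples: list[Sample]) -> list[Sample]:
    """Gate: only functionally correct samples proceed to security analysis."""
    return [s for s in samples if s.passed_unit_tests]
```

For instance, if half of all generated samples are expected to be discarded, `samples_to_request(10, 0.5)` asks for 20 samples to end up with about 10 usable ones.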

Topics

LLM Code Security
Species-Fair Evaluation
Static Code Analysis
Dynamic Code Analysis
Containerization
Software Security Testing

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, Research Scientist, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.