WildCode Revisited: A Comprehensive Empirical Study on the Security of LLM-Generated Code

2026-06-04 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering · Depth: Advanced, extended

Summary

A comprehensive empirical study, "WildCode Revisited," analyzed the security of code generated by ChatGPT using 82,843 real-world conversations extracted from the WildChat dataset (April 2023-May 2024). The research confirms that LLM-generated code frequently exhibits significant security vulnerabilities. Key findings include vulnerability rates of 20.61% for weak hash functions and 3.93% for SQL injection, and memory safety issues in 14.85% of C/C++ programs. Notably, all 30 Java deserialization instances examined were vulnerable, and approximately one-third of regular expressions were susceptible to ReDoS attacks. The study also found that 14.4% of the Python modules and 3.5% of the JavaScript packages referenced in generated code were "hallucinated" (non-existent). Furthermore, user intent analysis showed that "Secure Coding" was rarely prioritized in queries, even when users encountered buggy code, indicating a significant gap in security awareness.

Key takeaway

Software engineers integrating LLM-generated code should assume that it carries security risks. Scan all AI-produced code with a static analysis tool such as OpenGrep for vulnerabilities including weak hashes, SQL injection, and memory safety issues. Do not rely on the LLM to self-correct or on users to prompt for security: request secure coding practices explicitly, and verify every package dependency to guard against hallucinated modules.
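As an illustration of one vulnerability class named above, the following minimal sketch contrasts a string-concatenated SQL query with a parameterized one using Python's standard `sqlite3` module. The table, the function names (`find_user_unsafe`, `find_user_safe`), and the payload are hypothetical and are not taken from the study.

```python
import sqlite3


def make_db():
    # In-memory database with two illustrative rows (hypothetical data).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable pattern: user input is concatenated into the SQL text,
    # so quotes inside `name` can change the structure of the query.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized pattern: `name` is bound as data, never parsed as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()
payload = "nobody' OR '1'='1"  # classic injection string
unsafe_rows = find_user_unsafe(conn, payload)
safe_rows = find_user_safe(conn, payload)

print("unsafe query returned", len(unsafe_rows), "rows")
print("safe query returned", len(safe_rows), "rows")
```

With the concatenated query, the payload rewrites the `WHERE` clause and every row is returned. With the bound parameter, the payload is treated as a literal name and nothing matches.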

Key insights

Code generated by ChatGPT in real-world conversations frequently contains significant security vulnerabilities, and users rarely ask for them to be addressed.

Principles

LLM-generated code often lacks explicit security features.
User queries rarely prioritize security concerns.
Real-world interaction data reveals distinct LLM biases.

Method

The study constructed a dataset from 82,843 real ChatGPT conversations containing code, then used OpenGrep with 648 rules for security analysis and zero-shot classification for user intent.
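To give a feel for what rule-based static analysis does, here is a toy sketch that flags calls to `hashlib.md5` and `hashlib.sha1` by walking a Python syntax tree. It is not OpenGrep and does not reproduce any of the study's 648 rules. The rule set `WEAK_HASHES`, the function `find_weak_hash_calls`, and the sample snippet are assumptions made for illustration.

```python
import ast

# Hypothetical one-rule "ruleset": hashlib functions treated as weak.
WEAK_HASHES = {"md5", "sha1"}


def find_weak_hash_calls(source):
    """Return sorted (line, function) pairs for hashlib.md5/sha1 calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match the syntactic shape `hashlib.<name>(...)` only.
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "hashlib"
            and func.attr in WEAK_HASHES
        ):
            findings.append((node.lineno, func.attr))
    return sorted(findings)


# Hypothetical snippet standing in for code pulled from a conversation.
SAMPLE = '''import hashlib

def fingerprint(data):
    return hashlib.md5(data).hexdigest()

def checksum(data):
    return hashlib.sha256(data).hexdigest()
'''

findings = find_weak_hash_calls(SAMPLE)
for line, name in findings:
    print(f"line {line}: weak hash function hashlib.{name}")
```

A purely syntactic rule like this one misses aliased imports such as `from hashlib import md5`, which is one reason to rely on a maintained ruleset rather than ad hoc checks.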

In practice

Analyze LLM-generated code with static analysis tools.
Explicitly prompt LLMs for secure coding practices.
Validate all generated package imports against repositories.
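The last item above can be approximated offline. The sketch below extracts top-level imports from a generated Python snippet and reports those that do not resolve in the current environment. It assumes local resolution as a first filter only: a name that fails to resolve may simply be uninstalled, so each hit still needs to be checked against the package repository. The function names and the sample module name `totally_made_up_pkg_xyz` are hypothetical.

```python
import ast
import importlib.util


def imported_top_level_modules(source):
    """Collect the top-level module names imported by `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they refer to local code.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def unresolved_imports(source):
    """Return imported names that cannot be found in this environment."""
    missing = []
    for name in sorted(imported_top_level_modules(source)):
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Hypothetical generated snippet; the middle import is assumed not to exist.
GENERATED = (
    "import json\n"
    "import totally_made_up_pkg_xyz\n"
    "from os import path\n"
)

missing = unresolved_imports(GENERATED)
print("imports needing manual verification:", missing)
```

Parsing the snippet with `ast` rather than executing it means a suspicious import is never actually run during the check.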

Topics

LLM Code Security
ChatGPT
Static Code Analysis
Software Vulnerabilities
User Prompting
Hallucinated Modules

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.