WildCode Revisited: A Comprehensive Empirical Study on the Security of LLM-Generated Code

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering) · Depth: Advanced, extended

Summary

The empirical study "WildCode Revisited" analyzed the security of ChatGPT-generated code in 82,843 real-world conversations extracted from the WildChat dataset (April 2023 to May 2024). It confirms that LLM-generated code frequently contains significant security vulnerabilities. Key findings include a 20.61% vulnerability rate for weak hash functions and 3.93% for SQL injection, and 14.85% of C/C++ programs contained memory safety issues. All 30 Java deserialization instances examined were vulnerable, and approximately one-third of regular expressions were susceptible to ReDoS (regular expression denial of service) attacks. The study also found that 14.4% of Python modules and 3.5% of JavaScript packages generated were "hallucinated" (non-existent). Finally, user intent analysis showed that "Secure Coding" was rarely prioritized in queries, even when users encountered buggy code, which points to a significant gap in security awareness.
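To make the SQL injection category concrete, here is a minimal Python sketch that contrasts a string-built query with a parameterized one. It is illustrative only and is not taken from the study's dataset; the table layout, the function names `lookup_unsafe` and `lookup_safe`, and the payload are assumptions chosen for the example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn: sqlite3.Connection, name: str) -> list[str]:
    # Vulnerable pattern: user input is spliced into the SQL text itself.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn: sqlite3.Connection, name: str) -> list[str]:
    # Parameterized pattern: the driver passes `name` as data, never as SQL.
    cursor = conn.execute("SELECT secret FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor]


conn = make_db()
payload = "x' OR '1'='1"  # classic injection string, used here as an assumption

unsafe_rows = lookup_unsafe(conn, payload)  # the WHERE clause becomes always-true
safe_rows = lookup_safe(conn, payload)      # the payload is treated as a literal name
```

With this payload, the string-built query returns every row in the table, while the parameterized query returns nothing because no user has that literal name.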

Key takeaway

Software engineers integrating LLM-generated code should assume it carries security risks. Scan all AI-produced code with a static analysis tool such as OpenGrep for vulnerabilities such as weak hashes, SQL injection, and memory safety issues. Do not rely on the LLM to self-correct or on users to prompt for security: request secure coding practices explicitly, and verify every package dependency to guard against hallucinated modules.
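One inexpensive first step toward verifying dependencies is to list the imports in a generated snippet that do not resolve in the current environment. The sketch below does this with Python's standard `ast` and `importlib` modules. It is an assumption-laden illustration, not the study's detection method: the helper name `unresolved_imports` and the sample module name are hypothetical. A module missing locally may simply not be installed, so each hit still needs to be checked against the package registry by hand.

```python
import ast
import importlib.util


def _resolvable(name: str) -> bool:
    """True if a top-level module with this name can be located locally."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def unresolved_imports(source: str) -> list[str]:
    """Return top-level modules imported by `source` that cannot be found."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports, which are local by definition
            names.add(node.module.split(".")[0])
    return sorted(n for n in names if not _resolvable(n))


# Hypothetical LLM output: one standard-library import, one invented module.
generated = (
    "import json\n"
    "import totally_made_up_pkg_xyz\n"
    "from os import path\n"
)
missing = unresolved_imports(generated)
```

The check only parses the code and never executes it, which matters when the snippet itself is untrusted. It also cannot tell whether a name that does resolve is the package the author intended, so it complements registry verification rather than replacing it.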

Key insights

Code generated by ChatGPT, the model examined in the study, frequently contains significant security vulnerabilities, and users rarely ask for them to be addressed.

Method

The study constructed a dataset from 82,843 real ChatGPT conversations containing code, then used OpenGrep with 648 rules for security analysis and zero-shot classification for user intent.
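To illustrate what rule-based static analysis looks like in miniature, the sketch below flags weak hash calls in Python source by walking the syntax tree. It is a toy stand-in, not OpenGrep and not one of the study's 648 rules. Treating MD5 and SHA-1 as the "weak" set is an assumption made for this example, and the function name `find_weak_hash_calls` is hypothetical.

```python
import ast

# Assumption for this sketch: MD5 and SHA-1 count as weak hash functions.
WEAK_HASHES = {"md5", "sha1"}


def find_weak_hash_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, algorithm) for each weak hashlib call found."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Only consider calls of the form hashlib.<something>(...)
        if not (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "hashlib"):
            continue
        if func.attr in WEAK_HASHES:
            # Direct constructor, e.g. hashlib.md5(...)
            findings.append((node.lineno, func.attr))
        elif (func.attr == "new" and node.args
              and isinstance(node.args[0], ast.Constant)
              and str(node.args[0].value).lower() in WEAK_HASHES):
            # Named constructor, e.g. hashlib.new("sha1", ...)
            findings.append((node.lineno, str(node.args[0].value).lower()))
    return sorted(findings)


sample = (
    "import hashlib\n"
    "digest = hashlib.md5(b'pw').hexdigest()\n"
    "ok = hashlib.sha256(b'pw').hexdigest()\n"
    "legacy = hashlib.new('sha1', b'pw')\n"
)
findings = find_weak_hash_calls(sample)
```

A purely syntactic rule like this misses aliased imports such as `from hashlib import md5`, which is one reason to prefer a maintained rule set over hand-written checks.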

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.