Towards Functional Correctness of Large Code Models with Selective Generation

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

This paper addresses code hallucination in large code models in order to improve the functional correctness of generated code. The authors propose a "selective code generator" that relies on unit tests generated automatically by dynamic code analysis tools, specifically fuzzers such as Atheris. Functional correctness is formalized as "α-code entailment": a generated code snippet is considered correct if it matches the true code's input-output behavior with probability at least 1-α. The learning algorithm for the selective generator comes with a theoretical guarantee that controls the false discovery rate with respect to code entailment (FDR-CE), that is, the rate of hallucinated code among the generations it returns. Evaluated with fuzzing-generated unit tests, a paradigm the authors term FuzzEval, the approach is reported to outperform open and closed code generators, including GPT-4o, Gemini 1.5 Pro, and DeepSeek R1, across various datasets and programming languages, with better controllability of hallucination and better selection efficiency.
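To make the α-code entailment idea concrete, here is a minimal sketch of an empirical check. It is an illustration under stated assumptions, not the authors' implementation: the function names (`alpha_entails`, `reference_abs`, `buggy_abs`) are hypothetical, and plain random integers stand in for the fuzzer-generated inputs the paper obtains from tools such as Atheris. The paper's definition is a probability over inputs; the sketch only estimates it from a finite sample.

```python
import random
from typing import Any, Callable, Iterable


def alpha_entails(
    candidate: Callable[[Any], Any],
    reference: Callable[[Any], Any],
    inputs: Iterable[Any],
    alpha: float,
) -> bool:
    """Empirical check: does `candidate` match `reference` on at least
    a (1 - alpha) fraction of the given inputs?"""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one test input")
    mismatches = 0
    for x in inputs:
        expected = reference(x)
        try:
            matched = candidate(x) == expected
        except Exception:
            # A crash in the candidate counts as a behavioral mismatch.
            matched = False
        if not matched:
            mismatches += 1
    return mismatches / len(inputs) <= alpha


def reference_abs(x: int) -> int:
    """Stand-in for the true code."""
    return abs(x)


def buggy_abs(x: int) -> int:
    """Stand-in for a hallucinated generation: wrong on negative inputs."""
    return x


# Assumption: seeded random integers replace fuzzer-generated inputs.
rng = random.Random(0)
test_inputs = [rng.randint(-1000, 1000) for _ in range(500)]

print(alpha_entails(reference_abs, reference_abs, test_inputs, alpha=0.05))
print(alpha_entails(buggy_abs, reference_abs, test_inputs, alpha=0.05))
```

A smaller α demands closer agreement with the true code; at α = 0 the check reduces to passing every generated test.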

Key takeaway

For AI engineers developing or deploying code generation models: integrate selective generation to improve functional correctness and trustworthiness. Use dynamic code analysis tools such as fuzzers to generate unit tests automatically, and use those tests to check "α-code entailment". This keeps the code hallucination rate under control with a theoretical guarantee, which improves reliability and reduces risk. Choose the scoring function with care, since the quality of its calibration matters for the selective generator.

Key insights

Code hallucination in LLMs can be controlled by a selective generator using fuzzing-generated unit tests and "α-code entailment".

Method

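A selective generator learns a selection function s(x) = 1(f(x, G(x)) >= τ), where G(x) is the code generated for prompt x, f is a scoring function, τ is a threshold, and 1(·) is the indicator function. The generator returns G(x) when s(x) = 1 and abstains otherwise. Learning picks the smallest τ, so that the generator abstains as rarely as possible, for which an upper bound on the FDR-CE stays below the target level; the bound is computed from fuzzing-generated unit tests with a binomial tail bound.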
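The threshold search can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: it assumes a calibration set of scores f(x, G(x)) with boolean labels saying whether each generation passed the α-code entailment check, uses a Clopper-Pearson style upper bound as the binomial tail bound, and splits the confidence budget over candidate thresholds with a union bound. The names `binom_upper_bound`, `learn_threshold`, `epsilon`, and `delta` are hypothetical.

```python
import math
from typing import Sequence


def binom_cdf(k: int, n: int, p: float) -> float:
    """P[Binomial(n, p) <= k]."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def binom_upper_bound(k: int, n: int, delta: float) -> float:
    """Upper confidence bound on an error rate after observing k errors
    in n trials; holds with probability at least 1 - delta."""
    if n == 0 or k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on p; the CDF is decreasing in p
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def learn_threshold(
    scores: Sequence[float],
    entailed: Sequence[bool],
    epsilon: float,
    delta: float,
) -> float:
    """Return the smallest threshold tau whose FDR upper bound is at most
    epsilon, or math.inf (always abstain) if no threshold qualifies."""
    candidates = sorted(set(scores))
    if not candidates:
        return math.inf
    # Assumption: a union bound over candidate thresholds.
    per_candidate_delta = delta / len(candidates)
    for tau in candidates:  # ascending, so the first feasible tau is the smallest
        selected = [e for s, e in zip(scores, entailed) if s >= tau]
        errors = sum(1 for e in selected if not e)
        if binom_upper_bound(errors, len(selected), per_candidate_delta) <= epsilon:
            return tau
    return math.inf


def select(score: float, tau: float) -> bool:
    """Selection function s(x) = 1(f(x, G(x)) >= tau)."""
    return score >= tau


# Toy calibration set: 40 high-scoring correct generations, 10 low-scoring wrong ones.
scores = [0.9] * 40 + [0.2] * 10
entailed = [True] * 40 + [False] * 10
tau = learn_threshold(scores, entailed, epsilon=0.2, delta=0.1)
print(tau, select(0.95, tau), select(0.2, tau))
```

In this toy set, accepting everything gives an empirical error rate of 10/50, whose upper bound exceeds 0.2, so the search settles on the higher threshold and the generator abstains on low-scoring outputs.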

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.