Towards Functional Correctness of Large Code Models with Selective Generation
Summary
This paper introduces an approach to improving the functional correctness of large code models by addressing code hallucination. The authors propose a "selective code generator" that relies on unit tests generated automatically by dynamic code analysis tools, specifically fuzzing tools such as Atheris. The paper defines "α-code entailment" as its measure of functional correctness: a generated code snippet is considered correct if it matches the true code's input-output behavior with high probability (at least 1-α). The learning algorithm for the selective generator comes with a theoretical guarantee that controls the false discovery rate with respect to code entailment (FDR-CE), that is, the rate of hallucinated code among the outputs the generator chooses to return. Under the paper's evaluation, termed FuzzEval, the method is reported to outperform open and closed code generators, including GPT-4o, Gemini 1.5 Pro, and DeepSeek R1, across several datasets and programming languages, with better controllability of hallucination and better selection efficiency.
Key takeaway
AI engineers developing or deploying code generation models can use selective generation to improve functional correctness and trustworthiness. Generate unit tests automatically with dynamic code analysis tools such as fuzzers, and use them to check "α-code entailment". This controls the code hallucination rate with a theoretical guarantee, which improves reliability and reduces risk. Choose the scoring function carefully, since selection quality depends on how well its scores are calibrated.
Key insights
Code hallucination in LLMs can be controlled by a selective generator using fuzzing-generated unit tests and "α-code entailment".
Principles
- Code's executable nature enables automatic unit test generation.
- "α-code entailment" defines functional correctness via probabilistic execution.
- Selective generation can control hallucination rates with theoretical guarantees.
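To make the second principle concrete, here is a minimal Python sketch of an empirical α-code entailment check. It assumes unit tests are given as (arguments, expected output) pairs and that a crash counts as a failed test; the names `pass_rate` and `alpha_entails` are hypothetical and are not taken from the paper.

```python
from typing import Any, Callable, Sequence, Tuple

# A unit test is (positional arguments, expected output). This layout is an assumption.
UnitTest = Tuple[tuple, Any]


def pass_rate(candidate: Callable[..., Any], tests: Sequence[UnitTest]) -> float:
    """Fraction of unit tests on which the candidate reproduces the expected output."""
    if not tests:
        raise ValueError("need at least one unit test")
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash is treated as a failed test
    return passed / len(tests)


def alpha_entails(candidate: Callable[..., Any], tests: Sequence[UnitTest], alpha: float) -> bool:
    """Empirical check: the candidate matches the expected behavior on at least 1 - alpha of the tests."""
    return pass_rate(candidate, tests) >= 1.0 - alpha


# Toy example: the "true code" is abs; one candidate is correct, one ignores negative inputs.
tests = [((x,), abs(x)) for x in range(-3, 4)]
correct = lambda x: x if x >= 0 else -x
buggy = lambda x: x

print(alpha_entails(correct, tests, alpha=0.1))
print(alpha_entails(buggy, tests, alpha=0.1))
```

The pass rate here is only an estimate of the underlying probability, so more unit tests give a more reliable entailment decision.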
Method
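A selective generator learns a selection function s(x) = 1(f(x, G(x)) >= τ), where G is the code generator, f is a scoring function, τ is a threshold, and 1(·) is the indicator function. The generated code G(x) is returned only when s(x) = 1; otherwise the generator abstains. Learning minimizes τ, so that as many outputs as possible are kept, while bounding the FDR-CE using fuzzing-generated unit tests and a binomial tail bound.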
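The threshold search can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes a calibration set of scores with a boolean entailment label per example, uses an exact binomial (Clopper-Pearson style) upper bound on the error rate among selected examples, and splits the confidence budget δ across candidate thresholds with a union bound. The function names and the parameters `epsilon` and `delta` are assumptions.

```python
import math
from typing import Optional, Sequence


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(Binomial(n, p) <= k)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def binom_upper_bound(k: int, n: int, delta: float) -> float:
    """Upper confidence bound on the error rate after seeing k errors in n trials.

    Found by bisection on the binomial tail: the p at which P(Bin(n, p) <= k) = delta.
    """
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid  # p is still plausible, so the bound lies above it
        else:
            hi = mid
    return hi


def select_threshold(
    scores: Sequence[float],
    entailed: Sequence[bool],
    epsilon: float,
    delta: float,
) -> Optional[float]:
    """Smallest tau whose bounded error rate among selected examples is <= epsilon."""
    if not scores or len(scores) != len(entailed):
        raise ValueError("scores and entailed must be non-empty and of equal length")
    candidates = sorted(set(scores))
    per_tau_delta = delta / len(candidates)  # union bound over candidate thresholds
    for tau in candidates:  # ascending, so the first feasible tau is the smallest
        selected = [ok for s, ok in zip(scores, entailed) if s >= tau]
        errors = sum(1 for ok in selected if not ok)
        if binom_upper_bound(errors, len(selected), per_tau_delta) <= epsilon:
            return tau
    return None  # no threshold meets the target; abstain on everything


# Toy calibration set: high-scoring outputs are all correct, low-scoring ones are mixed.
scores = [0.9] * 50 + [0.2] * 50
entailed = [True] * 50 + [True, False] * 25
print(select_threshold(scores, entailed, epsilon=0.1, delta=0.1))
```

Returning `None` when no threshold qualifies reflects the design choice that abstaining on every input is preferable to violating the target error rate.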
In practice
- Use fuzzing tools to generate extensive unit tests for code evaluation.
- Implement "α-code entailment" to quantify functional correctness.
- Apply selective generation to filter out uncertain code outputs.
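The first and last steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: random input sampling stands in for a coverage-guided fuzzer such as Atheris, the reference implementation is assumed to be available for labeling outputs, and `generate_unit_tests`, `selective_generate`, `generator`, and `scorer` are hypothetical names rather than any library's API.

```python
import random
from typing import Any, Callable, List, Optional, Tuple

UnitTest = Tuple[tuple, Any]  # (positional arguments, expected output)


def generate_unit_tests(
    reference: Callable[..., Any],
    make_input: Callable[[random.Random], tuple],
    n_tests: int,
    seed: int = 0,
) -> List[UnitTest]:
    """Build unit tests by running the reference code on randomly sampled inputs."""
    rng = random.Random(seed)
    tests = []
    for _ in range(n_tests):
        args = make_input(rng)
        tests.append((args, reference(*args)))
    return tests


def selective_generate(
    prompt: str,
    generator: Callable[[str], str],
    scorer: Callable[[str, str], float],
    tau: float,
) -> Optional[str]:
    """Return the generated code only if its score reaches tau; otherwise abstain."""
    code = generator(prompt)
    return code if scorer(prompt, code) >= tau else None


# Toy usage: tests for a sorting task, and a stub generator with a fixed score.
def random_list(rng: random.Random) -> tuple:
    return ([rng.randint(-5, 5) for _ in range(rng.randint(0, 5))],)


sort_tests = generate_unit_tests(sorted, random_list, n_tests=20)
stub_generator = lambda prompt: "def solve(xs):\n    return sorted(xs)"
stub_scorer = lambda prompt, code: 0.3

print(selective_generate("sort a list", stub_generator, stub_scorer, tau=0.5))
```

In a real deployment the stub scorer would be replaced by a model-derived confidence score, and `tau` would come from a calibration step rather than being set by hand.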
Topics
- Code Generation
- Large Language Models
- Code Hallucination
- Fuzzing
- Unit Test Generation
- Selective Prediction
- Functional Correctness
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.