Diff-Based Code Corruption using LLMs for Large-Scale Bugfix Benchmarking
Summary
MegaBugFix is a new large-scale benchmark for evaluating the bugfixing capabilities of Large Language Models (LLMs). It addresses limitations of existing benchmarks: small size, narrow bug diversity, and unrealistic bug types. It comprises 12,629 buggy Python programs, synthesized from correct code by a WizardCoder-13B-Python LLM fine-tuned for diff-based bug injection. Because the model emits each code change as a diff, the method avoids simplistic or unparsable bugs. The benchmark integrates programs from six diverse sources, including HumanEval and QuixBugs, and provides a unified pytest-based evaluation framework. Initial evaluations of 13 open-weight LLMs showed consistently lower performance on MegaBugFix than on established benchmarks, indicating that its bugs are more challenging and realistic. Further experiments showed that fine-tuning smaller LLMs on MegaBugFix significantly improved their bugfixing performance on other benchmarks.
Key takeaway
If you are an AI Scientist or Machine Learning Engineer evaluating LLM bugfixing capabilities, integrate MegaBugFix into your benchmark suite. Its large scale and realistic, diff-generated bugs expose model weaknesses that simpler benchmarks miss, giving a more accurate assessment of real-world performance. Fine-tuning on this dataset can also significantly improve your models' bugfixing abilities. Prioritize benchmarks that reflect diverse, complex bug types to ensure robust model development.
Key insights
Diff-based bug injection using fine-tuned LLMs creates large-scale, realistic benchmarks that challenge existing models.
Principles
- Generating corruptions as diffs ensures the code is actually modified and avoids simplistic or unparsable bugs.
- Diverse program sources and bug types are crucial for robust benchmark design.
- LLM fine-tuning on realistic synthetic bugs improves general bugfixing capabilities.
Method
A WizardCoder-13B-Python LLM is fine-tuned with LoRA to generate diffs from correct Python programs. These diffs are applied to create buggy variants, which are then filtered for quality and integrated into a pytest-based evaluation framework.
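The apply-and-filter step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names (`apply_unified_diff`, `is_usable_bug`), the `clamp` example, and the hand-written diff standing in for model output are all hypothetical. The filter shown (the variant must differ from the original and must still parse) is an assumption, since the summary only says variants are filtered for quality.

```python
import ast
import re

# Matches unified-diff hunk headers such as "@@ -1,6 +1,6 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def apply_unified_diff(source: str, diff: str) -> str:
    """Apply a unified diff to `source`; raise ValueError if it does not fit."""
    src = source.splitlines()
    out = []
    pos = 0  # index of the next unread source line
    lines = diff.splitlines()
    i = 0
    while i < len(lines):
        match = HUNK_RE.match(lines[i])
        if not match:
            i += 1  # skip file headers and anything outside a hunk
            continue
        start = int(match.group(1))
        count = int(match.group(2)) if match.group(2) is not None else 1
        # Hunk starts are 1-based; a zero-length hunk names the line before it.
        hunk_start = start - 1 if count else start
        out.extend(src[pos:hunk_start])
        pos = hunk_start
        i += 1
        while i < len(lines) and not lines[i].startswith("@@"):
            tag, text = lines[i][:1], lines[i][1:]
            if tag in (" ", "-"):
                if pos >= len(src) or src[pos] != text:
                    raise ValueError("diff does not match source")
                if tag == " ":
                    out.append(text)  # context line is kept
                pos += 1              # deleted line is skipped
            elif tag == "+":
                out.append(text)      # added line
            i += 1
    out.extend(src[pos:])
    return "\n".join(out) + "\n"


def is_usable_bug(original: str, buggy: str) -> bool:
    """Assumed quality filter: the variant must differ and must still parse."""
    if buggy == original:
        return False
    try:
        ast.parse(buggy)
    except SyntaxError:
        return False
    return True


CORRECT = """\
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""

# Hand-written diff standing in for the fine-tuned model's output.
DIFF = """\
--- correct.py
+++ buggy.py
@@ -1,6 +1,6 @@
 def clamp(x, lo, hi):
     if x < lo:
         return lo
-    if x > hi:
+    if x > lo:
         return hi
     return x
"""

buggy = apply_unified_diff(CORRECT, DIFF)
keep = is_usable_bug(CORRECT, buggy)
```

Here the diff touches a single comparison, so `buggy` stays syntactically valid while changing behavior. A diff whose context lines do not match the source raises `ValueError` and would be discarded rather than repaired.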
In practice
- Use diff-based generation for synthetic code corruption tasks.
- Integrate pytest and Docker for consistent benchmark evaluation environments.
- Consider MegaBugFix for more rigorous LLM bugfixing evaluations.
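As a rough illustration of the pytest-based evaluation mentioned above, the following sketch shows what a test module for a single benchmark item might look like. The file layout, the `clamp` task, the `load_candidate` helper, and the embedded candidate source are all hypothetical and are not taken from MegaBugFix; a real harness would presumably read the model's proposed fix from disk and run inside a container.

```python
"""test_clamp.py: pytest-style checks for one hypothetical benchmark item."""
import types

# Stand-in for a model's proposed fix (assumption: normally loaded from a file).
CANDIDATE_SOURCE = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""


def load_candidate(source: str, name: str = "candidate") -> types.ModuleType:
    """Execute candidate source in a fresh module so tests can call into it."""
    module = types.ModuleType(name)
    exec(compile(source, f"<{name}>", "exec"), module.__dict__)
    return module


candidate = load_candidate(CANDIDATE_SOURCE)


# pytest collects plain functions named test_* and treats a failed
# assert as a failed test, so no pytest import is needed here.
def test_value_inside_range_is_unchanged():
    assert candidate.clamp(5, 0, 10) == 5


def test_value_below_range_is_raised_to_lower_bound():
    assert candidate.clamp(-3, 0, 10) == 0


def test_value_above_range_is_lowered_to_upper_bound():
    assert candidate.clamp(42, 0, 10) == 10
```

Saved as `test_clamp.py`, the module can be run with `pytest test_clamp.py`. A fix counts as successful only if every test passes, which gives one pass/fail criterion across programs from different sources.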
Topics
- Large Language Models
- Code Bugfixing
- Benchmark Datasets
- Diff Generation
- Program Corruption
- Python Programming
Code references
- TheAlgorithms/Python
- bigcode-project/bigcode-evaluation-harness
- ibm-granite/granite-3.0-language-models
- pytest-dev/pytest
- psf/black
Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.