Diff-Based Code Corruption using LLMs for Large-Scale Bugfix Benchmarking

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

MegaBugFix is a new large-scale benchmark for evaluating the bugfixing capabilities of large language models (LLMs). It addresses three limitations of existing benchmarks: small size, narrow bug diversity, and unrealistic bug types. The benchmark comprises 12,629 buggy Python programs, synthesized from correct code by a WizardCoder-13B-Python LLM fine-tuned for diff-based bug injection. Generating the code changes as diffs avoids simplistic or unparsable bugs. MegaBugFix integrates programs from six diverse sources, including HumanEval and QuixBugs, and provides a unified pytest-based evaluation framework. In initial evaluations, 13 open-weight LLMs performed consistently worse on MegaBugFix than on established benchmarks, which suggests that its bugs are more challenging and realistic. Further experiments showed that fine-tuning smaller LLMs on MegaBugFix significantly improved their bugfixing performance on other benchmarks.

Key takeaway

If you are an AI scientist or machine learning engineer evaluating LLM bugfixing capabilities, consider adding MegaBugFix to your benchmark suite. Its large scale and realistic, diff-generated bugs expose model weaknesses that simpler benchmarks miss, giving a more accurate picture of real-world performance. Fine-tuning on the dataset can also significantly improve your models' bugfixing abilities. More generally, prefer benchmarks that reflect diverse, complex bug types when developing models.

Key insights

Diff-based bug injection using fine-tuned LLMs creates large-scale, realistic benchmarks that challenge existing models.

Principles

Method

A WizardCoder-13B-Python LLM is fine-tuned with LoRA to generate bug-introducing diffs from correct Python programs. Applying each diff to its source program yields a buggy variant. The variants are then filtered for quality and integrated into a pytest-based evaluation framework.
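
The apply-and-filter step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual pipeline: the function names (apply_unified_diff, passes, is_valid_bug) are hypothetical, the diff is hand-written rather than produced by a model, the filter criteria (variant differs from the original, still parses, and fails a test the original passes) are an assumed reading of "filtered for quality", and a plain callable stands in for a pytest test.

```python
import ast
import re

# Matches a unified-diff hunk header such as "@@ -1,6 +1,6 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def apply_unified_diff(source: str, diff: str) -> str:
    """Apply a single-file unified diff to `source` (minimal, strict matching)."""
    src = source.splitlines()
    out = []
    pos = 0  # index of the next unconsumed source line
    lines = diff.splitlines()
    i = 0
    while i < len(lines):
        m = HUNK_RE.match(lines[i])
        if not m:
            i += 1  # skip the ---/+++ file headers and anything outside a hunk
            continue
        old_start = int(m.group(1))
        old_len = int(m.group(2)) if m.group(2) is not None else 1
        # A zero-length old range means "insert after line old_start".
        start = old_start if old_len == 0 else old_start - 1
        out.extend(src[pos:start])  # copy untouched lines before the hunk
        pos = start
        i += 1
        while i < len(lines) and not lines[i].startswith("@@"):
            tag, text = lines[i][:1], lines[i][1:]
            if tag in (" ", "-"):
                # Context and removed lines must match the source exactly.
                if pos >= len(src) or src[pos] != text:
                    raise ValueError("diff does not match source")
                if tag == " ":
                    out.append(text)
                pos += 1
            elif tag == "+":
                out.append(text)
            i += 1
    out.extend(src[pos:])  # copy whatever follows the last hunk
    return "\n".join(out) + "\n"


def passes(code: str, check) -> bool:
    """Run `check` against the namespace produced by executing `code`."""
    namespace = {}
    try:
        exec(code, namespace)
        check(namespace)
        return True
    except Exception:
        return False


def is_valid_bug(original: str, buggy: str, check) -> bool:
    """Assumed quality filter: changed, parsable, and test-breaking."""
    if buggy == original:
        return False
    try:
        ast.parse(buggy)
    except SyntaxError:
        return False
    return passes(original, check) and not passes(buggy, check)


CORRECT = (
    "def max_of(xs):\n"
    "    best = xs[0]\n"
    "    for x in xs[1:]:\n"
    "        if x > best:\n"
    "            best = x\n"
    "    return best\n"
)

# Hand-written stand-in for a model-generated, bug-introducing diff.
DIFF = """\
--- a/max_of.py
+++ b/max_of.py
@@ -1,6 +1,6 @@
 def max_of(xs):
     best = xs[0]
     for x in xs[1:]:
-        if x > best:
+        if x < best:
             best = x
     return best
"""


def check(namespace):
    assert namespace["max_of"]([3, 1, 2]) == 3


buggy = apply_unified_diff(CORRECT, DIFF)
print(is_valid_bug(CORRECT, buggy, check))
```

Here the diff flips a single comparison, so the variant still parses but returns the minimum instead of the maximum, and the filter accepts it. A diff whose context lines do not match the source raises ValueError, which is one simple way to discard malformed generations before they reach the test stage.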

Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.