Diff-Based Code Corruption using LLMs for Large-Scale Bugfix Benchmarking

2024-09-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

MegaBugFix is a new large-scale benchmark for evaluating how well Large Language Models (LLMs) fix bugs. It addresses limitations of existing benchmarks such as small size, narrow bug diversity, and unrealistic bug types. It comprises 12,629 buggy Python programs, synthesized from correct code using a WizardCoder-13B-Python LLM fine-tuned for diff-based bug injection. Because the model emits each code change as a diff, every variant actually modifies the program, and simplistic or unparsable bugs are avoided. The benchmark integrates programs from six diverse sources, including HumanEval and QuixBugs, and provides a unified pytest-based evaluation framework. Initial evaluations of 13 open-weight LLMs showed consistently lower performance on MegaBugFix than on established benchmarks, indicating that its bugs are more challenging and realistic. Further experiments showed that fine-tuning smaller LLMs on MegaBugFix significantly improved their bugfixing performance on other benchmarks.

Key takeaway

If you evaluate LLM bugfixing capabilities as an AI scientist or machine learning engineer, consider integrating MegaBugFix into your benchmark suite. Its large scale and diff-generated, realistic bugs expose model weaknesses that simpler benchmarks miss, giving a more accurate picture of real-world performance. Fine-tuning on this dataset can also significantly improve your models' bugfixing abilities. Prioritize benchmarks that reflect diverse, complex bug types to support robust model development.

Key insights

Diff-based bug injection using fine-tuned LLMs creates large-scale, realistic benchmarks that challenge existing models.

Principles

Diff generation for code corruption ensures actual modification and avoids simplistic bugs.
Diverse program sources and bug types are crucial for robust benchmark design.
LLM fine-tuning on realistic synthetic bugs improves general bugfixing capabilities.

Method

A WizardCoder-13B-Python LLM is fine-tuned with LoRA to generate diffs from correct Python programs. These diffs are applied to create buggy variants, which are then filtered for quality and integrated into a pytest-based evaluation framework.
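
The filtering step is not detailed in this summary. As a minimal sketch of what such a quality filter could look like, the Python below renders a corruption as a unified diff and keeps only variants that still parse and whose syntax tree differs from the original. The names make_diff and is_plausible_bug are hypothetical, and the two checks are assumptions based on the goals stated above (actual modification, no unparsable bugs), not the authors' actual criteria.

```python
import ast
import difflib


def make_diff(original: str, corrupted: str) -> str:
    """Render the change between two program versions as a unified diff."""
    return "".join(
        difflib.unified_diff(
            original.splitlines(keepends=True),
            corrupted.splitlines(keepends=True),
            fromfile="correct.py",
            tofile="buggy.py",
        )
    )


def is_plausible_bug(original: str, corrupted: str) -> bool:
    """Hypothetical filter: keep variants that parse and change the AST.

    Assumes `original` is valid Python, since it comes from correct code.
    """
    try:
        corrupted_tree = ast.parse(corrupted)
    except SyntaxError:
        # Unparsable corruption: reject.
        return False
    # ast.dump ignores comments and whitespace, so edits that only touch
    # those produce an identical dump and are rejected as non-modifications.
    return ast.dump(ast.parse(original)) != ast.dump(corrupted_tree)


if __name__ == "__main__":
    correct = "def add(a, b):\n    return a + b\n"
    buggy = "def add(a, b):\n    return a - b\n"
    print(make_diff(correct, buggy))
    print(is_plausible_bug(correct, buggy))
```

A filter like this only checks that a change is syntactically valid and non-trivial. It does not confirm that the variant fails its tests, which would require running them.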

In practice

Use diff-based generation for synthetic code corruption tasks.
Integrate pytest and Docker for consistent benchmark evaluation environments.
Consider MegaBugFix for more rigorous LLM bugfixing evaluations.
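
To illustrate the pytest-based evaluation idea, here is a rough sketch of a pass/fail check for a candidate fix. It writes the candidate and its tests to a temporary directory and runs a test command in a subprocess. The function name passes_tests, the file names solution.py and test_solution.py, and the timeout are all assumptions for illustration and do not reflect the MegaBugFix framework's actual layout. Container isolation such as Docker is left out.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_source, test_source, command=None, timeout=30.0):
    """Return True if the test command exits with status 0.

    Hypothetical harness: by default it runs pytest (which must be
    installed) on test_solution.py next to the candidate in solution.py.
    """
    if command is None:
        command = [sys.executable, "-m", "pytest", "-q", "test_solution.py"]
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "solution.py").write_text(candidate_source)
        Path(workdir, "test_solution.py").write_text(test_source)
        try:
            result = subprocess.run(
                command, cwd=workdir, capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            # Treat hangs (for example, an injected infinite loop) as failures.
            return False
    return result.returncode == 0


if __name__ == "__main__":
    tests = (
        "from solution import add\n\n"
        "def test_add():\n"
        "    assert add(2, 3) == 5\n"
    )
    fixed = "def add(a, b):\n    return a + b\n"
    print(passes_tests(fixed, tests))
```

Running each candidate in a fresh directory and subprocess keeps one program's state from leaking into the next. A container would add isolation from the host environment on top of that.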

Topics

Large Language Models
Code Bugfixing
Benchmark Datasets
Diff Generation
Program Corruption
Python Programming

Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.