On the Reproducibility of Quantum Software Defect Datasets: A Case Study of Bugs4Q

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

A study investigated the reproducibility of Bugs4Q, a widely used quantum software defect dataset, across 77,700 quantum program executions of 37 artifacts over 21 Qiskit core-library versions spanning three years. Reproducibility declined sharply, from 62.2% on Qiskit v0.20.1 to 16.2% on v2.3.1 (the latest version as of April 1, 2026), and 83.8% of artifacts failed to reproduce at least once. Manual analysis of 543 failures showed that 93.6% were dependency-related, consistent with findings for classical software defect datasets. One key difference emerged, however: only 4.6% of Bugs4Q's failures could be resolved by dependency updates alone, and the majority required source-code modifications. Based on these findings, the researchers curated Bugs4Q-Robust, a patched version that raised reproducibility to 78.4% on Qiskit v2.3.1, demonstrating the need for continuous source-level maintenance in evolving quantum ecosystems.

Key takeaway

Research scientists and software engineers who rely on quantum defect datasets like Bugs4Q must account for rapid API evolution. Document the exact Qiskit version and environment used in each study, because reproducibility degrades significantly over time and restoring it often requires source-code modifications, not just dependency pinning. Consider using or contributing to continuously maintained, patched datasets like Bugs4Q-Robust to keep results valid and comparable.
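
One way to act on the advice to document the exact version and environment is to write a small manifest next to each experiment's results. The following is a minimal sketch, not something taken from the study: the function name `capture_environment` and the manifest layout are assumptions made for illustration, and it relies only on the Python standard library.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages=("qiskit",)):
    """Return a dict describing the interpreter, OS, and package versions.

    `packages` is a hypothetical list of distribution names to record;
    a package that is not installed is recorded as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Print the manifest as JSON so it can be stored alongside results.
    print(json.dumps(capture_environment(), indent=2, sort_keys=True))
```

A manifest like this records which versions were used but does not make old code run on a newer Qiskit release. According to the summary above, most failures needed source changes rather than dependency updates.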

Key insights

Quantum software defect datasets lose reproducibility quickly as the ecosystem evolves, and restoring it often requires source-level patches.

Method

The study conducted an operational replication using Bugs4Q, analyzing 77,700 executions across 21 Qiskit versions, classifying 543 failures, and curating a patched dataset, Bugs4Q-Robust.
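
The two headline metrics, the per-version reproducibility rate and the share of artifacts that fail at least once, can be sketched as below. This is an illustrative sketch under stated assumptions, not the study's tooling: it assumes one boolean outcome per artifact and version, the function names are hypothetical, and the data is a toy placeholder. As a side note, 77,700 executions over 37 artifacts and 21 versions works out to 100 runs per artifact-version pair if they were spread evenly, which is an inference from the figures rather than something the summary states.

```python
from collections import defaultdict


def reproducibility_by_version(outcomes):
    """Map each version to the fraction of artifacts that reproduced on it.

    `outcomes` is an iterable of (artifact_id, version, reproduced) tuples,
    with one entry per artifact-version pair (an assumed input format).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for _artifact, version, reproduced in outcomes:
        totals[version] += 1
        successes[version] += int(reproduced)
    return {version: successes[version] / totals[version] for version in totals}


def share_failing_at_least_once(outcomes):
    """Fraction of artifacts with at least one failed reproduction."""
    artifacts = set()
    failed = set()
    for artifact, _version, reproduced in outcomes:
        artifacts.add(artifact)
        if not reproduced:
            failed.add(artifact)
    if not artifacts:
        return 0.0  # no data: report zero rather than divide by zero
    return len(failed) / len(artifacts)


# Toy placeholder data, not values from the study.
toy_outcomes = [
    ("a1", "v1", True),
    ("a2", "v1", True),
    ("a1", "v2", True),
    ("a2", "v2", False),
]

rates = reproducibility_by_version(toy_outcomes)
ever_failed = share_failing_at_least_once(toy_outcomes)
```

Keeping the two metrics separate matters because they answer different questions: the per-version rate shows decay over time, while the at-least-once share shows how many artifacts are fragile under any version change.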

Best for: AI Scientist, Research Scientist, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.