Think Harder and Don't Overlook Your Options: Revisiting Issue-Commit Linking with LLM-Assisted Retrieval

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Software Development & Engineering, Artificial Intelligence & Machine Learning) · Depth: Advanced, extended

Summary

This study revisits issue-commit linking, a critical aspect of software traceability, by evaluating a range of retrieval and reranking techniques. The researchers assessed established methods (BTLink, EasyLink, FRLink, RCLinker, and Hybrid-Linker) alongside modern deep learning models and large language models (ChatGPT, Qwen, Gemma, Llama). The evaluation covered two steps: efficiently retrieving relevant commits, then refining their ranking. Dense retrieval outperformed sparse retrieval at identifying relevant commits, and combining the two (e.g., via Reciprocal Rank Fusion) can improve recall. Crucially, traditional machine learning-based reranking techniques, such as RCLinker, achieved higher performance than LLM-based approaches, which suggests that simpler models warrant careful consideration before adopting computationally expensive LLM solutions for large-scale issue-commit linking. The study used three datasets (BTLink, EasyLink, and a new Apache dataset) and applied a hybrid temporal window that captures 97% of true links.
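To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion, which scores each candidate as the sum of 1 / (k + rank) over the input rankings. The function name, the commit identifiers, and the choice of k = 60 (a commonly used default) are illustrative assumptions, not details taken from the study.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of commit ids into one list.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, commit_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the commits it contains.
            scores[commit_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical candidate lists for a single issue.
dense_ranking = ["c3", "c1", "c7"]
sparse_ranking = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

A commit that appears in only one list still receives a score, so the fused list is the union of both candidate sets. That is one plausible reason fusion can raise recall.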

Key takeaway

Machine learning engineers optimizing software traceability should evaluate traditional machine learning models and hybrid retrieval strategies before committing to expensive LLM-based solutions. Combining dense and sparse retrieval with a reranker like RCLinker, which leverages contextual metadata, can deliver robust issue-commit linking performance. In the study, this kind of reranker outperformed LLM-based approaches, and it does so at lower computational cost, which makes it more practical for large-scale, real-time integration.

Key insights

Dense retrieval outperforms sparse retrieval, and traditional ML rerankers such as RCLinker outperform LLM-based reranking for issue-commit linking, so simpler solutions deserve evaluation first.

Principles

Method

Temporal filtering first narrows the set of candidate commits. A two-stage pipeline then retrieves candidates (dense, sparse, or hybrid) and reranks them using ML- or LLM-based techniques.
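The temporal filtering step can be sketched as follows. The summary does not specify how the study's hybrid window is defined, so this example assumes a simple window that spans the issue's creation and closing timestamps plus a fixed margin. The `Commit` type, the `temporal_candidates` function, and the 7-day margin are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Commit:
    sha: str
    committed_at: datetime


def temporal_candidates(commits, issue_created, issue_closed,
                        margin=timedelta(days=7)):
    """Keep commits inside [created - margin, closed + margin].

    The window definition and the margin are assumptions for illustration.
    """
    start = issue_created - margin
    end = issue_closed + margin
    return [c for c in commits if start <= c.committed_at <= end]


# Hypothetical data: one issue open from Jan 10 to Jan 15, 2024.
commits = [
    Commit("c1", datetime(2024, 1, 1)),
    Commit("c2", datetime(2024, 1, 12)),
    Commit("c3", datetime(2024, 1, 20)),
    Commit("c4", datetime(2024, 2, 1)),
]
kept = temporal_candidates(commits, datetime(2024, 1, 10), datetime(2024, 1, 15))
kept_shas = [c.sha for c in kept]
```

Only the commits that survive this filter would be passed on to retrieval and reranking, which keeps the later, more expensive stages small.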

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.