Enhancing Software Maintenance: A Learning to Rank Approach for Co-changed Method Identification

2007-02-20 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The study proposes a learning-to-rank (LtR) approach that identifies co-changed methods at the pull-request level, addressing limitations of earlier commit-level approaches. It combines source code characteristics with historical code change data. In an evaluation across 150 open-source Java projects, covering 41.5 million lines of code and 634,216 pull requests, the Random Forest model performed best. It surpassed other LtR models by 2.5%–12.8% in NDCG@5 and outperformed five baseline methods, including StarCoder 2, by 4.7%–537.5% in NDCG@5. Models trained on 90–180 days of historical data yielded more consistent results, and prediction accuracy declined about 60 days after training, which suggests retraining roughly every two months. The most predictive features are the number of co-changes, file path similarity, and author similarity.
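Since the results above are reported in NDCG@5, a minimal sketch of that metric may help. It assumes binary relevance (1 if a ranked candidate method really co-changed in the pull request, 0 otherwise) and a linear gain; the summary does not state which gain formula the study used, and the example ranking is hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """NDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: 1 = the candidate method really co-changed in the pull request.
ranked_relevance = [1, 0, 1, 0, 0, 1]
score = ndcg_at_k(ranked_relevance, k=5)
print(f"NDCG@5 = {score:.3f}")
```

A score of 1.0 means every truly co-changed method was ranked ahead of every unrelated one within the top five positions.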

Key takeaway

If you maintain complex systems, consider learning-to-rank models to identify co-changed methods at the pull-request level. Surfacing implicit dependencies this way helps you localize changes, reduce bugs, and improve code quality. Retrain the models about every two months, ideally on 90–180 days of historical data, to maintain prediction accuracy. Code reviewers and testers can use the same rankings to prioritize inspections and test cases and to improve coverage.

Key insights

Learning-to-rank models effectively predict co-changed methods by combining static and historical code features at the pull-request level.

Principles

Co-change detection benefits from pull-request level analysis.
Historical data length impacts model consistency.
Model retraining is crucial for sustained accuracy.

Method

A learning-to-rank (LtR) approach combines source code characteristics (semantic, path, dependency, hierarchy, clone, argument similarities) with code change history (number of co-changes, author similarity) to predict and rank likely co-changes at the pull-request level.
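To make the feature side of this concrete, here is a rough sketch that computes the three features the summary names as most predictive (number of co-changes, file path similarity, author similarity) from a toy pull-request history. Everything in it is an assumption for illustration: the record layout, the `file#method` naming, the similarity definitions, and especially `score_fn=sum`, which is only a stand-in for a trained ranker such as the Random Forest used in the study.

```python
from collections import Counter
from itertools import combinations

# Hypothetical pull-request history: each PR lists the methods it changed and its author.
PULL_REQUESTS = [
    {"author": "ana", "methods": ["core/Cache.java#get", "core/Cache.java#put"]},
    {"author": "ana", "methods": ["core/Cache.java#get", "core/Store.java#load"]},
    {"author": "bo", "methods": ["core/Cache.java#get", "core/Cache.java#put", "web/Api.java#handle"]},
]

def co_change_counts(prs):
    """Count how many pull requests changed each unordered pair of methods."""
    counts = Counter()
    for pr in prs:
        for a, b in combinations(sorted(set(pr["methods"])), 2):
            counts[(a, b)] += 1
    return counts

def path_similarity(m1, m2):
    """Fraction of leading path components shared by the two methods' files."""
    p1, p2 = m1.split("#")[0].split("/"), m2.split("#")[0].split("/")
    shared = 0
    for x, y in zip(p1, p2):
        if x != y:
            break
        shared += 1
    return shared / max(len(p1), len(p2))

def author_similarity(prs, m1, m2):
    """Jaccard overlap of the sets of authors who changed each method."""
    a1 = {pr["author"] for pr in prs if m1 in pr["methods"]}
    a2 = {pr["author"] for pr in prs if m2 in pr["methods"]}
    return len(a1 & a2) / len(a1 | a2) if a1 | a2 else 0.0

def features(prs, counts, query, candidate):
    """Feature vector for one (query method, candidate method) pair."""
    pair = tuple(sorted((query, candidate)))
    return [counts[pair], path_similarity(query, candidate), author_similarity(prs, query, candidate)]

def rank_candidates(prs, query, candidates, score_fn=sum):
    """Order candidates by score; score_fn stands in for a trained model's prediction."""
    counts = co_change_counts(prs)
    scored = [(score_fn(features(prs, counts, query, c)), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)]

ranking = rank_candidates(
    PULL_REQUESTS,
    "core/Cache.java#get",
    ["web/Api.java#handle", "core/Cache.java#put", "core/Store.java#load"],
)
print(ranking)
```

In a real pipeline the feature vectors would be labeled with whether the pair actually co-changed in later pull requests and passed to a learned ranking model instead of the placeholder sum.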

In practice

Use LtR rankings to prioritize likely co-changed methods where dependencies are complex.
Retrain co-change prediction models every 60 days.
Focus on the most predictive features: number of co-changes, file path similarity, and author similarity.
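The retraining advice can be sketched as a small scheduling helper. This is an illustrative assumption rather than the study's tooling: the `merged` field, the function names, and the choice of the 180-day upper bound for the training window are all hypothetical.

```python
from datetime import date, timedelta

RETRAIN_INTERVAL = timedelta(days=60)   # accuracy reportedly declines after about 60 days
TRAINING_WINDOW = timedelta(days=180)   # upper end of the 90-180 day range (assumed choice)

def needs_retraining(last_trained, today):
    """True once the model is at least RETRAIN_INTERVAL old."""
    return today - last_trained >= RETRAIN_INTERVAL

def select_training_prs(prs, today):
    """Keep only pull requests merged within the training window before today."""
    start = today - TRAINING_WINDOW
    return [pr for pr in prs if start <= pr["merged"] < today]

# Hypothetical pull-request records.
history = [
    {"id": 1, "merged": date(2024, 1, 1)},
    {"id": 2, "merged": date(2024, 5, 1)},
    {"id": 3, "merged": date(2024, 6, 20)},
]
today = date(2024, 7, 1)

if needs_retraining(last_trained=date(2024, 5, 1), today=today):
    recent = select_training_prs(history, today)
    print([pr["id"] for pr in recent])
```

Shrinking `TRAINING_WINDOW` toward 90 days trades training volume for recency; the summary only indicates that windows in the 90–180 day range gave more consistent results.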

Topics

Software Maintenance
Learning to Rank
Co-change Detection
Random Forest
Pull Request Analysis
Software Repository Mining

Code references

apache/shenyu

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.