Project-wise Comparison of Software Birthmarks Using Weighted Partial Similarity

2013-01-10 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Software Development & Engineering, Cybersecurity & Data Privacy, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A project-wise software birthmark comparison framework addresses challenges in detecting code plagiarism, particularly partial reuse and incidental similarity in large-scale software. This framework, based on symmetric aggregation of module-level similarities, introduces a weighting scheme that prioritizes larger modules and a partial similarity method focusing on the top fraction of highly similar module pairs. Evaluated on 35 open-source Java projects across ten categories, treating different versions as reuse cases, the method consistently outperformed existing approaches. It achieved robust and stable detection of partial code reuse at the project level, with optimal performance observed using small comparison scopes (1-5%), k-gram sizes of 3-4, and edit distance as the module-wise similarity function. The dataset and experimental artifacts are publicly available.

Key takeaway

Software engineers and research scientists who need to detect code reuse in large Java projects should consider a project-wise birthmark comparison framework. Weight modules by size, restrict the comparison to the top 1-5% of most similar module pairs, and use edit distance as the module-wise similarity function, with k-gram sizes of 3 or 4. In the reported evaluation, this configuration detected reuse more accurately and more stably than existing approaches, reducing false positives from small, generic modules while still identifying partial code reuse.

Key insights

Project-wise software birthmark comparison requires weighting module similarities and focusing on top matches to detect partial reuse and mitigate incidental similarity.

Principles

Symmetric aggregation provides a strong baseline for project similarity.
Weighting modules by size reduces false positives from incidental similarity.
Partial similarity enhances reuse detection by focusing on relevant module pairs.

Method

The framework uses symmetric aggregation, applies logarithmic weighting to module birthmark sizes, and employs partial similarity by selecting the top α% of weighted module-pair similarities.
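The three steps can be sketched in Python as follows. This is a minimal illustration, not the paper's exact formulation: the function name `project_similarity`, the best-match rule per module, the weight `log(1 + size)`, and averaging the two directions are all assumptions made here for concreteness.

```python
import math


def project_similarity(sim, sizes_a, sizes_b, alpha=0.05):
    """Hypothetical project-wise score from a module-wise similarity matrix.

    sim[i][j] is the similarity in [0, 1] between module i of project A
    and module j of project B; sizes_a and sizes_b are birthmark sizes.
    """

    def directed(rows, sizes):
        # For each module, take its best match in the other project and
        # a logarithmic weight based on its birthmark size (assumed form).
        scored = []
        for row, size in zip(rows, sizes):
            best = max(row) if row else 0.0
            scored.append((best, math.log(1 + size)))
        # Partial similarity: keep only the top alpha fraction of modules,
        # ranked by weighted similarity.
        scored.sort(key=lambda bw: bw[0] * bw[1], reverse=True)
        keep = max(1, math.ceil(alpha * len(scored)))
        top = scored[:keep]
        total_weight = sum(w for _, w in top)
        if total_weight == 0:
            return 0.0
        return sum(b * w for b, w in top) / total_weight

    # Symmetric aggregation: score A against B and B against A, then average.
    a_to_b = directed(sim, sizes_a)
    b_to_a = directed([list(col) for col in zip(*sim)], sizes_b)
    return (a_to_b + b_to_a) / 2


# Usage: one large module is shared, one small module is only weakly similar.
sim = [[1.0, 0.0], [0.0, 0.2]]
narrow = project_similarity(sim, [100, 10], [100, 10], alpha=0.5)
full = project_similarity(sim, [100, 10], [100, 10], alpha=1.0)
```

With the narrow scope, only the strongly matching large module contributes, so the partially reused project still scores high; widening the scope to all modules dilutes the score with the weak match.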

In practice

Filter out external modules and small modules (LLC ≤ 30) before comparison for cleaner data.
Use k-gram sizes k ∈ {3,4} for balanced performance.
Prioritize edit distance for module-wise similarity calculations.
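As a rough illustration of the k-gram and edit-distance recommendations, the sketch below builds a k-gram birthmark from a module's instruction sequence and compares two modules with a normalized Levenshtein distance. The helper names, the use of opcode strings as input, and the normalization by the longer sequence are assumptions for this example; the paper may define the birthmark and the similarity differently.

```python
def kgrams(seq, k=3):
    """Return the sequence of overlapping k-grams (as tuples) of seq."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def edit_distance(a, b):
    """Levenshtein distance between two sequences, using two DP rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def module_similarity(ops_a, ops_b, k=3):
    """Assumed similarity: 1 - edit distance / length of the longer birthmark."""
    grams_a, grams_b = kgrams(ops_a, k), kgrams(ops_b, k)
    longest = max(len(grams_a), len(grams_b))
    if longest == 0:
        # Both modules are shorter than k: no evidence, so score 0 (assumption).
        return 0.0
    return 1.0 - edit_distance(grams_a, grams_b) / longest


# Usage with hypothetical opcode sequences.
ops = ["aload", "invokevirtual", "istore", "iload", "ireturn"]
other = ["new", "dup", "ldc", "putfield", "return"]
same_score = module_similarity(ops, ops, k=3)
diff_score = module_similarity(ops, other, k=3)
```

Scores produced this way could fill the module-wise similarity matrix that the project-level aggregation consumes.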

Topics

Software Birthmarks
Code Plagiarism Detection
Project-wise Similarity
Weighted Similarity
Partial Code Reuse
Java Software Analysis
Edit Distance

Best for: AI Scientist, Research Scientist, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.