Deforking the World of Code: A Project-Provenance Map that Recovers Cross-Forge Fork Families that Platform Graphs Cannot See

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Software Development & Engineering, Data Science & Analytics, Artificial Intelligence & Machine Learning) · Depth: Expert, extended

Summary

A new deforking map for the World of Code (WoC) V2604, named p2PFull, collapses raw repositories into deforked projects based on shared Git commit history. The map corrects the inflation of spread-based measures caused by forks. Its construction uses a hub-node star encoding and parallel Louvain clustering over 51.79M shared-commit groups. A key finding is that a naive union over shared history over-merges and produces giant clusters: the largest uncapped cluster contains 861,948 repositories. A size cap of C=250 removes boilerplate-hub bridges, shrinking the largest community by 78.7% to 183,654 repositories and restoring 769% of the author spread signal. Validation against GitHub's declared fork graph, built from GHArchive ForkEvents, shows 99.01% edge agreement for repositories present in WoC. Crucially, the map identifies cross-forge fork families (5.41%) and families rooted off GitHub (1.51%) that platform-specific graphs cannot see. The authors also release a refreshed fork-exclusion list (134.1M children) and a detached-fork inventory (455,550 hard-detached edges).

Key takeaway

For data scientists analyzing global software repository data, platform-based fork graphs are incomplete, and counting forks as separate projects inflates metrics. Adopt the new p2PFull.V2604.cap250.s deforking map to identify true project families. The map corrects for over-merging and reveals cross-forge and detached forks, giving a more precise foundation for spread- and popularity-based analyses. Use the refreshed fork-exclusion list (134.1M children) to skip known forks during data collection.
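To make the takeaway concrete, the sketch below shows how a repository-to-project map can deflate a spread measure such as the number of projects an author has touched. It is a minimal illustration, not the authors' tooling: the `repo;project` line format, the helper names `parse_map` and `author_spread`, and all repository names are assumptions made for this example.

```python
def parse_map(lines):
    """Parse hypothetical 'repo;project' lines into a dict (format is an assumption)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        repo, project = line.split(";", 1)
        mapping[repo] = project
    return mapping


def author_spread(author_repos, repo_to_project=None):
    """Count distinct repositories (or deforked projects) per author."""
    spread = {}
    for author, repos in author_repos.items():
        if repo_to_project is None:
            keys = set(repos)  # raw count: every fork is a separate unit
        else:
            # Repositories missing from the map are treated as their own project.
            keys = {repo_to_project.get(r, r) for r in repos}
        spread[author] = len(keys)
    return spread


# Toy data: three repositories that are all forks of one project, plus one unrelated repo.
toy_map = parse_map([
    "github.com_alice_tool;github.com_alice_tool",
    "github.com_bob_tool;github.com_alice_tool",
    "gitlab.com_carol_tool;github.com_alice_tool",
])
toy_authors = {
    "dev1": ["github.com_alice_tool", "github.com_bob_tool", "gitlab.com_carol_tool"],
    "dev2": ["github.com_bob_tool", "github.com_dave_other"],
}

raw = author_spread(toy_authors)
deforked = author_spread(toy_authors, toy_map)
print(raw, deforked)
```

In this toy data the same commits propagated into three forks, so the raw count credits `dev1` with three repositories while the deforked count credits one project.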

Key insights

Commit-based deforking accurately maps software provenance, revealing cross-forge and detached fork families invisible to platform-specific graphs.

Principles

Method

The deforking map is built by encoding 51.79M shared-commit groups as a hub-node star graph, then applying parallel Louvain clustering with a C=250 size cap to define projects.
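The effect of the star encoding and the size cap can be illustrated on toy data. The sketch below is not the authors' pipeline: it substitutes plain connected components (union-find) for parallel Louvain clustering, and it assumes the cap C drops any shared-commit group that links more than C repositories. The function names, group identifiers, and repository names are hypothetical.

```python
def star_edges(groups, cap=None):
    """Yield (hub, repo) edges: one hub node per shared-commit group.

    A group of k repositories costs k edges instead of k*(k-1)/2 pairwise edges.
    Groups larger than `cap` are skipped (assumed meaning of the size cap).
    """
    for gid, repos in groups.items():
        if cap is not None and len(repos) > cap:
            continue
        hub = ("hub", gid)
        for repo in repos:
            yield hub, ("repo", repo)


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(groups, cap=None):
    """Connected components over the star graph (stand-in for Louvain)."""
    parent = {}
    # Register every repository so capped-out repos survive as singletons.
    for repos in groups.values():
        for repo in repos:
            parent.setdefault(("repo", repo), ("repo", repo))
    for hub, repo in star_edges(groups, cap):
        parent.setdefault(hub, hub)
        parent[find(parent, hub)] = find(parent, repo)
    members = {}
    for node in parent:
        if node[0] == "repo":  # hub nodes are scaffolding, not output
            members.setdefault(find(parent, node), []).append(node[1])
    return sorted(sorted(m) for m in members.values())


toy_groups = {
    "c1": ["gh/a/app", "gl/a/app"],            # cross-forge fork pair
    "c2": ["gh/b/lib", "gh/c/lib"],            # ordinary fork pair
    "boilerplate": ["gh/a/app", "gl/a/app",    # template commit shared widely
                    "gh/b/lib", "gh/c/lib", "gh/d/x"],
}

naive = cluster(toy_groups)           # boilerplate hub bridges everything
capped = cluster(toy_groups, cap=3)   # toy cap of 3 instead of C=250
print(naive)
print(capped)
```

With no cap, the boilerplate group merges all five repositories into one cluster. With the toy cap, the two fork families separate and the unrelated repository stands alone, which mirrors on a small scale how the cap removes boilerplate-hub bridges.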

Best for: AI Scientist, Research Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.