Deforking the World of Code: A Project-Provenance Map that Recovers Cross-Forge Fork Families that Platform Graphs Cannot See
Summary
A new deforking map for World of Code (WoC) V2604, named p2PFull, collapses raw repositories into deforked projects based on shared Git commit history. It addresses the inflation that forks introduce into spread-based measures. The construction uses a hub-node star encoding and parallel Louvain clustering over 51.79M shared-commit groups. A key finding is that a naive union over shared history over-merges and produces giant clusters: the largest uncapped cluster contains 861,948 repositories. A size cap of C=250 removes boilerplate-hub bridges, shrinking the largest community by 78.7% to 183,654 repositories and restoring 769% of the author spread signal. Validation against GitHub's declared fork graph, taken from GHArchive ForkEvents, shows 99.01% edge agreement for repositories present in WoC. The map also identifies cross-forge fork families (5.41%) and families rooted off GitHub (1.51%) that platform-specific graphs cannot see. The authors additionally release a refreshed fork-exclusion list (134.1M children) and a detached-fork inventory (455,550 hard-detached edges).
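Two of the figures in the summary can be illustrated with a few lines of Python. The sketch below recomputes the largest-cluster shrinkage from the two reported sizes and shows one plausible way to measure edge agreement. The definition used here (the share of declared fork edges, with both endpoints present in the map, whose endpoints land in the same project) is an assumption, and the function names and toy data are hypothetical rather than taken from the paper.

```python
def shrink_pct(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


def edge_agreement(fork_edges, repo_to_project):
    """Share of declared fork edges whose endpoints share a project.

    Assumption: edges with an endpoint missing from the map are skipped,
    mirroring the "repositories present in WoC" restriction.
    """
    covered = [(p, c) for p, c in fork_edges
               if p in repo_to_project and c in repo_to_project]
    if not covered:
        return None  # nothing to compare
    agree = sum(repo_to_project[p] == repo_to_project[c] for p, c in covered)
    return agree / len(covered)


# Reported sizes of the largest cluster, uncapped vs. capped at C=250.
print(round(shrink_pct(861_948, 183_654), 1))  # 78.7

# Hypothetical toy data: one agreeing edge, one disagreeing, one uncovered.
toy_map = {"a/x": "P1", "b/x": "P1", "c/y": "P2", "d/y": "P1"}
toy_edges = [("a/x", "b/x"), ("c/y", "d/y"), ("a/x", "zz/missing")]
print(edge_agreement(toy_edges, toy_map))  # 0.5
```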
Key takeaway
For data scientists analyzing global software repository data, platform-based fork graphs are incomplete, and uncollapsed forks inflate spread and popularity metrics. Adopt the p2PFull.V2604.cap250.s deforking map to identify project families from shared commit history. The map corrects for over-merging and reveals cross-forge and detached forks, giving a more precise foundation for spread- and popularity-based analyses. Use the refreshed fork-exclusion list (134.1M children) to skip fork children during data collection.
Key insights
Commit-based deforking accurately maps software provenance, revealing cross-forge and detached fork families invisible to platform-specific graphs.
Principles
- Shared commit history defines true fork families.
- Naive shared-history union leads to over-merging.
- Commit-based provenance transcends platform metadata.
Method
The deforking map is built by encoding 51.79M shared-commit groups as a hub-node star graph, then applying parallel Louvain clustering with a C=250 size cap to define projects.
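The star encoding and the effect of a size cap can be sketched on toy data. In a star encoding, each shared-commit group becomes one hub node linked to its member repositories, so a group of k repositories costs k edges instead of the k(k-1)/2 edges of a clique. The sketch below makes two labeled simplifications: it swaps parallel Louvain for plain connected components (union-find) to stay dependency-free, and it assumes the cap works by dropping hub nodes whose group exceeds C members, which is one reading of the summary and not a confirmed detail. All names and data are hypothetical.

```python
from collections import defaultdict


def star_edges(groups):
    """One hub node per shared-commit group, linked to each member repo."""
    return [(("hub", gid), ("repo", repo))
            for gid, repos in enumerate(groups)
            for repo in repos]


def cluster(groups, cap=None):
    """Connected components over the star graph (stand-in for Louvain).

    Assumption: `cap` drops any group with more than `cap` member repos.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Register every repo so that dropped groups still yield singletons.
    for repos in groups:
        for repo in repos:
            find(("repo", repo))

    kept = [g for g in groups if cap is None or len(g) <= cap]
    for hub, repo in star_edges(kept):
        root_hub, root_repo = find(hub), find(repo)
        if root_hub != root_repo:
            parent[root_repo] = root_hub

    members = defaultdict(set)
    for node in list(parent):
        if node[0] == "repo":
            members[find(node)].add(node[1])
    return sorted(sorted(m) for m in members.values())


# Hypothetical groups; the third plays the role of a boilerplate hub.
groups = [["a/x", "b/x"], ["b/x", "c/x"], ["a/x", "d/y", "e/z"], ["d/y", "f/y"]]
print(len(star_edges(groups)))  # 9 star edges
print(cluster(groups))          # one merged cluster of six repos
print(cluster(groups, cap=2))   # three clusters once the large group is dropped
```

With the cap in place, the oversized group no longer bridges the two unrelated families, which mirrors the over-merging problem described above at toy scale.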
In practice
- Use the p2PFull.V2604.cap250.s map for WoC analyses.
- Exclude the 134.1M fork children named in the refreshed fork-exclusion list.
- Identify cross-forge projects via commit-based clustering.
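The practices above might look as follows once a repository-to-project map is loaded. This is a minimal sketch under stated assumptions: the map is represented as a plain dictionary, each repository carries an explicit forge label, and the exclusion list is a set of repository names. These structures, the function names, and the toy data are hypothetical and do not reflect the released file formats.

```python
from collections import defaultdict


def cross_forge_projects(repo_to_project, repo_forge):
    """Projects whose member repositories span more than one forge."""
    forges = defaultdict(set)
    for repo, project in repo_to_project.items():
        forges[project].add(repo_forge[repo])
    return {p: sorted(f) for p, f in forges.items() if len(f) > 1}


def drop_fork_children(repos, excluded):
    """Keep only repositories that are not on the fork-exclusion list."""
    return [r for r in repos if r not in excluded]


# Hypothetical toy inputs.
repo_to_project = {"alice/tool": "P1", "bob/tool": "P1",
                   "carol/tool": "P1", "dave/lib": "P2"}
repo_forge = {"alice/tool": "github", "bob/tool": "gitlab",
              "carol/tool": "github", "dave/lib": "github"}
excluded = {"bob/tool", "carol/tool"}

print(cross_forge_projects(repo_to_project, repo_forge))
# {'P1': ['github', 'gitlab']}
print(drop_fork_children(sorted(repo_to_project), excluded))
# ['alice/tool', 'dave/lib']
```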
Topics
- Deforking
- Software Provenance
- World of Code
- Git Commit History
- Graph Clustering
- Cross-Forge Analysis
Best for: AI Scientist, Research Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.