Code Lifespan Survival Analysis (CLSA): Predicting the Survival of Source Code Lines Using AST-Aware Mining

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Software Development & Engineering, Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

The Code Lifespan Survival Analysis (CLSA) framework models the deletion risk of individual source code lines, a finer granularity than previous approaches. It analyzes 32.5 million line birth events from 120 active open-source TypeScript repositories. A 5-stage bipartite matching pipeline prevented 8.3 million false death classifications that would otherwise have biased the survival estimates. CLSA fits a Cox Proportional Hazards model with 15 statically computable covariates, including AST structure and line Shannon entropy.

Over half of all lines are never deleted during the observation period, and the lines that are deleted have a median lifespan of 95.7 days, which points to an early "stabilize-or-die" phase. Covariate effects vary over time and fall into three regimes. Line Shannon entropy is a strong protective factor, with a hazard ratio (HR) below 1 indicating lower deletion risk: HR = 0.84 for new code and 0.36 for mature code. Being inside a conditional branch, by contrast, becomes a risk factor after 90 days (HR = 1.21). Repository identity is the dominant predictive factor: adding a shared gamma frailty term (variance θ = 1.449) raises concordance from 0.586 to 0.666.
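
The summary does not say how line Shannon entropy is computed. As a minimal sketch, assuming character-level entropy over the whitespace-trimmed line (the paper may well use tokens or another alphabet), the covariate could look like this. The function name `line_shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def line_shannon_entropy(line: str) -> float:
    """Character-level Shannon entropy of one source line, in bits per character.

    Assumption: leading/trailing whitespace is ignored and each character is
    one symbol. The original paper may define the alphabet differently.
    """
    text = line.strip()
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = sum over symbols of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    for sample in ["}", "i++;", "const total = items.reduce((a, b) => a + b.price, 0);"]:
        print(f"{line_shannon_entropy(sample):.3f}  {sample}")
```

Under this definition a line made of one repeated character scores 0 bits, and a line of four distinct characters scores 2 bits, so short boilerplate lines sit at the low end of the scale.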

Key takeaway

Software engineers and AI scientists working on code quality and maintenance can integrate line-level survival predictions into their workflows. Statically computable features such as line Shannon entropy and AST context (e.g., "in_function", "ast_group_expression") help identify high-risk code. Possible applications include IDE plugins for live risk scoring and automated pull-request annotators. Risk models should always be calibrated with repository-specific context to avoid mis-prioritizing refactoring efforts.

Key insights

Predicting the deletion risk of individual source code lines is feasible with static features and survival analysis, and the analysis reveals that covariate effects vary over time.
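
Survival analysis is needed here because lines that are still alive at the end of observation are right-censored: their true lifespan is unknown, only a lower bound. The Kaplan-Meier estimator is one standard way to handle this. The summary does not state whether CLSA reports Kaplan-Meier curves alongside its Cox model, so the sketch below is only an illustration with made-up data and hypothetical function names.

```python
from collections import Counter


def kaplan_meier(durations, deleted):
    """Return [(t, S(t))] at each distinct deletion time.

    durations: observed lifetime of each line, in days.
    deleted:   True if the line was deleted at that time,
               False if it was still alive (right-censored).
    """
    deaths = Counter(t for t, d in zip(durations, deleted) if d)
    exits = Counter(durations)  # deletions and censorings leaving the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def median_survival(curve):
    """First time at which S(t) <= 0.5, or None if the curve never gets there."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical lifetimes: two deleted lines, two still alive when observation ended.
    curve = kaplan_meier([10, 20, 30, 40], [True, False, True, False])
    print(curve, median_survival(curve))
```

When more than half of all lines are never deleted, the overall survival curve may never reach 0.5, in which case `median_survival` returns `None`. That is consistent with the summary quoting a median lifespan only for the lines that were deleted.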

Method

A 5-stage bipartite matching pipeline filters false deletions out of the event data. Cox Proportional Hazards and shared gamma frailty models then estimate survival.
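
The five stages are not described in this summary. To illustrate the general idea only, the sketch below pairs lines removed by a commit with lines added by the same commit, so that a moved or re-indented line is not counted as a death. It assumes text similarity from Python's `difflib` as the edge weight and uses a greedy one-to-one assignment, which is a simple approximation and not an optimal bipartite matching. The function name, the 0.8 threshold, and the sample lines are all hypothetical.

```python
from difflib import SequenceMatcher


def match_deleted_to_added(deleted, added, threshold=0.8):
    """Greedily pair deleted lines with similar added lines.

    Returns (pairs, true_deaths):
      pairs       maps an index in `deleted` to its matched index in `added`
      true_deaths lists indices in `deleted` that found no match
    Assumption: similarity is difflib's ratio on whitespace-trimmed text.
    """
    candidates = []
    for i, old in enumerate(deleted):
        for j, new in enumerate(added):
            score = SequenceMatcher(None, old.strip(), new.strip()).ratio()
            if score >= threshold:
                candidates.append((score, i, j))

    # Best-scoring edges first; ties broken by index for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    pairs = {}
    used_added = set()
    for _score, i, j in candidates:
        if i not in pairs and j not in used_added:
            pairs[i] = j
            used_added.add(j)

    true_deaths = [i for i in range(len(deleted)) if i not in pairs]
    return pairs, true_deaths


if __name__ == "__main__":
    removed = ["    return total + tax", "console.log('debug')"]
    inserted = ["  return total + tax", "const rate = 0.2"]
    print(match_deleted_to_added(removed, inserted))
```

In this toy input the re-indented `return` line is matched to its new copy and survives, while the `console.log` line has no counterpart and is recorded as a genuine deletion.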

Best for: AI Scientist, Research Scientist, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.