Code Lifespan Survival Analysis (CLSA): Predicting the Survival of Source Code Lines Using AST-Aware Mining
Summary
The Code Lifespan Survival Analysis (CLSA) framework models the deletion risk of individual source code lines, a finer granularity than previous approaches. It analyzes 32.5 million line birth events from 120 active open-source TypeScript repositories. A 5-stage bipartite matching pipeline prevented 8.3 million false death classifications, which would otherwise have biased the survival estimates. CLSA uses a Cox Proportional Hazards model with 15 statically computable covariates, including AST structure and line Shannon entropy. Results show that over half of all lines are never deleted and that deleted lines have a median lifespan of 95.7 days, indicating an early "stabilize-or-die" phase. Covariate effects are time-varying and organize into a three-regime structure. Line Shannon entropy is a strong protective factor (HR = 0.84 for new code, 0.36 for mature code), while sitting inside a conditional branch becomes a risk factor after 90 days (HR = 1.21). Repository identity is the dominant predictive factor: adding a shared gamma frailty term (variance θ = 1.449) improves concordance from 0.586 to 0.666.
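To make the entropy covariate concrete, here is a minimal sketch of a per-line Shannon entropy. The summary does not say whether CLSA measures entropy over characters or tokens, so the character-level definition and the function name `line_shannon_entropy` are assumptions for illustration only.

```python
import math
from collections import Counter


def line_shannon_entropy(line: str) -> float:
    """Shannon entropy, in bits per character, of one source line.

    Assumption: entropy is taken over the characters of the line after
    stripping surrounding whitespace. The paper may define it differently.
    """
    text = line.strip()
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A line of closing braces carries less information than a typical statement.
low = line_shannon_entropy("}}}}")
high = line_shannon_entropy("const total = items.reduce((a, b) => a + b.price, 0);")
```

Under this definition, boilerplate lines with few distinct characters score low and information-dense statements score higher, which is the direction the reported protective effect relies on.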
Key takeaway
Software engineers and AI scientists working on code quality and maintenance can integrate line-level survival predictions into their workflows. Statically computable features such as line Shannon entropy and AST context (e.g., "in_function", "ast_group_expression") help identify high-risk code. Possible applications include IDE plugins for live risk scoring and automated pull-request annotators. Risk models should always be calibrated with repository-specific context to avoid mis-prioritizing refactoring efforts.
Key insights
Predicting the deletion risk of individual source code lines is feasible using static features and survival analysis, and the analysis shows that covariate effects vary over time.
Principles
- Code deletion risk varies significantly at the individual line level.
- Code exhibits a "stabilize-or-die" pattern: deletion risk is concentrated early in a line's life.
- Covariate effects on code survival are dynamic, following distinct time regimes.
Method
A 5-stage bipartite matching pipeline purifies deletion events, followed by Cox Proportional Hazards and shared gamma frailty models for survival estimation.
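For reference, the two models named above are usually written as follows. This is the standard textbook form of a Cox model with a shared gamma frailty, not a formula quoted from the paper. The indexing (line i in repository j) and the symbols x_{ij}, β, and z_j are notational assumptions.

```latex
% Hazard of line i in repository j, given the repository-level frailty z_j
h_{ij}(t \mid z_j) = z_j \, h_0(t) \, \exp\!\left(\beta^{\top} x_{ij}\right)

% Shared gamma frailty with mean 1 and variance \theta
z_j \sim \mathrm{Gamma}\!\left(\tfrac{1}{\theta}, \tfrac{1}{\theta}\right),
\qquad \mathbb{E}[z_j] = 1, \qquad \operatorname{Var}[z_j] = \theta

% Hazard ratio for a one-unit increase in covariate k
\mathrm{HR}_k = \exp(\beta_k)
```

In this form, HR below 1 (such as 0.84 or 0.36 for entropy) lowers the deletion hazard, HR above 1 (such as 1.21) raises it, and a larger θ means repositories differ more in their baseline deletion risk.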
In practice
- Implement IDE plugins for real-time code survival risk scoring.
- Calibrate risk tools using repository-specific deletion rates.
- Automate code review prioritization for high-hazard lines.
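The scoring and calibration ideas above can be sketched as a relative-hazard calculation. This is a hypothetical illustration, not the paper's tooling. The covariate names `line_entropy` and `in_conditional`, the per-unit interpretation of the hazard ratios, and the `repo_frailty` multiplier are all assumptions. Combining the mature-code entropy HR with the after-90-days conditional HR in a single regime is also an assumption.

```python
import math


def relative_hazard(covariates, hazard_ratios, repo_frailty=1.0):
    """Relative deletion hazard of a line versus a baseline line.

    covariates:    hypothetical feature values, e.g. {"line_entropy": 1.0}
    hazard_ratios: HR per one-unit increase of each feature (assumed scale)
    repo_frailty:  repository-specific multiplier (1.0 = average repository)
    """
    # Cox-style linear predictor: sum of x_k * log(HR_k).
    log_hazard = sum(
        value * math.log(hazard_ratios[name])
        for name, value in covariates.items()
    )
    return repo_frailty * math.exp(log_hazard)


# Hazard ratios quoted in the summary, grouped into one "mature code" regime
# purely for illustration.
mature_hrs = {"line_entropy": 0.36, "in_conditional": 1.21}
line = {"line_entropy": 1.0, "in_conditional": 1.0}

average_repo = relative_hazard(line, mature_hrs)                  # about 0.4356
churny_repo = relative_hazard(line, mature_hrs, repo_frailty=2.0)  # about 0.8712
```

The same line scores twice as risky in a repository whose estimated frailty is 2.0, which is why a review tool that ranks lines across repositories without this calibration could mis-prioritize.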
Topics
- Survival Analysis
- Code Evolution
- Abstract Syntax Tree
- TypeScript
- Code Quality
- Risk Prediction
- Software Maintenance
Best for: AI Scientist, Research Scientist, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.