Gradient Regularized Newton Boosting Trees with Global Convergence
Summary
This paper introduces Restricted Newton Descent to analyze the global convergence of Gradient Boosting Decision Trees (GBDTs), specifically focusing on vanilla Newton boosting and a novel gradient regularized Newton boosting scheme. GBDTs, including implementations like XGBoost and LightGBM, rely on Newton boosting, a second-order descent method, but its global convergence has been poorly understood. The authors prove that vanilla Newton boosting achieves a linear convergence rate for smooth, strongly convex losses under a Hessian-dominance condition. For general convex losses with Lipschitz Hessians, they extend a gradient regularized Newton (GRN) scheme, which minimally modifies the classical algorithm by adding an adaptive $\ell_{2}$-regularization term proportional to the square root of the gradient norm. This GRN scheme achieves a global $\mathcal{O}(\frac{1}{k^{2}})$ convergence rate, matching first-order boosting with Nesterov momentum, and demonstrates convergence in numerical experiments where vanilla Newton boosting may diverge, particularly with Charbonnier loss.
Key takeaway
For Research Scientists developing or deploying GBDT models, understanding the theoretical convergence guarantees is crucial. The Gradient Regularized Newton (GRN) scheme offers a robust alternative to vanilla Newton boosting, providing global $\mathcal{O}(\frac{1}{k^{2}})$ convergence for general convex losses. You should consider implementing GRN, which incurs negligible computational overhead, especially when working with loss functions prone to divergence or when seeking stronger theoretical guarantees for model stability and performance.
Key insights
A gradient-regularized Newton boosting scheme achieves global convergence for GBDTs, addressing prior theoretical gaps.
Principles
- Newton boosting can achieve linear convergence for Hessian-dominated losses.
- Adaptive $\ell_{2}$-regularization enables global convergence for general convex losses.
- Regularity of base loss functions translates to the boosting objective.
Method
The Gradient Regularized Newton (GRN) scheme modifies Newton boosting by adding an adaptive $\ell_{2}$-regularization term, $\lambda_{k}=\lambda_{\text{base}}+\sqrt{M\|g_{k}\|}$, to the objective function at each iteration.
In practice
- Implement GRN by adding an adaptive $\ell_{2}$-regularization term.
- Compute the adaptive term using the gradient norm and a Lipschitz constant.
- Increasing tree depth improves weak learner conditions (cosine angle, weak gradient edge).
Topics
- Gradient Boosting Decision Trees
- Newton Boosting
- Gradient Regularized Newton
- Global Convergence Theory
- Hilbert Space Optimization
Code references
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.