Gradient Regularized Newton Boosting Trees with Global Convergence

2026-05-04 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper introduces Restricted Newton Descent to analyze the global convergence of Gradient Boosting Decision Trees (GBDTs), specifically focusing on vanilla Newton boosting and a novel gradient regularized Newton boosting scheme. GBDTs, including implementations like XGBoost and LightGBM, rely on Newton boosting, a second-order descent method, but its global convergence has been poorly understood. The authors prove that vanilla Newton boosting achieves a linear convergence rate for smooth, strongly convex losses under a Hessian-dominance condition. For general convex losses with Lipschitz Hessians, they extend a gradient regularized Newton (GRN) scheme, which minimally modifies the classical algorithm by adding an adaptive $\ell_{2}$-regularization term proportional to the square root of the gradient norm. This GRN scheme achieves a global $\mathcal{O}(\frac{1}{k^{2}})$ convergence rate, matching first-order boosting with Nesterov momentum, and demonstrates convergence in numerical experiments where vanilla Newton boosting may diverge, particularly with Charbonnier loss.

Key takeaway

For Research Scientists developing or deploying GBDT models, understanding the theoretical convergence guarantees is crucial. The Gradient Regularized Newton (GRN) scheme offers a robust alternative to vanilla Newton boosting, providing global $\mathcal{O}(\frac{1}{k^{2}})$ convergence for general convex losses. You should consider implementing GRN, which incurs negligible computational overhead, especially when working with loss functions prone to divergence or when seeking stronger theoretical guarantees for model stability and performance.

Key insights

A gradient-regularized Newton boosting scheme achieves global convergence for GBDTs, addressing prior theoretical gaps.

Principles

Newton boosting can achieve linear convergence for Hessian-dominated losses.
Adaptive $\ell_{2}$-regularization enables global convergence for general convex losses.
Regularity of base loss functions translates to the boosting objective.

Method

The Gradient Regularized Newton (GRN) scheme modifies Newton boosting by adding an adaptive $\ell_{2}$-regularization term, $\lambda_{k}=\lambda_{\text{base}}+\sqrt{M\|g_{k}\|}$, to the objective function at each iteration.

In practice

Implement GRN by adding an adaptive $\ell_{2}$-regularization term.
Compute the adaptive term using the gradient norm and a Lipschitz constant.
Increasing tree depth improves weak learner conditions (cosine angle, weak gradient edge).

Topics

Gradient Boosting Decision Trees
Newton Boosting
Gradient Regularized Newton
Global Convergence Theory
Hilbert Space Optimization

Code references

jax-ml/jax

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.