AdaGrad-Diff: A New Version of the Adaptive Gradient Algorithm

2026-02-16 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

AdaGrad-Diff is an adaptive gradient algorithm that modifies AdaGrad by driving stepsize adaptation with the cumulative squared norms of successive gradient differences rather than the cumulative squared gradient norms. The aim is to avoid unnecessary stepsize reduction while gradients are stable and to damp stepsizes automatically when gradients fluctuate significantly, which often indicates curvature or instability. The paper establishes convergence rates of $\mathcal{O}(1/\sqrt{n})$ for $G$-Lipschitz continuous objectives and $\mathcal{O}(1/n)$ for $L$-smooth objectives (those with $L$-Lipschitz gradients), along with weak convergence of the iterates to a minimizer in the smooth case. Numerical experiments on five convex optimization problems, including Hinge Loss, LAD Regression, Logistic Regression, and SVM Classification, indicate that AdaGrad-Diff is more robust to the choice of the stepsize parameter $\eta$ than standard AdaGrad, reducing the need for extensive hyperparameter tuning.

Key takeaway

Research scientists working with gradient-based optimization should consider AdaGrad-Diff when stepsize tuning is a bottleneck. Its difference-based stepsize adaptation is more robust to the choice of $\eta$ than standard AdaGrad, giving more consistent performance across a wider range of settings. This can reduce the tuning effort in developing machine learning algorithms, particularly where an extensive hyperparameter search is impractical.

Key insights

AdaGrad-Diff improves optimization robustness by adapting stepsizes based on gradient differences, not just gradient magnitudes.

Principles

Stable gradients allow larger stepsizes.
Fluctuating gradients require stepsize damping.
Difference-based adaptation enhances hyperparameter robustness.
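
The first two principles can be checked on toy numbers. The sketch below is illustrative only: it uses scalar gradients, assumes the convention $g^{0}=0$ (not stated in this summary), and the helper names `adagrad_weight` and `adagrad_diff_weight` are hypothetical. It compares the classic AdaGrad accumulator with the difference-based one on a constant gradient sequence and on a sign-flipping one.

```python
import numpy as np

def adagrad_weight(grads, eps=1e-8):
    """Classic AdaGrad-style weight: eps + sqrt(sum of squared gradients)."""
    g = np.asarray(grads, dtype=float)
    return eps + np.sqrt(np.sum(g ** 2))

def adagrad_diff_weight(grads, eps=1e-8, g0=0.0):
    """Difference-based weight: eps + sqrt(sum of squared successive differences).

    g0 is the gradient "before" the first iterate; g0 = 0 is an assumption.
    """
    g = np.asarray(grads, dtype=float)
    diffs = np.diff(g, prepend=g0)  # g^1 - g^0, g^2 - g^1, ...
    return eps + np.sqrt(np.sum(diffs ** 2))

# Two toy scalar gradient sequences of length 100.
stable = np.ones(100)                                        # gradient never changes
fluctuating = np.array([(-1.0) ** k for k in range(100)])    # sign flips every step

for name, seq in [("stable", stable), ("fluctuating", fluctuating)]:
    print(name, adagrad_weight(seq), adagrad_diff_weight(seq))
```

Both sequences have the same gradient magnitudes, so the classic weight is about 10 in either case. The difference-based weight stays near 1 for the stable sequence (larger effective stepsize) and grows to about 19.9 for the fluctuating one (stronger damping).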

Method

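AdaGrad-Diff scales stepsizes by the square root of the cumulative squared norms of successive gradient differences, $w_{i}^{n}:=\varepsilon+\sqrt{\sum_{k=1}^{n}\lVert{g_{i}^{k}-g_{i}^{k-1}}\rVert^{2}}$, within a proximal gradient framework. Here the superscript counts iterations, the subscript $i$ indexes the coordinate (or block) being scaled, and $\varepsilon$ is the constant offset in the definition; a larger $w_{i}^{n}$ means a smaller effective stepsize.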
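
A minimal sketch of how the weight $w_{i}^{n}$ could sit inside a proximal gradient loop follows. It is not the authors' implementation. It assumes per-coordinate blocks (each $i$ is a single coordinate), the convention $g^{0}=0$, an update of the form "gradient step with stepsize $\eta/w_{i}^{n}$, then a prox with the same per-coordinate stepsize", and an $\ell_1$ penalty as the non-smooth term. The names `adagrad_diff` and `soft_threshold` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise proximal map of t * |.| (t may be a per-coordinate array)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def adagrad_diff(grad, x0, eta=1.0, eps=1e-8, n_iters=200, prox=None):
    """Sketch of a difference-driven AdaGrad-style proximal gradient loop."""
    x = np.asarray(x0, dtype=float).copy()
    g_prev = np.zeros_like(x)  # assumption: g^0 = 0
    acc = np.zeros_like(x)     # running sum of squared gradient differences
    for _ in range(n_iters):
        g = grad(x)
        acc += (g - g_prev) ** 2             # accumulate |g_i^k - g_i^{k-1}|^2
        step = eta / (eps + np.sqrt(acc))    # per-coordinate stepsize eta / w_i^n
        x = x - step * g                     # gradient step on the smooth part
        if prox is not None:
            x = prox(x, step)                # prox step on the non-smooth part
        g_prev = g
    return x

# Toy problem: smooth part f(x) = 0.5 * ||x - b||^2, so grad f(x) = x - b.
b = np.array([1.0, -2.0, 3.0])
grad = lambda x: x - b

# Smooth case: the minimizer is b itself.
x_smooth = adagrad_diff(grad, np.zeros(3))

# Composite case: add lam * ||x||_1; the minimizer is soft_threshold(b, lam).
lam = 0.5
x_l1 = adagrad_diff(grad, np.zeros(3),
                    prox=lambda v, step: soft_threshold(v, lam * step))
```

In the composite case the gradient of the smooth part does not vanish at the solution, yet its successive differences do, so the accumulator stays bounded and the stepsizes do not decay toward zero.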

In practice

Use AdaGrad-Diff for improved stepsize robustness.
Apply to convex optimization problems with smooth or non-smooth losses.
Consider for tasks like Logistic Regression or SVM Classification.

Topics

Adaptive Gradient Algorithms
AdaGrad-Diff
Stepsize Adaptation
Convergence Analysis
Hyperparameter Robustness

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.