Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

Singularity-aware Adam (S-Adam) is a novel optimizer designed to address the challenges of non-smooth loss landscapes prevalent in modern deep learning architectures, which often feature components like ReLU activations and quantization operators. These non-smooth conditions cause adaptive optimizers such as Adam to suffer from gradient chattering, leading to poor convergence and suboptimal generalization. S-Adam stabilizes training by dynamically modulating step sizes based on a Local Geometric Instability (LGI) metric, an efficient estimator of the Clarke subdifferential diameter derived from randomized directional derivatives. It incorporates an adaptive damping mechanism, exp(-λρ), to decelerate updates in high-instability regions while maintaining fast convergence in smooth areas. Rigorous analysis proves S-Adam converges almost surely to (δ,ε)-Clarke stationary points at an optimal O(1/√T) rate. Empirical tests on Quantization-Aware Training (QAT) and high-noise small-batch learning show S-Adam outperforms AdamW and Prox-SGD, achieving accuracy gains up to 6% on CIFAR-100 and 3% on TinyImageNet.

Key takeaway

For machine learning engineers struggling with convergence in models that contain non-smooth components, S-Adam is designed to mitigate gradient chattering and improve generalization. Consider it especially for Quantization-Aware Training or small-batch training, where the reported accuracy gains over AdamW and Prox-SGD reach up to 6% on CIFAR-100 and 3% on TinyImageNet.

Key insights

S-Adam stabilizes non-smooth deep learning optimization by dynamically adjusting step sizes based on local geometric instability, preventing gradient chattering.
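
As a worked illustration of the damping, the following sketch assumes that the factor exp(-λρ) multiplies Adam's base learning rate η; the summary does not say exactly where the factor enters the update. Here ρ_t is the LGI value at step t, λ > 0 is a damping coefficient, and \hat m_t, \hat v_t are Adam's bias-corrected moment estimates. The symbol \epsilon_{\mathrm{num}} is a hypothetical name for Adam's numerical-stability constant, chosen to avoid a clash with the ε in (δ,ε)-stationarity.

```latex
\eta_t = \eta \, e^{-\lambda \rho_t},
\qquad
\theta_{t+1} = \theta_t - \eta_t \, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon_{\mathrm{num}}}
```

Under this assumption with λ = 1, a smooth region with ρ_t = 0 leaves the step size at η, while a region with ρ_t = 2 shrinks it to e^{-2} η ≈ 0.135 η.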

Method

S-Adam computes a Local Geometric Instability (LGI) metric via randomized directional derivatives to estimate Clarke subdifferential diameter, then applies an adaptive damping mechanism exp(-λρ) to modulate step sizes.
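
The two steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names `lgi_estimate`, `init_state`, and `damped_adam_step` are hypothetical, the LGI proxy (the largest gap between forward and backward one-sided difference quotients over random unit directions) is one plausible way to probe the width of the subdifferential, and the damping factor is assumed to scale the base learning rate.

```python
import numpy as np


def lgi_estimate(f, x, num_probes=8, h=1e-3, seed=0):
    """Hypothetical LGI proxy (an assumption, not the paper's estimator).

    For each random unit direction u, compare the forward and backward
    one-sided difference quotients of f at x. They agree up to O(h) where
    f is smooth and differ by the directional width of the subdifferential
    at a kink. The maximum gap over the probes is returned.
    """
    rng = np.random.default_rng(seed)
    fx = f(x)
    rho = 0.0
    for _ in range(num_probes):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)  # unit-length probe direction
        forward = (f(x + h * u) - fx) / h
        backward = (fx - f(x - h * u)) / h
        rho = max(rho, abs(forward - backward))
    return rho


def init_state(x):
    """Fresh Adam moment buffers for a parameter vector x."""
    return {"t": 0, "m": np.zeros_like(x), "v": np.zeros_like(x)}


def damped_adam_step(x, grad, state, rho, lr=1e-2, lam=1.0,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step whose learning rate is scaled by exp(-lam * rho).

    Placing the damping on the learning rate is an assumption made for
    this sketch.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    step = lr * np.exp(-lam * rho)  # smaller step where rho is large
    return x - step * m_hat / (np.sqrt(v_hat) + eps)


# Usage: probe a smooth point and a kink, then take a damped step.
rho_smooth = lgi_estimate(lambda z: z @ z, np.array([1.0, -2.0]))
rho_kink = lgi_estimate(lambda z: np.abs(z).sum(), np.zeros(2))
x0 = np.array([1.0, -2.0])
x1 = damped_adam_step(x0, np.array([0.5, -0.5]), init_state(x0), rho_kink)
```

In this sketch the quadratic yields a gap on the order of `h`, so the step is essentially undamped, while the L1 norm probed at the origin yields a gap of at least 2 and the step shrinks accordingly. Each probe costs two extra function evaluations, so `num_probes` trades estimate quality against overhead.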

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.