Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Summary
Singularity-aware Adam (S-Adam) is a novel optimizer designed to address the challenges of non-smooth loss landscapes prevalent in modern deep learning architectures, which often feature components like ReLU activations and quantization operators. These non-smooth conditions cause adaptive optimizers such as Adam to suffer from gradient chattering, leading to poor convergence and suboptimal generalization. S-Adam stabilizes training by dynamically modulating step sizes based on a Local Geometric Instability (LGI) metric, an efficient estimator of the Clarke subdifferential diameter derived from randomized directional derivatives. It incorporates an adaptive damping mechanism, exp(-λρ), to decelerate updates in high-instability regions while maintaining fast convergence in smooth areas. Rigorous analysis proves S-Adam converges almost surely to (δ,ε)-Clarke stationary points at an optimal O(1/√T) rate. Empirical tests on Quantization-Aware Training (QAT) and high-noise small-batch learning show S-Adam outperforms AdamW and Prox-SGD, achieving accuracy gains up to 6% on CIFAR-100 and 3% on TinyImageNet.
Key takeaway
For machine learning engineers struggling with convergence in models that contain non-smooth components, S-Adam offers a way to mitigate gradient chattering and improve generalization. Consider it especially for Quantization-Aware Training and small-batch training, the two settings in which the reported accuracy gains (up to 6% on CIFAR-100 and 3% on TinyImageNet) were measured.
Key insights
S-Adam stabilizes non-smooth deep learning optimization by dynamically adjusting step sizes based on local geometric instability, preventing gradient chattering.
Principles
- Non-smooth loss landscapes cause gradient chattering.
- Local geometric instability guides step size modulation.
- Clarke subdifferential diameter estimates local instability.
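For readers who have not met the term, the standard textbook definitions behind the last principle can be written out as follows. This is general background on locally Lipschitz functions, not notation taken from the paper; the symbols f, x, and g are chosen here for illustration.

```latex
% Clarke subdifferential of a locally Lipschitz f at x:
% convex hull of all limits of gradients taken at nearby differentiable points.
\partial f(x) = \operatorname{conv}\Bigl\{ \lim_{k\to\infty} \nabla f(x_k) \;:\;
    x_k \to x,\ f \text{ differentiable at } x_k \Bigr\}

% Its diameter: the largest distance between two generalized gradients at x.
\operatorname{diam}\,\partial f(x) = \sup_{g_1,\, g_2 \in \partial f(x)} \lVert g_1 - g_2 \rVert

% Worked example, f(x) = |x| on the real line:
%   x \neq 0:  \partial f(x) = \{\operatorname{sign}(x)\},  diameter 0  (smooth point)
%   x = 0:     \partial f(0) = [-1, 1],                     diameter 2  (kink)
```

A diameter of zero means the gradient is locally well defined. A large diameter means nearby gradients disagree strongly, which is the situation in which an optimizer's updates can flip direction from step to step.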
Method
S-Adam computes a Local Geometric Instability (LGI) metric via randomized directional derivatives to estimate Clarke subdifferential diameter, then applies an adaptive damping mechanism exp(-λρ) to modulate step sizes.
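The two steps described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the summary does not give the exact LGI formula, so `lgi_probe` uses a symmetric finite-difference proxy (the sum of the one-sided directional derivatives along u and -u, maximized over random unit directions), and `damped_adam_step` assumes that ρ in exp(-λρ) is that LGI value and that the factor simply scales a standard Adam update. All function names, parameters, and defaults are hypothetical.

```python
import math
import random

def lgi_probe(f, x, num_probes=8, h=1e-3, seed=0):
    """Hypothetical LGI proxy (an assumption, not the paper's formula).

    For each random unit direction u it adds the finite-difference
    directional derivatives along u and -u. The sum is close to 0 where f
    is smooth and close to the width of the subdifferential at a kink.
    """
    rng = random.Random(seed)
    fx = f(x)
    worst = 0.0
    for _ in range(num_probes):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(ui * ui for ui in u))
        u = [ui / norm for ui in u]
        plus = f([xi + h * ui for xi, ui in zip(x, u)])
        minus = f([xi - h * ui for xi, ui in zip(x, u)])
        worst = max(worst, abs(plus + minus - 2.0 * fx) / h)
    return worst

def new_state(dim):
    """Fresh Adam moment buffers for a parameter vector of length dim."""
    return {"t": 0, "m": [0.0] * dim, "v": [0.0] * dim}

def damped_adam_step(x, grad, state, rho, lr=1e-2, lam=1.0,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update whose step is scaled by exp(-lam * rho).

    Assumption: rho is the instability estimate and the damping multiplies
    the whole step; the paper may apply it differently.
    """
    state["t"] += 1
    t = state["t"]
    scale = math.exp(-lam * rho)  # near 1 in smooth regions, small at kinks
    new_x = []
    for i, (xi, gi) in enumerate(zip(x, grad)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * gi
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * gi * gi
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected moments
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_x.append(xi - lr * scale * m_hat / (math.sqrt(v_hat) + eps))
    return new_x

# Two toy objectives: the L1 norm (kink at the origin) and a smooth quadratic.
l1 = lambda x: sum(abs(xi) for xi in x)
sq = lambda x: 0.5 * sum(xi * xi for xi in x)

rho_kink = lgi_probe(l1, [0.0, 0.0, 0.0])    # large: probed at the kink
rho_smooth = lgi_probe(sq, [1.0, 1.0, 1.0])  # small: on the order of h

x_damped = damped_adam_step([1.0], [1.0], new_state(1), rho=rho_kink)
x_plain = damped_adam_step([1.0], [1.0], new_state(1), rho=0.0)
```

With these toy objectives the probe returns roughly 2 or more at the kink of the L1 norm and a value on the order of `h` on the quadratic, so the damped update moves much less than the undamped one. The proxy costs two extra function evaluations per probe, plus one for the base point. A practical implementation would need to amortize that cost, and the summary does not say how S-Adam does so.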
In practice
- Apply S-Adam for Quantization-Aware Training.
- Use S-Adam in high-noise small-batch learning.
- Mitigate gradient oscillations in non-smooth models.
Topics
- Deep Learning Optimization
- Non-smooth Optimization
- Adaptive Optimizers
- Quantization-Aware Training
- Gradient Chattering
- S-Adam
Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.