Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
Summary
A new family of $C^{2N}$-smooth activation functions, called Geometric Monomial (GEM), is proposed to address the limitations of ReLU in deep neural networks. These functions utilize a log-logistic CDF gate and purely rational arithmetic, aiming for ReLU-like performance with enhanced smoothness for gradient-based optimization. The family includes three variants: GEM (base), E-GEM (an $\varepsilon$-parameterized generalization for arbitrary $L^p$-approximation of ReLU), and SE-GEM (a piecewise variant preventing dead neurons with $C^{2N}$ junction smoothness). An $N$-ablation study found $N=1$ optimal for standard-depth networks, reducing the GELU deficit on CIFAR-100 + ResNet-56 from 6.10% to 2.12%. The study also revealed an $N=1$ preference for deep CNNs and $N=2$ for transformers. E-GEM achieved 99.23% on MNIST, while SE-GEM ($\varepsilon=10^{-4}$) surpassed GELU on CIFAR-10 + ResNet-56 (92.51% vs 92.44%). On GPT-2 (124M), GEM achieved the lowest perplexity (72.57 vs 73.76 for GELU), and E-GEM ($\varepsilon=10$) achieved the best validation loss (6.656) on BERT-small.
Key takeaway
For AI engineers developing or optimizing deep neural networks, the GEM family is worth evaluating as an alternative to GELU. In the reported benchmarks, SE-GEM ($\varepsilon=10^{-4}$) narrowly exceeded GELU on CIFAR-10 + ResNet-56 (92.51% vs 92.44%) and base GEM reached lower perplexity on GPT-2 (124M) (72.57 vs 73.76), while offering $C^{2N}$-smoothness intended to help gradient-based optimization. Start with $N=1$ for CNNs and $N=2$ for transformers, and experiment with the $\varepsilon$-parameterization to tune performance for your specific model architecture and depth.
Key insights
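GEM activation functions combine $C^{2N}$-smoothness with purely rational arithmetic and matched or narrowly beat GELU on several reported benchmarks (CIFAR-10 + ResNet-56, GPT-2 124M), though a 2.12% deficit to GELU remained on CIFAR-100 + ResNet-56.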
Principles
- Smoothness aids deep network optimization.
- Optimal activation parameters are architecture-dependent.
Method
The GEM family uses a log-logistic CDF gate for $C^{2N}$-smoothness, with variants like E-GEM for $L^p$-approximation and SE-GEM for dead neuron elimination, all relying on rational arithmetic.
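This summary does not reproduce the paper's formulas, so the following is only a minimal sketch of what a log-logistic gate could look like, under stated assumptions: unit scale, shape parameter $2N$, and the gate multiplied by the input for positive inputs and set to zero otherwise. The function names `log_logistic_cdf` and `gem_sketch` are hypothetical. Under these assumptions the activation is a ratio of polynomials, behaves like $x^{2N+1}$ just above zero (so derivatives up to order $2N$ vanish at the junction), and approaches the identity for large positive inputs.

```python
def log_logistic_cdf(x: float, shape: int) -> float:
    """CDF of a unit-scale log-logistic distribution: x^k / (1 + x^k) for x > 0, else 0."""
    if x <= 0.0:
        return 0.0
    xk = x ** shape
    return xk / (1.0 + xk)


def gem_sketch(x: float, n: int = 1) -> float:
    """Assumed GEM-style activation: x times a log-logistic gate of shape 2N.

    This is an illustrative guess, not the paper's verified definition.
    Only rational arithmetic is used (no exp, tanh, or erf).
    """
    return x * log_logistic_cdf(x, 2 * n)


if __name__ == "__main__":
    # Negative inputs are zeroed; large positive inputs pass through almost unchanged.
    for x in (-1.0, 0.5, 1.0, 4.0):
        print(x, gem_sketch(x, n=1), gem_sketch(x, n=2))
```

The E-GEM and SE-GEM variants are not sketched because the summary does not state how $\varepsilon$ enters their definitions.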
In practice
- Use GEM $N=1$ for deep CNNs.
- Use GEM $N=2$ for transformers.
- Adjust $\varepsilon$ for E-GEM based on network depth.
Topics
- Geometric Monomial
- $C^{2N}$-smooth Activation Functions
- Deep Neural Networks
- CNN-Transformer Performance
- GPT-2 Language Model
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.