Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

2026-04-23 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The paper proposes Geometric Monomial (GEM), a family of $C^{2N}$-smooth activation functions intended to address the limitations of ReLU in deep neural networks. The functions use a log-logistic CDF gate and purely rational arithmetic, aiming for ReLU-like performance with added smoothness for gradient-based optimization. The family has three variants: GEM (the base function), E-GEM (an $\varepsilon$-parameterized generalization that allows arbitrary $L^p$-approximation of ReLU), and SE-GEM (a piecewise variant that prevents dead neurons while keeping $C^{2N}$ smoothness at the junction). An $N$-ablation study found $N=1$ optimal for standard-depth networks, reducing the GELU deficit on CIFAR-100 + ResNet-56 from 6.10% to 2.12%. The study also found that deep CNNs prefer $N=1$ while transformers prefer $N=2$. E-GEM achieved 99.23% on MNIST, and SE-GEM ($\varepsilon=10^{-4}$) surpassed GELU on CIFAR-10 + ResNet-56 (92.51% vs 92.44%). On GPT-2 (124M), GEM achieved the lowest perplexity (72.57 vs 73.76 for GELU), and on BERT-small, E-GEM ($\varepsilon=10$) achieved the best validation loss (6.656).

Key takeaway

For AI engineers developing or optimizing deep neural networks, the GEM family is worth evaluating against GELU. In the reported benchmarks, SE-GEM ($\varepsilon=10^{-4}$) narrowly beat GELU on CIFAR-10 + ResNet-56 (92.51% vs 92.44%), base GEM achieved lower perplexity on GPT-2 (124M) (72.57 vs 73.76), and E-GEM ($\varepsilon=10$) gave the best validation loss on BERT-small. A 2.12% deficit to GELU remained on CIFAR-100 + ResNet-56 even at the best setting, $N=1$. All variants offer $C^{2N}$-smoothness, which can help gradient-based optimization. Start with $N=1$ for deep CNNs and $N=2$ for transformers, and tune $\varepsilon$ for your specific architecture and depth.

Key insights

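GEM activation functions combine $C^{2N}$-smoothness with purely rational arithmetic and match or narrowly beat GELU on several reported benchmarks (CIFAR-10 + ResNet-56, GPT-2 124M, BERT-small), though a 2.12% deficit to GELU remains on CIFAR-100 + ResNet-56.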

Principles

Smoothness aids deep network optimization.
Optimal activation parameters are architecture-dependent.

Method

The GEM family uses a log-logistic CDF gate for $C^{2N}$-smoothness, with variants like E-GEM for $L^p$-approximation and SE-GEM for dead neuron elimination, all relying on rational arithmetic.
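
To make the gating idea concrete, here is a minimal Python sketch. It relies on one known fact and one assumption. The fact: the log-logistic CDF with unit scale and shape $\beta$ is $F(x) = x^{\beta} / (1 + x^{\beta})$ for $x > 0$ and $0$ otherwise. The assumption, not confirmed by this summary: the base activation multiplies the input by that gate with $\beta = 2N$, which gives $x^{2N+1} / (1 + x^{2N})$ for $x > 0$ and $0$ elsewhere. The names `log_logistic_gate` and `gem_sketch` are hypothetical, and the exact definitions of GEM, E-GEM, and SE-GEM in the paper may differ.

```python
def log_logistic_gate(x: float, shape: int) -> float:
    """Log-logistic CDF with unit scale: x^shape / (1 + x^shape) for x > 0, else 0.

    Uses only multiplication, addition, and division (rational arithmetic).
    """
    if x <= 0.0:
        return 0.0
    t = x ** shape  # integer power, no exp/log/erf needed
    return t / (1.0 + t)


def gem_sketch(x: float, n: int = 1) -> float:
    """ASSUMED GEM-style activation: x times a log-logistic gate of shape 2N.

    For x > 0 this equals x^(2N+1) / (1 + x^(2N)). Its first 2N derivatives
    vanish at 0 from the right, so it joins the zero branch with C^{2N}
    smoothness. This is an illustration, not the paper's verified formula.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return x * log_logistic_gate(x, shape=2 * n)


if __name__ == "__main__":
    for x in (-2.0, 0.0, 0.5, 1.0, 2.0, 10.0):
        print(f"x={x:5.1f}  N=1: {gem_sketch(x, 1):.4f}  N=2: {gem_sketch(x, 2):.4f}")
```

Under this assumption the function is exactly zero for non-positive inputs and approaches the identity for large positive inputs, which is the ReLU-like behaviour the summary describes. A larger $N$ makes the transition near zero flatter and the approach to the identity sharper.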

In practice

Use GEM $N=1$ for deep CNNs.
Use GEM $N=2$ for transformers.
Adjust $\varepsilon$ for E-GEM based on network depth.

Topics

Geometric Monomial
C2N-smooth Activation Functions
Deep Neural Networks
CNN-Transformer Performance
GPT-2 Language Model

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.