The 60-Year Hunt for AI's Most Important Function
Summary
The feed-forward network (FFN) layer in a transformer depends on its activation function: without a nonlinearity between them, stacked linear layers collapse into a single linear map, so the activation largely determines the network's expressive power and how signals propagate. Early activations such as the hard threshold of the perceptron (1958) are non-differentiable at the jump and have zero gradient everywhere else, which rules out gradient-based learning. The sigmoid function (1986) is differentiable and squashes inputs into (0, 1), but it suffers from two problems: the "zigzag problem", where all-positive outputs force the gradients of a neuron's incoming weights to share the same sign, and the "vanishing gradient problem", where gradients shrink toward zero as they are propagated back through many saturating layers. The hyperbolic tangent (tanh) fixes the zigzag problem by centering outputs at zero, but it still saturates, so gradients still vanish. The Rectified Linear Unit (ReLU) mitigates vanishing gradients and is cheap to compute, but it introduces the "dying ReLU problem": a neuron whose pre-activation stays negative outputs zero, receives zero gradient, and stops learning. Leaky ReLU and Parametric ReLU (PReLU) address this with a small fixed slope or a learnable slope, respectively, for negative inputs. Newer activations such as Swish (SiLU) and the Gaussian Error Linear Unit (GELU) add smoothness and a gating principle; GELU was widely adopted in models such as BERT and GPT. The Gated Linear Unit (GLU) family generalizes gating, and its SwiGLU variant is now prevalent in state-of-the-art large language models. Simplified gating, such as Squared ReLU (which can be written as x · ReLU(x)), is reported to offer comparable performance at lower computational cost.
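As a rough illustration of the functions named in the summary, the scalar sketch below writes out sigmoid, tanh, ReLU, Swish (SiLU), and GELU using only Python's standard library. It is not taken from the original article. The function names are chosen here for illustration, Swish is assumed to use a fixed slope parameter of 1 (the SiLU form), and GELU is assumed to be the exact erf-based form rather than the tanh approximation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, squashing any real input into (0, 1)."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    """Hyperbolic tangent: zero-centered, output in (-1, 1)."""
    return math.tanh(x)


def relu(x: float) -> float:
    """Rectified Linear Unit: identity for positive inputs, zero otherwise."""
    return max(0.0, x)


def swish(x: float) -> float:
    """Swish / SiLU (assumed beta = 1): the input gated by its own sigmoid."""
    return x * sigmoid(x)


def gelu(x: float) -> float:
    """GELU (assumed exact form): the input scaled by the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  tanh={tanh(x):+.4f}  "
              f"relu={relu(x):.4f}  swish={swish(x):+.4f}  gelu={gelu(x):+.4f}")
```

Running the loop shows the contrast the summary describes: sigmoid outputs are always positive, tanh outputs are centered at zero, and Swish and GELU dip slightly below zero for small negative inputs instead of cutting off abruptly as ReLU does.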
Key takeaway
For AI Scientists and Machine Learning Engineers designing or optimizing deep neural networks, understanding the evolution and properties of activation functions is critical. Your choice directly impacts training efficiency and model performance. Prioritize modern, smooth, and gated activations like SwiGLU or GELU for large language models, or explore Squared ReLU for its efficiency and competitive performance, ensuring your architecture avoids issues like vanishing gradients or dying neurons.
Key insights
Activation functions are critical for neural network expressiveness, evolving from simple thresholds to complex gated, smooth functions.
Principles
- Differentiability is essential for gradient-based learning.
- Zero-centered outputs improve gradient flow and optimization.
- Non-saturating activations mitigate vanishing gradients, and smooth ones avoid the abrupt gradient change at zero.
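The vanishing-gradient principle can be made concrete with a toy calculation. The sketch below is an illustration under strong simplifying assumptions, not the article's own analysis: it treats the backpropagated gradient through a stack of layers as a product of identical local derivatives and ignores weight matrices entirely. The helper name `chained_gradient` is hypothetical.

```python
import math
from typing import Callable


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid; its maximum value is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    """Derivative of tanh; equals 1 at x = 0 and decays toward 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x: float) -> float:
    """Derivative of ReLU; exactly 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_gradient(grad_fn: Callable[[float], float],
                     pre_activation: float, depth: int) -> float:
    """Toy model: product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(pre_activation) ** depth


if __name__ == "__main__":
    for depth in (1, 5, 10, 20):
        print(f"depth={depth:2d}  "
              f"sigmoid@0={chained_gradient(sigmoid_grad, 0.0, depth):.3e}  "
              f"tanh@2={chained_gradient(tanh_grad, 2.0, depth):.3e}  "
              f"relu@1={chained_gradient(relu_grad, 1.0, depth):.3e}")
```

Even in sigmoid's best case (a pre-activation of exactly 0), ten layers scale the gradient by 0.25^10, just under one millionth, while ReLU passes it through unchanged for positive inputs. Real networks also multiply by weight matrices at each layer, so this only isolates the activation's contribution.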
Method
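The Gated Linear Unit (GLU) framework computes two linear projections of the same input, a content vector and a gate vector, passes the gate through an activation function (sigmoid in the original GLU, Swish in SwiGLU), and multiplies the two element-wise. This gives the network flexible, learned control over signal propagation.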
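A minimal pure-Python sketch of a SwiGLU-style feed-forward block may help make the gating step concrete. It assumes bias-free projections and uses the hypothetical weight names `w_gate`, `w_up`, and `w_down`, none of which come from the original article. Production implementations would use a tensor library rather than nested lists.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # each inner list is one row (one output dimension)


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x: float) -> float:
    """Swish / SiLU with an assumed beta of 1."""
    return x * sigmoid(x)


def matvec(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one dot product per row of w."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def swiglu_ffn(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """SwiGLU-style FFN sketch: down(swish(gate(x)) * up(x)), with no biases."""
    gate = [swish(g) for g in matvec(w_gate, x)]     # gate vector after activation
    content = matvec(w_up, x)                        # content vector (linear only)
    hidden = [g * c for g, c in zip(gate, content)]  # element-wise gating
    return matvec(w_down, hidden)                    # project back to model width


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(swiglu_ffn([1.0, -1.0], identity, identity, identity))
```

Swapping `swish` for `sigmoid` in the gate line gives the original GLU form. The gate does not simply pass or block the content: each content value is scaled by a learned, input-dependent factor.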
In practice
- Use SwiGLU for state-of-the-art LLMs.
- Consider Squared ReLU for efficiency in large models.
- Use Leaky ReLU or PReLU when dying ReLU is a concern.
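To illustrate the ReLU-family recommendations above, the scalar sketch below compares gradients for a negative pre-activation. It is an assumption-labeled illustration rather than the article's code: `alpha` stands for the negative-side slope (fixed in Leaky ReLU, learned in PReLU), and the function names are chosen here for clarity.

```python
def relu_grad(x: float) -> float:
    """ReLU gradient: zero for non-positive inputs, which is how a neuron can 'die'."""
    return 1.0 if x > 0 else 0.0


def prelu(x: float, alpha: float) -> float:
    """PReLU forward pass; with a fixed small alpha this is Leaky ReLU."""
    return x if x > 0 else alpha * x


def prelu_grad_x(x: float, alpha: float) -> float:
    """Gradient with respect to the input: alpha (not zero) on the negative side."""
    return 1.0 if x > 0 else alpha


def prelu_grad_alpha(x: float) -> float:
    """Gradient with respect to alpha: nonzero only for negative inputs."""
    return 0.0 if x > 0 else x


def squared_relu(x: float) -> float:
    """Squared ReLU: max(0, x) squared, equivalently x * relu(x)."""
    return max(0.0, x) ** 2


def squared_relu_grad(x: float) -> float:
    """Squared ReLU gradient: 2 * max(0, x), still zero for negative inputs."""
    return 2.0 * max(0.0, x)


if __name__ == "__main__":
    x, alpha = -2.0, 0.25
    print("relu grad:        ", relu_grad(x))
    print("prelu grad (x):   ", prelu_grad_x(x, alpha))
    print("prelu grad (alpha):", prelu_grad_alpha(x))
    print("squared relu grad:", squared_relu_grad(x))
```

For a negative input, only the Leaky ReLU and PReLU variants keep a nonzero gradient. Squared ReLU, like plain ReLU, passes no gradient there, so its appeal is efficiency rather than a fix for dying neurons.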
Topics
- Activation Functions
- Feed Forward Networks
- Vanishing Gradient Problem
- Gated Linear Units
- SwiGLU
Best for: AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Jia-Bin Huang.