The 60-Year Hunt for AI's Most Important Function

2026-05-15 · Source: Jia-Bin Huang · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Software Development & Engineering · Depth: Advanced, long

Summary

The feed-forward network (FFN) layer in a transformer relies on an activation function, the nonlinearity that gives a neural network its expressive power by determining which signals propagate to the next layer. The earliest activation, the hard threshold of the perceptron (1958), was non-differentiable, which hindered gradient-based learning. The sigmoid function (1986) introduced differentiability and squashes its output into the range (0, 1), but it suffers from two problems: the "zigzag problem", in which all-positive outputs constrain the directions the gradient can take, and the "vanishing gradient problem", in which gradients become negligibly small in deep networks. The hyperbolic tangent (Tanh) addressed the zigzag problem by centering outputs at zero but still saturates, so gradients continue to vanish. The Rectified Linear Unit (ReLU) mitigated vanishing gradients and is cheap to compute, but it introduced the "dying ReLU problem", in which a neuron whose input stays negative receives zero gradient and stops learning. Leaky ReLU and Parametric ReLU (PReLU) addressed dying ReLUs by giving negative inputs a small fixed slope or a learnable slope, respectively. Next-generation activations such as Swish (SiLU) and the Gaussian Error Linear Unit (GELU) introduced smoothness and a gating principle, with GELU widely adopted in models like BERT and GPT. The Gated Linear Unit (GLU) family generalized this idea further, and its SwiGLU variant is now prevalent in state-of-the-art large language models. Simplified gating, such as Squared ReLU, offers comparable performance with increased efficiency.

Key takeaway

For AI scientists and machine learning engineers designing or optimizing deep neural networks, understanding the evolution and properties of activation functions is critical, because the choice directly affects training efficiency and model performance. For large language models, prioritize modern, smooth, gated activations such as SwiGLU or GELU, or explore Squared ReLU for its efficiency and competitive performance. Whatever you choose, check that the architecture avoids vanishing gradients and dying neurons.

Key insights

Activation functions are critical to a neural network's expressiveness, and they have evolved from simple thresholds to smooth, gated functions.

Principles

Differentiability is essential for gradient-based learning.
Zero-centered outputs improve gradient flow and optimization.
Smooth, non-saturating activations mitigate vanishing gradients.
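
To make the vanishing-gradient principle concrete, here is a minimal sketch in plain Python. It assumes a toy chain of identical layers whose weights are all 1 (a simplification that is not from the source), so the backpropagated gradient is simply the product of the local derivatives. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def chained_grad(grad_fn, x, depth):
    # Assumption: every layer sees the same input x and has weight 1,
    # so the chain rule reduces to the local derivative raised to `depth`.
    return grad_fn(x) ** depth

# Best case for the sigmoid (x = 0) versus an active ReLU unit (x = 1).
sigmoid_20_layers = chained_grad(sigmoid_grad, 0.0, 20)
relu_20_layers = chained_grad(relu_grad, 1.0, 20)

print("sigmoid, 20 layers:", sigmoid_20_layers)
print("relu,    20 layers:", relu_20_layers)
```

Even at its most favorable input the sigmoid scales the gradient by at most 0.25 per layer, so the product shrinks geometrically with depth, while an active ReLU passes the gradient through unchanged. The same `relu_grad` returns 0 for any negative input, which is the mechanism behind the dying ReLU problem.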

Method

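The Gated Linear Unit (GLU) framework computes two vectors from the pre-activation, a content vector and a gate vector, passes the gate through an activation function, and multiplies the two element-wise. This gives the network flexible, learned control over which signals propagate. SwiGLU is the variant whose gate activation is Swish (SiLU).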
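
The gating step can be sketched in plain Python as follows. This is an illustrative SwiGLU-style feed-forward block, not the source's implementation: it assumes bias-free projections, uses separate weight matrices for the gate and content paths (equivalent to splitting one wider projection in half), and all names (`W_gate`, `W_content`, `W_out`, `swiglu_ffn`) are hypothetical.

```python
import math

def silu(x):
    # Swish / SiLU: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_content, W_out):
    # Gate path: linear projection followed by SiLU.
    gate = [silu(g) for g in matvec(W_gate, x)]
    # Content path: plain linear projection.
    content = matvec(W_content, x)
    # Element-wise product lets the gate scale each content feature.
    hidden = [g * c for g, c in zip(gate, content)]
    # Project back to the model dimension.
    return matvec(W_out, hidden)

# Toy sizes: model dimension 2, hidden dimension 3.
x = [1.0, -2.0]
W_gate = [[0.5, -0.1], [0.3, 0.8], [-0.7, 0.2]]
W_content = [[0.2, 0.4], [-0.6, 0.1], [0.9, -0.3]]
W_out = [[0.1, 0.2, 0.3], [-0.4, 0.5, -0.6]]

y = swiglu_ffn(x, W_gate, W_content, W_out)
print(y)
```

Setting every entry of `W_gate` to zero drives the gate to `silu(0) = 0`, which closes the gate and zeroes the output regardless of the content path. That is the sense in which the gate controls signal propagation.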

In practice

Use SwiGLU for state-of-the-art LLMs.
Consider Squared ReLU for efficiency in large models.
Implement PReLU when dying ReLU is a concern.
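
For reference, the scalar forms of the activations named above can be written in a few lines of plain Python. This is a sketch under stated assumptions: the function names are hypothetical, the Leaky ReLU slope of 0.01 is a commonly used default rather than a value from the source, GELU is given in its exact error-function form, and the PReLU slope is passed as an ordinary argument, whereas in a real model it would be a trained parameter.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Fixed small slope for negative inputs (0.01 is an assumed default).
    return x if x > 0 else slope * x

def prelu(x, slope):
    # Same shape as Leaky ReLU, but `slope` is meant to be learned.
    return x if x > 0 else slope * x

def silu(x):
    # Swish / SiLU: x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def squared_relu(x):
    # ReLU followed by squaring.
    return relu(x) ** 2

for name, fn in [("relu", relu), ("leaky_relu", leaky_relu),
                 ("silu", silu), ("gelu", gelu),
                 ("squared_relu", squared_relu)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
```

Unlike ReLU, the Leaky ReLU and PReLU variants return a non-zero value, and therefore a non-zero gradient, for negative inputs, which is why they help when dying neurons are a concern.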

Topics

Activation Functions
Feed Forward Networks
Vanishing Gradient Problem
Gated Linear Units
SwiGLU

Best for: AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Jia-Bin Huang.