How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Transformer feed-forward networks (FFNs) vary widely in linear recoverability (R²_lin), the share of held-out output variance that a single affine layer captures. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R²_lin is non-monotone and ranges from nearly perfectly linear (>0.99) to strongly nonlinear (<0.3). Recoverability is a learned property of individual trained blocks rather than a consequence of the architecture or the activation function: same-width GELU models show different profiles. A low-rank bilinear probe shows that the FFN residual is not low-order multiplicative. High R²_lin can arise from either low-rank, outlier-concentrated structure or high-rank, broadly linear structure. The measurement also works as a compression signal, identifying blocks that tolerate a ×8 parameter reduction with minimal perplexity impact (e.g., GPT-2's early FFN at a cost of +0.77 perplexity). Because transformer activations are ill-conditioned, the study argues that accurate measurement requires a closed-form least-squares fit.

Key takeaway

For AI Scientists and Research Scientists evaluating Transformer FFNs for compression or interpretability: FFN linearity is a learned, block-specific property, not one dictated by architecture. Use closed-form least-squares to measure linear recoverability (R²_lin) and effective rank, because trained linear baselines can understate linearity when activations are ill-conditioned. The measurement identifies specific FFN blocks suitable for significant parameter reduction (e.g., ×8) and gives a more nuanced picture of their computational structure, which can guide efficient model design and analysis.

Key insights

The linearity of Transformer FFN blocks is a learned, heterogeneous property rather than an architectural one, and measuring it precisely yields a usable compression signal.

Method

Decompose each FFN's input-output map into an exact least-squares affine approximation plus a residual. Measure linear recoverability (R²_lin) as the variance explained on held-out data. Probe the residual with low-rank bilinear layers.
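The first two steps of the method can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's code: the data are synthetic, `toy_ffn` is a hypothetical stand-in for a trained block, the `gain` knob is invented here to make the block more or less nonlinear, and the names `fit_affine_lstsq` and `r2_lin` are chosen for this example. The affine fit uses `np.linalg.lstsq`, an SVD-based closed-form solver that tolerates ill-conditioned inputs, and R²_lin is computed on held-out rows.

```python
import numpy as np

def fit_affine_lstsq(X, Y):
    """Closed-form least-squares fit of Y ~ X @ W + b."""
    # Append a column of ones so the bias is solved jointly with W.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1], coef[-1]

def r2_lin(X, Y, W, b):
    """Fraction of output variance explained by the affine map on (X, Y)."""
    resid = Y - (X @ W + b)
    total = Y - Y.mean(axis=0, keepdims=True)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def toy_ffn(X, W1, W2, gain):
    """Hypothetical two-layer FFN; `gain` scales the pre-activations (assumption)."""
    return gelu(gain * (X @ W1)) @ W2

def held_out_r2(X, Y, n_train):
    """Fit on the first n_train rows, score on the remaining rows."""
    W, b = fit_affine_lstsq(X[:n_train], Y[:n_train])
    return r2_lin(X[n_train:], Y[n_train:], W, b)

rng = np.random.default_rng(0)
d, hidden, n, n_train = 16, 64, 6000, 4000
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, hidden)) / np.sqrt(d)
W2 = rng.standard_normal((hidden, d)) / np.sqrt(hidden)

# Small pre-activations keep GELU close to its linear regime;
# large ones push it toward ReLU-like behavior.
r2_small = held_out_r2(X, toy_ffn(X, W1, W2, gain=0.05), n_train)
r2_large = held_out_r2(X, toy_ffn(X, W1, W2, gain=3.0), n_train)

# Sanity check: an exactly affine map should be recovered almost perfectly.
A, c = rng.standard_normal((d, d)), rng.standard_normal(d)
r2_affine = held_out_r2(X, X @ A + c, n_train)

print(f"R2_lin small gain: {r2_small:.4f}")
print(f"R2_lin large gain: {r2_large:.4f}")
print(f"R2_lin affine map: {r2_affine:.6f}")
```

To apply the same measurement to a real model, `X` and `Y` would be the recorded inputs and outputs of one FFN block over a token sample. The residual `Y - (X @ W + b)` is what the third step would hand to a bilinear probe, which this sketch does not implement.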

Best for: AI Engineer, Machine Learning Engineer, NLP Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.