How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural
Summary
Transformer feed-forward networks (FFNs) differ widely in linear recoverability (R²₋lin), the share of a block's input-output behavior that a single affine layer captures. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R²₋lin is non-monotone and ranges from nearly perfectly linear (>0.99) to strongly nonlinear (<0.3). Recoverability is a learned property of individual trained blocks, not a consequence of the architecture or the activation function: same-width GELU models show different profiles. A low-rank bilinear probe shows that the residual left after the linear fit is not low-order multiplicative. High R²₋lin can stem from either low-rank, outlier-concentrated structure or high-rank, broadly linear structure. The measurement also works as a compression signal, identifying blocks that tolerate a ×8 parameter reduction with minimal perplexity impact (e.g., GPT-2's early FFN, at a cost of +0.77 perplexity). Because transformer activations are ill-conditioned, the study finds that closed-form least-squares is necessary for an accurate measurement.
Key takeaway
For AI scientists and research scientists evaluating Transformer FFNs for compression or interpretability: treat FFN linearity as a learned, block-specific property that the architecture does not dictate. Measure linear recoverability (R²₋lin) and effective rank with closed-form least-squares, because linear baselines fitted by training rather than solved exactly can understate linearity. The measurement identifies specific FFN blocks that tolerate a large parameter reduction (e.g., ×8) and gives a more nuanced picture of their computational structure, which can guide model design and analysis.
Key insights
The linearity of Transformer FFN blocks is a learned, heterogeneous property, not architectural, and can be precisely measured for compression.
Principles
- FFN linearity is learned, not architecturally defined.
- Residual FFN computation is not low-order multiplicative.
- Closed-form least-squares is crucial for FFN linearity measurement.
Method
Decompose FFN input-output maps into an exact least-squares linear approximation and a residual. Measure linear recoverability (R²₋lin) using held-out variance explained. Probe residuals with low-rank bilinear layers.
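The measurement described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the study's code: the toy GELU block, its random weights, the Gaussian inputs, and the function names `fit_affine` and `linear_recoverability` are all hypothetical, and pooling the variance over output dimensions is one plausible reading of "held-out variance explained". The bilinear probe on the residual is not covered here.

```python
import numpy as np

def fit_affine(X, Y):
    """Closed-form least-squares fit of Y ~ X @ W + b."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1], coef[-1]  # W has shape (d_in, d_out), b has shape (d_out,)

def linear_recoverability(X_train, Y_train, X_test, Y_test):
    """Held-out variance explained by the affine fit, plus the residual."""
    W, b = fit_affine(X_train, Y_train)
    residual = Y_test - (X_test @ W + b)
    ss_res = np.sum(residual ** 2)
    # Assumption: total variance is pooled over all output dimensions.
    ss_tot = np.sum((Y_test - Y_test.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot, residual

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

# Hypothetical stand-ins for recorded FFN inputs and outputs.
rng = np.random.default_rng(0)
d, hidden, n = 16, 64, 6000
X = rng.normal(size=(n, d))
W_in = rng.normal(size=(d, hidden)) / np.sqrt(d)
W_out = rng.normal(size=(hidden, d)) / np.sqrt(hidden)

Y_ffn = gelu(X @ W_in) @ W_out              # toy nonlinear FFN block
Y_affine = X @ rng.normal(size=(d, d)) + 1.0  # exactly affine reference map

split = n // 2  # first half fits the map, second half is held out
r2_ffn, residual_ffn = linear_recoverability(
    X[:split], Y_ffn[:split], X[split:], Y_ffn[split:])
r2_affine, _ = linear_recoverability(
    X[:split], Y_affine[:split], X[split:], Y_affine[split:])

print(f"toy GELU FFN: {r2_ffn:.3f}   affine reference: {r2_affine:.3f}")
```

With real models, `X` and `Y_ffn` would be activations captured at the input and output of one FFN block; the returned residual is what a follow-up probe would be trained on.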
In practice
- Use R²₋lin to identify FFN blocks for ×8 parameter compression.
- Report R² and ΔPPL for FFN compression studies.
- Employ closed-form least-squares for FFN linearity analysis.
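The advice to use closed-form least-squares can be motivated with a toy case. The sketch below is not the study's experiment; it assumes synthetic inputs with one outlier dimension 100 times larger than the rest, an exactly linear target that depends only on the small dimensions, and plain gradient descent with a 200-step budget as a stand-in for a trained linear baseline. All names and constants are illustrative.

```python
import numpy as np

def held_out_r2(W, X_test, Y_test):
    """Variance explained on held-out data by the linear map W."""
    ss_res = np.sum((Y_test - X_test @ W) ** 2)
    ss_tot = np.sum((Y_test - Y_test.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d, k = 4000, 8, 4

# Ill-conditioned inputs: one outlier dimension dominates the scale.
scales = np.ones(d)
scales[0] = 100.0
X = rng.normal(size=(n, d)) * scales

# Exactly linear target that ignores the outlier dimension.
A = rng.normal(size=(d, k))
A[0] = 0.0
Y = X @ A

split = n // 2
X_tr, Y_tr, X_te, Y_te = X[:split], Y[:split], X[split:], Y[split:]

# Closed-form least-squares solution.
W_closed, *_ = np.linalg.lstsq(X_tr, Y_tr, rcond=None)

# Plain gradient descent on the same objective. The stable step size is
# capped by the largest Hessian eigenvalue, which the outlier dimension
# sets, so the small-scale directions converge very slowly.
H = X_tr.T @ X_tr / split
lr = 1.0 / np.linalg.eigvalsh(H)[-1]  # eigvalsh returns ascending eigenvalues
W_gd = np.zeros((d, k))
for _ in range(200):
    W_gd -= lr * (X_tr.T @ (X_tr @ W_gd - Y_tr) / split)

r2_closed = held_out_r2(W_closed, X_te, Y_te)
r2_gd = held_out_r2(W_gd, X_te, Y_te)
print(f"closed form: {r2_closed:.4f}   gradient descent (200 steps): {r2_gd:.4f}")
```

The map here is perfectly linear, yet the iterative fit reports a much lower R² within its step budget. Adaptive optimizers, input normalization, or longer training would narrow the gap, but the exact solve removes the dependence on those choices.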
Topics
- Transformer FFNs
- Linear Recoverability
- Model Compression
- Activation Distillation
- GPT-2
- LLaMA Architecture
Best for: AI Engineer, Machine Learning Engineer, NLP Engineer, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.