Hierarchical vs. Flat Iteration in Shared-Weight Transformers
Summary
A new empirical study compares the representational quality of hierarchically structured, shared-weight recurrence with that of independent-layer stacking in Transformer-based language models. It introduces HRM-LM, a model that replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module that performs local refinement at every step and a Slow module that performs global compression every T steps. This recurrent hierarchy is unrolled for M = N x T steps using shared parameters. The key finding, observed consistently across five independent runs and supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B), is a significant empirical performance gap between hierarchical and flat iteration.
Key takeaway
Research scientists exploring novel Transformer architectures should carefully weigh the trade-offs between hierarchical recurrence and traditional independent-layer stacking. The observed empirical gap suggests that while parameter sharing offers efficiency, it may come at a cost in representational quality, one that further architectural innovation or new training strategies would need to address.
Key insights
Hierarchical recurrence in Transformers shows a sharp empirical gap compared to independent-layer stacking.
Principles
- Shared-weight recurrence can be structured hierarchically.
- Two-speed modules enable local refinement and global compression.
Method
HRM-LM replaces L independent Transformer layers with a Fast module (every step) and a Slow module (every T steps), unrolled for M = N x T steps with shared parameters.
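The unrolling schedule described above can be illustrated with a minimal sketch. It shows only the control flow: the same two functions are reused at every step, the Fast update runs on each of the M = N x T steps, and the Slow update runs once every T steps. The summary does not specify how the two modules exchange state or where within a cycle the Slow update fires, so the function signatures, the name `unroll_two_speed`, the toy update rules, and the reading of N as the number of Slow cycles are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List, Tuple

State = List[float]


def unroll_two_speed(
    fast_step: Callable[[State, State], State],
    slow_step: Callable[[State, State], State],
    fast_state: State,
    slow_state: State,
    n_cycles: int,  # N (assumed here to be the number of Slow cycles)
    period: int,    # T: the Slow module runs once every T steps
) -> Tuple[State, State, List[str]]:
    """Unroll M = n_cycles * period steps, reusing the same two functions."""
    schedule: List[str] = []
    total_steps = n_cycles * period  # M = N x T
    for step in range(1, total_steps + 1):
        # Fast module: applied at every step, conditioned on the slow state.
        fast_state = fast_step(fast_state, slow_state)
        schedule.append("F")
        # Slow module: applied only at the end of each T-step cycle
        # (placement within the cycle is an assumption).
        if step % period == 0:
            slow_state = slow_step(slow_state, fast_state)
            schedule.append("S")
    return fast_state, slow_state, schedule


# Hypothetical stand-ins for the shared-weight modules. In a real model
# these would be Transformer blocks whose parameters are reused per call.
def toy_fast(fast: State, slow: State) -> State:
    return [0.5 * f + 0.5 * s for f, s in zip(fast, slow)]


def toy_slow(slow: State, fast: State) -> State:
    return [0.9 * s + 0.1 * f for s, f in zip(slow, fast)]


if __name__ == "__main__":
    _, _, schedule = unroll_two_speed(
        toy_fast, toy_slow, [1.0, 2.0], [0.0, 0.0], n_cycles=2, period=3
    )
    print("".join(schedule))  # FFFSFFFS
```

With N = 2 and T = 3, the Fast function is called six times and the Slow function twice, yet only two sets of parameters exist regardless of M. A flat, Universal-Transformer-style iteration would correspond to calling a single shared function at every step with no slower second timescale.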
Topics
- Shared-Weight Transformers
- Hierarchical Recurrence
- Universal Transformer
- Language Models
- Representational Quality
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.