A Dual-Path Architecture for Scaling Compute and Capacity in LLMs
Summary
Looped transformers scale compute efficiently by reapplying the same layers, but at a fixed FLOP budget they lack capacity. A dual-path architecture addresses this limitation with a single layer that contains two parallel pathways: a deep sublayer, reapplied K times with shared parameters to scale compute, and a wide sublayer, an enlarged feed-forward network applied once to scale capacity. Independent per-token gates combine the two pathways dynamically. The dual-path model outperforms FLOP-matched baselines on language modeling and downstream evaluations at two FLOP budgets, while using fewer parameters than those baselines. The learned gates reveal systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
Key takeaway
For AI Scientists and Machine Learning Engineers designing large language models, this dual-path architecture offers a way to scale compute and capacity together, which looped transformers alone do not provide. Consider it when optimizing for a specific FLOP budget: the reported results show better language modeling performance than FLOP-matched baselines with fewer parameters. The learned gates also give an interpretable view of how individual tokens are routed.
Key insights
A dual-path architecture scales LLM compute and capacity simultaneously using deep and wide sublayers with per-token gates.
Principles
- Scaling compute and capacity can be decoupled.
- Per-token routing optimizes resource allocation.
- Shared parameters enable parameter-efficient scaling.
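
The principles above can be made concrete with a rough cost count. The sketch below is illustrative only: it assumes both sublayers are plain two-matrix feed-forward networks, ignores attention, biases, and gate parameters, and counts one multiply-add as 2 FLOPs. The function names and the example dimensions are hypothetical and not taken from the original article.

```python
def ffn_params(d_model, d_hidden):
    """Weights in a two-matrix FFN: d_model -> d_hidden -> d_model (no biases)."""
    return 2 * d_model * d_hidden


def ffn_flops_per_token(d_model, d_hidden):
    """Approximate forward FLOPs per token, counting a multiply-add as 2 FLOPs."""
    return 2 * ffn_params(d_model, d_hidden)


def dual_path_cost(d_model, d_deep, d_wide, K):
    """Parameter and per-token FLOP count for one dual-path layer (assumed form)."""
    deep_params = ffn_params(d_model, d_deep)   # stored once, reused K times
    wide_params = ffn_params(d_model, d_wide)   # stored once, applied once
    flops = (K * ffn_flops_per_token(d_model, d_deep)
             + ffn_flops_per_token(d_model, d_wide))
    return {"params": deep_params + wide_params, "flops_per_token": flops}


# Raising K adds compute but no parameters; widening d_wide adds both.
one_loop = dual_path_cost(d_model=512, d_deep=2048, d_wide=8192, K=1)
four_loops = dual_path_cost(d_model=512, d_deep=2048, d_wide=8192, K=4)
```

Under these assumptions, `one_loop` and `four_loops` report the same parameter count but different FLOPs, which is the sense in which the deep path scales compute and the wide path scales capacity.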
Method
The dual-path block applies a deep sublayer K times and a wide sublayer once, combining outputs via independent per-token gates for flexible compute/capacity scaling.
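
The method can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the deep sublayer is reduced to a small feed-forward network (the real one may include attention), each gate is assumed to be a sigmoid over a learned linear score of the token, and the outputs are assumed to be added residually. `DualPathBlock` and all of its attribute names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def ffn(x, W_in, W_out):
    """Two-layer feed-forward network with a ReLU in between."""
    hidden = [max(0.0, h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DualPathBlock:
    """Hypothetical dual-path block that processes one token vector at a time."""

    def __init__(self, d_model, d_deep, d_wide, K, seed=0):
        rng = random.Random(seed)
        self.K = K
        # Deep path: one small FFN whose weights are reused on every loop.
        self.deep_in = rand_matrix(d_deep, d_model, rng)
        self.deep_out = rand_matrix(d_model, d_deep, rng)
        # Wide path: one enlarged FFN applied a single time.
        self.wide_in = rand_matrix(d_wide, d_model, rng)
        self.wide_out = rand_matrix(d_model, d_wide, rng)
        # Assumed gate form: one linear scorer per path, squashed by a sigmoid.
        self.gate_deep_w = rand_matrix(1, d_model, rng)
        self.gate_wide_w = rand_matrix(1, d_model, rng)

    def gates(self, x):
        """Independent per-token gates; they are not forced to sum to 1."""
        g_deep = sigmoid(matvec(self.gate_deep_w, x)[0])
        g_wide = sigmoid(matvec(self.gate_wide_w, x)[0])
        return g_deep, g_wide

    def forward_token(self, x):
        # Deep path: apply the same sublayer K times with residual updates.
        h = list(x)
        for _ in range(self.K):
            update = ffn(h, self.deep_in, self.deep_out)
            h = [a + b for a, b in zip(h, update)]
        deep_delta = [a - b for a, b in zip(h, x)]
        # Wide path: a single pass through the larger FFN.
        wide_delta = ffn(x, self.wide_in, self.wide_out)
        # Combine both paths with the token's own gates.
        g_deep, g_wide = self.gates(x)
        return [xi + g_deep * d + g_wide * w
                for xi, d, w in zip(x, deep_delta, wide_delta)]

    def num_params(self):
        mats = [self.deep_in, self.deep_out, self.wide_in, self.wide_out,
                self.gate_deep_w, self.gate_wide_w]
        return sum(len(m) * len(m[0]) for m in mats)


block = DualPathBlock(d_model=4, d_deep=8, d_wide=32, K=3)
token_out = block.forward_token([0.5, -1.0, 0.25, 2.0])
```

Because the gates are computed per token and independently of each other, a token can lean on the deep path, the wide path, both, or neither. Changing `K` in this sketch changes how much work the deep path does but leaves `num_params()` unchanged.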
In practice
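- Design LLMs whose compute and capacity can be scaled independently.
- Let learned per-token gates decide how much deep or wide processing each token type receives.
- Reduce parameter count without giving up performance at a matched FLOP budget.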
Topics
- Dual-Path Architecture
- Large Language Models
- Looped Transformers
- Per-Token Routing
- Parameter Efficiency
- Model Scaling
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.