CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention
Summary
CARVE (Content-Aware Recurrent with Value Efficiency) is a new architecture for chunk-parallel linear attention that resolves three defects in the leading delta-rule architecture, GDN-2: gating that is blind to memory contents, inefficient value-axis erase masks, and a formulation that mathematically rules out the WY-form triangular chunk solver. CARVE's core principle is to erase only on the key axis, which is proven necessary and sufficient for the WY-form solver. It reuses the recurrent output tensor as a free content signal for the erase gate and replaces the per-value write-gate projection with a single scalar per head. Trained at 1.3B parameters on 100B tokens, CARVE achieves a WikiText perplexity of 15.72, a 0.18 improvement over GDN-2. It also leads all recurrent baselines on nine common-sense reasoning benchmarks and achieves top results on every RULER retrieval probe, with only 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters.
Key takeaway
For machine learning engineers working on recurrent language models, CARVE is worth evaluating as an alternative to GDN-2. It improves WikiText perplexity by 0.18 and leads recurrent baselines on common-sense reasoning benchmarks, while reducing peak memory by 13% and parameter count by 19%.
Key insights
CARVE improves recurrent models through content-aware, key-axis memory erasure, which keeps chunk-parallel training efficient while improving accuracy.
Principles
- Erasing memory only on the key axis is necessary and sufficient for the WY-form chunk solver.
- Reusing recurrent output as a content signal for gates is efficient.
- Replacing per-value write-gates with a single scalar reduces parameters.
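As a rough illustration of the first principle, the sketch below checks the standard WY identity in NumPy: a product of rank-one erase factors that act only on the key axis collapses to an identity-minus-low-rank form obtained from one triangular system. This is the generic delta-rule identity, not CARVE's or GDN-2's actual kernel; the function names, the shapes, and the scalar erase strengths `beta` are assumptions made for illustration.

```python
import numpy as np

def sequential_product(K, beta):
    """Multiply the erase factors (I - beta_i k_i k_i^T) one step at a time."""
    n, d = K.shape
    P = np.eye(d)
    for i in range(n):
        k = K[i]
        # Newer factors are applied on the left of the running product.
        P = (np.eye(d) - beta[i] * np.outer(k, k)) @ P
    return P

def wy_product(K, beta):
    """Same product in WY form: P = I - K^T W, with W from one triangular system."""
    n, d = K.shape
    # Strictly lower-triangular key-key interactions inside the chunk.
    L = np.tril(K @ K.T, k=-1)
    A = np.eye(n) + beta[:, None] * L  # unit lower triangular, always invertible
    # np.linalg.solve is a general solver used here for brevity;
    # a dedicated triangular solver would exploit the structure of A.
    W = np.linalg.solve(A, beta[:, None] * K)
    return np.eye(d) - K.T @ W

# Hypothetical chunk of 8 keys with dimension 4.
rng = np.random.default_rng(0)
K = rng.standard_normal((8, 4))
beta = rng.uniform(0.0, 1.0, size=8)
print(np.max(np.abs(sequential_product(K, beta) - wy_product(K, beta))))
```

The printed value is the largest elementwise gap between the two computations and should sit at floating-point noise level. The point of the sketch is only that every factor touches the key axis alone, which is what lets the whole chunk be folded into a single solve.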
Method
CARVE resolves the GDN-2 defects by erasing memory solely on the key axis, reusing the recurrent output as the content signal for the erase gate, and replacing the per-value write-gate projection with a single scalar per head.
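To make the method description concrete, here is a minimal sketch of one delta-rule step with a key-axis erase and a scalar write gate, plus the parameter arithmetic behind swapping a per-value gate projection for one scalar per head. The summary does not give CARVE's equations, so the update rule, the names `delta_step` and `write_gate_params`, the scalar `erase` strength, and the example sizes are all assumptions. In particular, how the erase gate is derived from the recurrent output is not shown here.

```python
import numpy as np

def delta_step(S, k, v, erase, write):
    """One recurrent step on a per-head state S of shape (d_k, d_v).

    erase: assumed scalar in [0, 1]; strength of erasure along key direction k.
    write: assumed scalar in [0, 1]; one write gate for the whole head.
    """
    k = k / np.linalg.norm(k)            # unit-norm key
    S = S - erase * np.outer(k, k @ S)   # erase along the key axis only
    S = S + write * np.outer(k, v)       # write the new value under key k
    return S

def write_gate_params(d_model, d_v, n_heads):
    """Weight counts for a write-gate projection (biases ignored)."""
    per_value = d_model * n_heads * d_v  # one gate per value channel per head
    scalar = d_model * n_heads           # one gate per head
    return per_value, scalar

# Hypothetical sizes, not taken from the article.
S = np.zeros((4, 3))
k = np.array([1.0, 2.0, 0.0, 0.0])
S = delta_step(S, k, np.array([1.0, 2.0, 3.0]), erase=1.0, write=1.0)
S = delta_step(S, k, np.array([-1.0, 0.0, 5.0]), erase=1.0, write=1.0)
print(k / np.linalg.norm(k) @ S)         # read back the value stored under k
print(write_gate_params(d_model=2048, d_v=128, n_heads=16))
```

With full erase and write strength, reading the state with the same key returns the most recently written value, because the old association is removed before the new one is stored. The parameter counts are only meant to show the direction of the saving; they do not reproduce the 19% figure reported for the full model.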
In practice
- Achieve WikiText perplexity 15.72 with 1.3B parameters.
- Lead all recurrent baselines, including GDN-2, on nine common-sense reasoning benchmarks.
- Reduce memory by 13% and parameters by 19% in recurrent models.
Topics
- Linear Attention
- Recurrent Neural Networks
- CARVE Architecture
- Model Efficiency
- Language Modeling
- Common-Sense Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.