CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

CARVE (Content-Aware Recurrent with Value Efficiency) is an architecture for chunk-parallel linear attention that targets three defects in GDN-2, the leading delta-rule architecture: memory-blind gating, inefficient value-axis erase masks, and a design that mathematically prevents use of the WY-form triangular chunk solver. Its core principle is to erase only on the key axis, which is proven necessary and sufficient for the WY-form solver. CARVE reuses the recurrent output tensor as a free content signal for the erase gate and replaces the per-value write-gate projection with a single scalar per head. At 1.3B parameters, trained on 100B tokens, CARVE reaches a WikiText perplexity of 15.72, which is 0.18 lower than GDN-2. It also leads all recurrent baselines on nine common-sense reasoning benchmarks and achieves the top result on every RULER retrieval probe, with only 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters.
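
One way to see why key-axis erasure is compatible with a triangular chunk solver: if each token's erase step is a right-multiplication of the state by I - u_i w_i^T on the key axis, the product over a chunk has the standard WY form I - Y^T W, where Y comes from a single unit-triangular solve. The NumPy sketch below checks this identity numerically. It is an illustration under assumptions, not CARVE's actual kernel: the gating form u_i = k_i, w_i = erase_i * k_i and all function names are hypothetical.

```python
import numpy as np

def chunk_transition_sequential(U, W):
    """Multiply the per-token transitions (I - u_i w_i^T) one at a time."""
    d = U.shape[1]
    P = np.eye(d)
    for u, w in zip(U, W):
        P = P @ (np.eye(d) - np.outer(u, w))
    return P

def chunk_transition_wy(U, W):
    """Same product in WY form, P = I - Y^T W, using one triangular system."""
    n, d = U.shape
    G = W @ U.T                          # G[i, j] = w_i . u_j
    L = np.eye(n) + np.triu(G, k=1).T    # unit lower-triangular matrix
    # Rows of Y satisfy y_j = u_j - sum_{i<j} (w_i . u_j) y_i.
    Y = np.linalg.solve(L, U)
    return np.eye(d) - Y.T @ W

rng = np.random.default_rng(0)
n, d_k = 8, 16                           # chunk length and key dimension
K = rng.standard_normal((n, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
erase = rng.uniform(size=(n, d_k))       # hypothetical per-key-channel gates

# Assumed key-axis form: u_i = k_i, w_i = erase_i * k_i.
P_seq = chunk_transition_sequential(K, erase * K)
P_wy = chunk_transition_wy(K, erase * K)
print(np.allclose(P_seq, P_wy))
```

Because the transition is one d_k-by-d_k matrix shared by every value row, the whole chunk collapses into a single solve. A per-value mask would give each row of the state its own transition, which is consistent with the summary's point that value-axis erasure blocks this solver. A production kernel would use a dedicated triangular solver rather than the general np.linalg.solve used here.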

Key takeaway

For machine learning engineers working on recurrent sequence models, CARVE is a concrete upgrade over the GDN-2 delta-rule design. If you are deploying large recurrent models, consider adopting CARVE for language modeling and common-sense reasoning workloads. It improves WikiText perplexity by 0.18 over GDN-2 while reducing peak memory by 13% and parameter count by 19%, at a throughput overhead of only 0.4%.

Key insights

CARVE improves recurrent models through content-aware memory erasure on the key axis, which keeps chunk-parallel training efficient and yields stronger language-modeling, reasoning, and retrieval results.

Method

CARVE resolves the GDN-2 defects by erasing memory solely on the key axis, reusing the recurrent output as the content signal for the erase gate, and replacing the per-value write-gate projection with a single scalar per head.
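
The shape of such an update can be sketched as a single recurrent step. This is a minimal illustration assuming a delta-rule state S of shape (d_v, d_k), a unit-norm key, an erase gate with one entry per key channel, and one scalar write gate per head. The exact CARVE equations, including how the erase gate is derived from the recurrent output, are not given in this summary, so the gates are passed in as inputs and the function name is hypothetical.

```python
import numpy as np

def key_axis_step(S, k, v, erase, write):
    """One assumed delta-rule step with key-axis erase and scalar write gate.

    S:     (d_v, d_k) recurrent state for one head
    k:     (d_k,) unit-norm key
    v:     (d_v,) value
    erase: (d_k,) gate in [0, 1], one entry per key channel (assumption)
    write: scalar gate shared by all value channels of the head
    """
    # Erase is a right-multiplication on the key axis, so every value row
    # of S sees the same (d_k, d_k) transition matrix.
    A = np.eye(k.shape[0]) - np.outer(k, erase * k)
    return S @ A + write * np.outer(v, k)

rng = np.random.default_rng(0)
d_v, d_k = 4, 6
S = rng.standard_normal((d_v, d_k))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)

# Full erase and unit write recover the classic delta rule:
# reading the new state at k returns exactly v.
S_full = key_axis_step(S, k, v, erase=np.ones(d_k), write=1.0)
print(np.allclose(S_full @ k, v))

# Closed gates leave the memory untouched.
S_none = key_axis_step(S, k, v, erase=np.zeros(d_k), write=0.0)
print(np.allclose(S_none, S))
```

In this sketch the write gate needs only one number per head instead of a d_v-wide projection, which is the kind of saving the summary attributes to CARVE's scalar write gate.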

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.