Mixture-of-Depths Attention

2026-03-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Advanced, quick

Summary

Mixture-of-Depths Attention (MoDA) is a mechanism for mitigating signal degradation in deep Large Language Models (LLMs), where features formed in shallow layers are diluted by repeated residual updates. MoDA lets each attention head attend to Key-Value (KV) pairs from both the current layer and preceding layers. A hardware-efficient algorithm handles the resulting non-contiguous memory access and reaches 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. In experiments with 1.5B-parameter models, MoDA lowers average perplexity by 0.2 across 10 validation benchmarks and raises average performance by 2.11% on 10 downstream tasks, at a cost of only 3.7% additional FLOPs. MoDA also performs better when combined with post-norm than with pre-norm.

Key takeaway

For AI engineers developing or scaling large language models, MoDA offers a practical way to improve performance as depth grows without significant computational overhead. The reported gains were measured on 1.5B-parameter models, so treat results at other scales as something to verify rather than assume. If you adopt MoDA, pair it with a post-norm architecture, which the experiments found to work better than pre-norm for both perplexity and downstream task performance.

Key insights

MoDA improves LLM depth scaling by allowing attention heads to access KV pairs from multiple layers, mitigating signal degradation.

Principles

In deep LLMs, repeated residual updates dilute features formed in shallow layers.
Giving attention heads direct access to KV pairs from preceding layers improves performance.
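
The dilution described in the first principle can be illustrated with a toy model. The sketch below is not from the paper: it assumes every layer adds a random unit-norm update to the residual stream, with no normalization and no learning, and measures how much of the original shallow feature direction remains. The function name shallow_signal_cosine and all parameters are hypothetical.

```python
import numpy as np

def shallow_signal_cosine(depth, dim=4096, seed=0):
    """Cosine similarity between a shallow feature and the residual stream
    after `depth` additive updates (toy assumption: random unit-norm updates)."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(dim)
    x0 /= np.linalg.norm(x0)          # the shallow-layer feature
    h = x0.copy()                     # residual stream starts as that feature
    for _ in range(depth):
        u = rng.standard_normal(dim)
        u /= np.linalg.norm(u)        # one layer's residual update
        h = h + u
    return float(h @ x0 / np.linalg.norm(h))

# Share of the shallow direction left in the stream at increasing depth.
cosines = {L: shallow_signal_cosine(L) for L in (1, 8, 64)}
for L, c in cosines.items():
    print(f"depth {L:>2}: cosine with shallow feature = {c:.3f}")
```

Under these assumptions the similarity shrinks roughly as 1/sqrt(1 + depth), which is the intuition behind letting deeper layers read earlier-layer KV pairs directly instead of relying on the residual stream alone.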

Method

MoDA lets each attention head attend to the current layer's KV pairs (sequence KV) together with depth KV pairs taken from preceding layers. A hardware-efficient algorithm handles the non-contiguous memory access this introduces.
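
A minimal NumPy sketch of the idea for a single head follows. It is an illustration under stated assumptions, not the hustvl/MoDA implementation: it assumes the depth KV pairs for a token are that same token's keys and values from P preceding layers, and that sequence and depth scores share one softmax. The function moda_head and its argument names are hypothetical, and the hardware-efficient kernel is not represented here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_head(q, k_seq, v_seq, k_depth, v_depth):
    """Single-head attention over current-layer KV plus depth KV (illustrative).

    q, k_seq, v_seq:  (T, d)     current-layer queries, keys, values
    k_depth, v_depth: (T, P, d)  each token's keys/values from P preceding
                                 layers (assumed layout, not the paper's)
    """
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)

    # Causal attention scores over the current layer's sequence.
    seq_scores = (q @ k_seq.T) * scale                      # (T, T)
    causal = np.tril(np.ones((T, T), dtype=bool))
    seq_scores = np.where(causal, seq_scores, -np.inf)

    # Scores of each token's query against its own earlier-layer keys.
    depth_scores = np.einsum("td,tpd->tp", q, k_depth) * scale  # (T, P)

    # Assumption: one softmax jointly over sequence and depth entries.
    weights = softmax(np.concatenate([seq_scores, depth_scores], axis=1), axis=1)
    w_seq, w_depth = weights[:, :T], weights[:, T:]

    out = w_seq @ v_seq + np.einsum("tp,tpd->td", w_depth, v_depth)
    return out, weights

rng = np.random.default_rng(0)
T, P, d = 5, 3, 8
q, k_seq, v_seq = (rng.standard_normal((T, d)) for _ in range(3))
k_depth, v_depth = (rng.standard_normal((T, P, d)) for _ in range(2))
out, weights = moda_head(q, k_seq, v_seq, k_depth, v_depth)
print(out.shape, weights.shape)
```

In this naive form the depth keys and values come from buffers written by different layers, which gives a sense of why a fused kernel has to deal with non-contiguous memory access.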

In practice

Consider MoDA when scaling LLMs to greater depth.
Prefer post-norm over pre-norm when using MoDA.

Topics

Mixture-of-Depths Attention
Large Language Models
Attention Mechanisms
Depth Scaling
Model Perplexity

Code references

hustvl/MoDA

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.