Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context
Summary
Nous Research has introduced Lighthouse Attention, a selection-based hierarchical attention mechanism for long-context pretraining. The method pools the Query, Key, and Value tensors symmetrically across a multi-level pyramid, performs selection outside the attention kernel, and then runs stock FlashAttention on the resulting small dense subsequence. At 512K context on a single B200 GPU, its forward pass is 21× faster and its combined forward+backward pass is 17.3× faster than cuDNN SDPA. End to end, Lighthouse Attention delivers a 1.40–1.69× pretraining wall-clock speedup at 98K context with matched or lower final training loss. After Lighthouse training, a brief dense-SDPA resumption recovers a full-attention model that outperforms a dense-from-scratch baseline (loss 0.6980 vs. 0.7237) within the same ~50.3B token budget. The approach scales to 1M-token training across 32 Blackwell GPUs using standard ring attention and requires no sparse-aware collectives.
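To put the headline numbers in more familiar terms, the short calculation below converts the reported speedups into the share of wall-clock time saved and works out the reported loss gap. It uses only the figures quoted in the summary; the function and variable names are illustrative.

```python
def time_saved(speedup):
    """Fraction of wall-clock time saved for a given speedup factor."""
    return 1.0 - 1.0 / speedup

# End-to-end pretraining speedups reported at 98K context.
for s in (1.40, 1.69):
    print(f"{s:.2f}x speedup -> {time_saved(s):.1%} less wall-clock time")
# 1.40x speedup -> 28.6% less wall-clock time
# 1.69x speedup -> 40.8% less wall-clock time

# Final losses reported for the same ~50.3B token budget.
lighthouse_then_dense = 0.6980
dense_from_scratch = 0.7237
gap = dense_from_scratch - lighthouse_then_dense
print(f"loss gap: {gap:.4f} ({gap / dense_from_scratch:.2%} relative)")
# loss gap: 0.0257 (3.55% relative)
```

In other words, the reported range corresponds to roughly 29% to 41% less training time, with a final loss about 3.6% lower than the dense baseline.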
Key takeaway
For research scientists developing long-context large language models, Lighthouse Attention offers a way to cut pretraining wall-clock time without giving up a dense model at the end. You train with the hierarchical selection path, then switch to dense SDPA for a brief resumption to recover a full-attention model for inference; in the reported experiments, that model reached a lower loss than one trained densely from scratch on the same token budget. Consider this approach if long-context pretraining cost is the bottleneck in your training workflow.
Key insights
Lighthouse Attention accelerates long-context pretraining by decoupling selection from the attention kernel, enabling faster training and dense model recovery.
Principles
- Separate selection from attention kernel
- Utilize hierarchical pooling
- Recover dense model post-training
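The third principle implies a two-phase training schedule: most steps use the Lighthouse path, and a short tail of steps uses dense SDPA. A minimal sketch of such a schedule follows. The function name, the mode labels, and the tail length are hypothetical; the summary only says the dense resumption is "brief" and does not give its length.

```python
def attention_mode(step, total_steps, dense_tail_steps):
    """Pick the attention path for a training step (hypothetical schedule).

    The last `dense_tail_steps` steps use dense SDPA; all earlier steps use
    the Lighthouse path. The tail length is an assumption, not a reported value.
    """
    if not 0 <= dense_tail_steps <= total_steps:
        raise ValueError("dense_tail_steps must lie in [0, total_steps]")
    switch_step = total_steps - dense_tail_steps
    return "lighthouse" if step < switch_step else "dense_sdpa"


# Example: a 1000-step run with an assumed 50-step dense tail.
modes = [attention_mode(s, total_steps=1000, dense_tail_steps=50) for s in range(1000)]
print(modes.count("lighthouse"), modes.count("dense_sdpa"))  # 950 50
```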
Method
Lighthouse Attention pools Q, K, and V symmetrically across a multi-level pyramid, performs selection outside the attention kernel, and then runs stock FlashAttention on the resulting small dense subsequence. It is used during pretraining only.
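The data flow described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the Nous Research implementation: mean pooling, the scoring rule in `select_indices`, the per-level budget `keep`, and all function names are hypothetical, a plain softmax attention stands in for FlashAttention, and causal masking and mapping outputs back to full sequence length are omitted.

```python
import numpy as np

def mean_pool(x, factor):
    """Average non-overlapping blocks of `factor` tokens: (T, d) -> (T // factor, d)."""
    t, d = x.shape
    usable = t - t % factor  # drop a ragged tail, if any
    return x[:usable].reshape(usable // factor, factor, d).mean(axis=1)

def build_pyramid(x, factor, levels):
    """Level 0 is the raw sequence; each further level pools the one below it."""
    pyramid = [x]
    for _ in range(levels - 1):
        pyramid.append(mean_pool(pyramid[-1], factor))
    return pyramid

def select_indices(q_level, k_level, keep):
    """Hypothetical scoring rule: keep the entries whose keys respond most to the mean query."""
    scores = k_level @ q_level.mean(axis=0)
    keep = min(keep, scores.shape[0])
    top = np.argpartition(scores, -keep)[-keep:]
    return np.sort(top)  # preserve sequence order

def dense_attention(q, k, v):
    """Plain softmax attention, standing in for a stock dense kernel."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def lighthouse_like_attention(q, k, v, factor=4, levels=3, keep=8):
    # Symmetric pooling: Q, K, and V go through the same pyramid.
    qs, ks, vs = (build_pyramid(x, factor, levels) for x in (q, k, v))
    q_sel, k_sel, v_sel = [], [], []
    # Selection happens here, outside the attention kernel.
    for q_l, k_l, v_l in zip(qs, ks, vs):
        idx = select_indices(q_l, k_l, keep)
        q_sel.append(q_l[idx])
        k_sel.append(k_l[idx])
        v_sel.append(v_l[idx])
    # The gathered entries form one small dense subsequence.
    q_s, k_s, v_s = (np.concatenate(parts) for parts in (q_sel, k_sel, v_sel))
    return dense_attention(q_s, k_s, v_s)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 16)) for _ in range(3))
out = lighthouse_like_attention(q, k, v)
print(out.shape)  # (24, 16): 8 entries kept at each of 3 levels
```

The point of the structure is that the attention call itself sees only an ordinary dense input, 24 rows here instead of 256, so no custom sparse kernel is needed.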
In practice
- Achieve a 1.4–1.7× end-to-end pretraining speedup at 98K context
- Train at 1M-token context on 32 Blackwell GPUs with standard ring attention
- Recover a full-attention model for inference with a brief dense-SDPA resumption
Topics
- Nous Research
- Lighthouse Attention
- Hierarchical Attention
- Long Context Pretraining
- FlashAttention
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.