Nous Research Proposes Lighthouse Attention: A Training-Only Selection-Based Hierarchical Attention That Delivers 1.4–1.7× Pretraining Speedup at Long Context

Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Nous Research has introduced Lighthouse Attention, a selection-based hierarchical attention mechanism used only during long-context pretraining. The method pools the Query, Key, and Value tensors symmetrically across a multi-level pyramid, performs selection outside the attention kernel, and runs stock FlashAttention on the resulting small dense subsequence. At 512K context on a single B200 GPU, it is reported to be 21× faster in the forward pass and 17.3× faster in forward+backward than cuDNN SDPA. End to end, it delivers a 1.40–1.69× pretraining wall-clock speedup at 98K context with matched or lower final training loss. After Lighthouse training, a brief dense-SDPA resumption recovers a full-attention model that outperforms a dense-from-scratch baseline (loss 0.6980 vs. 0.7237) within the same ~50.3B token budget. The approach scales to 1M-token training across 32 Blackwell GPUs using standard ring attention and requires no sparse-aware collectives.
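To put the reported figures in more familiar terms, the short calculation below converts the end-to-end speedup range into the fraction of wall-clock time saved and computes the loss gap between the recovered model and the dense baseline. It uses only the numbers quoted in the summary; the helper name `time_saved` is an illustrative choice, not something from the source.

```python
def time_saved(speedup):
    """Fraction of wall-clock time removed by a given speedup factor."""
    return 1.0 - 1.0 / speedup

# Reported end-to-end pretraining speedup range at 98K context.
low, high = time_saved(1.40), time_saved(1.69)

# Reported final losses: Lighthouse + dense resumption vs. dense from scratch.
loss_gap = 0.7237 - 0.6980
relative_gap = loss_gap / 0.7237

print(f"time saved: {low:.1%} to {high:.1%}")   # time saved: 28.6% to 40.8%
print(f"loss gap: {loss_gap:.4f} ({relative_gap:.1%} relative)")  # loss gap: 0.0257 (3.6% relative)
```

A 1.40–1.69× speedup therefore corresponds to roughly 29–41% less training time for the same token budget.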

Key takeaway

For research scientists training long-context large language models, Lighthouse Attention offers a way to cut pretraining wall-clock time and then recover a dense model for inference. In the reported experiments, the recovered model reached a lower loss than a dense-from-scratch baseline on the same token budget. Consider evaluating this hierarchical selection approach where long-context training cost is the bottleneck.

Key insights

Lighthouse Attention accelerates long-context pretraining by decoupling selection from the attention kernel, enabling faster training and dense model recovery.

Method

Lighthouse Attention pools Q, K, V symmetrically across a multi-level pyramid, performs selection outside the attention kernel, and runs stock FlashAttention on a small dense subsequence for pretraining.
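The three steps in this description (symmetric pooling, selection outside the kernel, dense attention on a short subsequence) can be illustrated with a toy NumPy sketch. This is not the Nous Research implementation: it uses a single pooling level rather than a full pyramid, is single-head and non-causal, and omits mapping the output back to the full sequence. The scoring rule, the `block` and `top_k` parameters, and all function names are assumptions made for illustration, and plain softmax attention stands in for FlashAttention.

```python
import numpy as np

def mean_pool(x, block):
    """Average consecutive groups of `block` rows: (n, d) -> (n // block, d)."""
    n, d = x.shape
    assert n % block == 0, "toy sketch: length must be a multiple of block"
    return x.reshape(n // block, block, d).mean(axis=1)

def dense_attention(q, k, v):
    """Plain softmax attention; stands in for a stock fused kernel."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def select_subsequence(q, k, v, block=4, top_k=2):
    """Build a short mixed-resolution (q, k, v) and return the chosen block ids."""
    # Symmetric pooling: the same operator is applied to Q, K and V.
    qp, kp, vp = (mean_pool(t, block) for t in (q, k, v))
    # Hypothetical scoring rule: best match of any pooled query to each pooled key block.
    block_scores = (qp @ kp.T).max(axis=0)
    chosen = np.sort(np.argsort(block_scores)[-top_k:])
    chosen_set = set(chosen.tolist())
    parts_q, parts_k, parts_v = [], [], []
    for b in range(qp.shape[0]):
        if b in chosen_set:
            # Selected block: keep its tokens at full resolution.
            rows = slice(b * block, (b + 1) * block)
            parts_q.append(q[rows]); parts_k.append(k[rows]); parts_v.append(v[rows])
        else:
            # Unselected block: keep only its pooled summary row.
            parts_q.append(qp[b:b + 1]); parts_k.append(kp[b:b + 1]); parts_v.append(vp[b:b + 1])
    return (np.concatenate(parts_q), np.concatenate(parts_k),
            np.concatenate(parts_v), chosen)

rng = np.random.default_rng(0)
n, d = 32, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

# Selection happens here, in ordinary array code, before any attention call.
q_s, k_s, v_s, chosen = select_subsequence(q, k, v, block=4, top_k=2)
out = dense_attention(q_s, k_s, v_s)
print(q_s.shape, out.shape)  # (14, 8) (14, 8)
```

With 32 tokens, a block size of 4, and 2 selected blocks, the attention call sees 14 rows instead of 32: 8 full-resolution tokens plus 6 pooled summaries. Because the gathered subsequence is an ordinary dense tensor, the attention routine needs no sparsity support, which is the property the description attributes to running stock FlashAttention.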

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.