YaRN: Extending RoPE Without Breaking It
Summary
YaRN (Yet another RoPE extensioN) is an efficient method for extending the context window of pre-trained large language models, such as LLaMA 2, from 4,096 to 128,000 tokens at significantly reduced fine-tuning cost. It builds on and unifies prior techniques such as Position Interpolation and NTK-aware scaling: it uses "NTK-by-parts" interpolation to rescale RoPE frequencies per dimension, and it adds a temperature on the attention logits to counteract the growth in attention entropy over long sequences. Experiments show YaRN achieves 99.4% passkey retrieval accuracy at 128K context, preserves short-context performance, and offers up to 16x greater training efficiency than other methods. Its compatibility with FlashAttention-2 and a "Dynamic YaRN" variant for inference-time scaling make it a leading practical solution for long-context LLM applications, despite two limitations: its hyperparameters may not transfer across models, and attention cost still grows quadratically with sequence length.
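The per-dimension interpolation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `yarn_inv_freq` is hypothetical, and the RoPE base of 10000, the scale factor of 32 (roughly 128K / 4K), and the ramp thresholds `alpha=1` and `beta=32` are assumptions taken from commonly cited LLaMA-family settings, none of which are stated in this summary.

```python
import numpy as np

def yarn_inv_freq(dim, base=10000.0, scale=32.0, orig_ctx=4096,
                  alpha=1.0, beta=32.0):
    """Hypothetical sketch of NTK-by-parts frequency interpolation.

    All default values are assumptions, not figures from the summary.
    """
    # Standard RoPE inverse frequencies, one per pair of dimensions.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # Full rotations each dimension completes over the original context.
    rotations = orig_ctx * inv_freq / (2 * np.pi)
    # Ramp in [0, 1]: 0 means interpolate fully (divide by scale),
    # 1 means leave the frequency untouched.
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # Blend the interpolated and the original frequency per dimension.
    return (1.0 - ramp) * inv_freq / scale + ramp * inv_freq

if __name__ == "__main__":
    freqs = yarn_inv_freq(128)
    print(freqs[0], freqs[-1])  # fastest and slowest dimension
```

Under these assumptions, high-frequency dimensions (many rotations within the original window) keep their frequency, low-frequency dimensions are divided by the scale factor as in Position Interpolation, and dimensions in between are blended linearly.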
Key takeaway
YaRN efficiently extends LLM context windows from 4K to 128K by combining NTK-by-parts interpolation with attention temperature scaling, which keeps attention entropy from growing over long sequences. This approach achieves 99.4% passkey accuracy at 128K with 16x greater training efficiency and minimal short-context performance degradation. It enables practical, cost-effective adaptation of LLaMA-based models for processing long documents without expensive retraining.
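As a rough illustration of the temperature scaling mentioned above, the sketch below computes a scaling factor from the context extension ratio. It assumes the empirical fit sqrt(1/t) = 0.1 * ln(s) + 1 that is usually quoted for LLaMA-family models; this formula is not given in the summary, and the function name `yarn_attention_scale` is hypothetical.

```python
import math

def yarn_attention_scale(scale):
    """Hypothetical helper: factor sqrt(1/t) for a context extension ratio.

    Assumes the fit sqrt(1/t) = 0.1 * ln(scale) + 1, which is not stated
    in the summary above.
    """
    # No extension means no temperature adjustment.
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

if __name__ == "__main__":
    # Extension ratio of 32, roughly 4K -> 128K.
    print(yarn_attention_scale(32.0))
```

One common way to apply such a factor is to multiply the rotary cosine and sine tables by it, so that queries and keys are both scaled and the attention kernel itself needs no change. That is consistent with the FlashAttention-2 compatibility noted in the summary.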
Topics
- YaRN
- Rotary Position Embedding
- Context Window Extension
- Large Language Models
- Attention Mechanisms
Best for: AI Scientist, Research Scientist, NLP Engineer, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.