YaRN: Extending RoPE Without Breaking It

Source: NLP on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Advanced, long

Summary

YaRN (Yet another RoPE extensioN) is a compute-efficient method for extending the context window of pre-trained large language models that use rotary position embeddings (RoPE), for example taking LLaMA 2 from 4,096 to 128,000 tokens with far less fine-tuning than earlier approaches. It builds on and unifies prior techniques such as Position Interpolation and NTK-aware scaling. Two pieces do the work: "NTK-by-parts" interpolation, which scales each RoPE dimension according to its frequency, and a temperature applied to the attention logits, which counteracts the shift in attention entropy at long sequence lengths. The reported experiments show 99.4% passkey retrieval accuracy at 128K context, preserved short-context performance, and up to 16x greater training efficiency than other methods. Compatibility with Flash Attention 2 and a "Dynamic YaRN" variant that adjusts the scale at inference time make it a practical option for long-context LLM applications. Its limitations are that hyperparameters may not transfer across models and that attention cost remains quadratic in sequence length.
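To make the two mechanisms in the summary concrete, here is a minimal sketch in plain Python. It is not taken from the original article. The function names are hypothetical, and the ramp bounds (alpha=1, beta=32) and the 0.1·ln(s)+1 attention factor are assumptions based on the values commonly cited for LLaMA-family models; check them against the YaRN paper before relying on them.

```python
import math

def yarn_inv_freq(head_dim, scale, orig_ctx=4096, base=10000.0,
                  alpha=1.0, beta=32.0):
    """NTK-by-parts sketch: blend interpolated and original RoPE frequencies.

    alpha and beta are assumed ramp bounds, not values stated in this summary.
    """
    freqs = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)              # original RoPE frequency
        # Full rotations this dimension completes over the original context.
        rotations = orig_ctx * theta / (2 * math.pi)
        # Ramp: 0 -> fully interpolate (divide by scale), 1 -> leave untouched.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - gamma) * theta / scale + gamma * theta)
    return freqs

def yarn_attention_factor(scale):
    """Assumed fit for sqrt(1/t); multiplies the rotary cos/sin of q and k."""
    return 0.1 * math.log(scale) + 1.0 if scale > 1 else 1.0

def dynamic_scale(current_len, orig_ctx=4096):
    """Dynamic variant sketch: grow the scale only once the context is exceeded."""
    return max(1.0, current_len / orig_ctx)

# Example: 4,096 -> 131,072 tokens corresponds to scale = 32.
freqs = yarn_inv_freq(head_dim=128, scale=32.0)
factor = yarn_attention_factor(32.0)
```

High-frequency dimensions (many rotations within the original context) keep their frequency, while low-frequency dimensions are divided by the scale, which is the per-dimension behaviour the summary describes. Because the attention factor can be folded into the rotary embeddings rather than the softmax, the attention kernel itself does not need to change, which is consistent with the Flash Attention 2 compatibility noted above.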

Key takeaway

YaRN extends LLM context windows from 4K to 128K by combining NTK-by-parts interpolation with attention temperature scaling, which compensates for the change in the attention distribution over long sequences. The reported result is 99.4% passkey accuracy at 128K, 16x greater training efficiency, and minimal short-context degradation. This makes it practical and cost-effective to adapt LLaMA-based models to long documents without expensive retraining.

Best for: AI Scientist, Research Scientist, NLP Engineer, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.