Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup
Summary
The Qwen Team has released FlashQLA, a high-performance linear attention kernel library that accelerates GDN (Gated Delta Network) Chunked Prefill, the linear attention mechanism used in the Qwen3.5 and Qwen3.6 model families. Benchmarked against FLA 0.5.0, Triton 3.5.1, and FlashInfer 0.6.9 on NVIDIA Hopper (H200) GPUs, FlashQLA achieves a 2-3x speedup on forward passes and a 2x speedup on backward passes over the FLA Triton kernel. The gains come from three optimizations: gate-driven automatic intra-card context parallelism, a hardware-friendly algebraic reformulation that reduces overhead, and fused warp-specialized kernels written in TileLang that overlap data movement with computation.
Key takeaway
AI engineers running Qwen3.5 or Qwen3.6 models on NVIDIA Hopper GPUs can use FlashQLA to cut the time spent in GDN linear attention: forward passes run up to 3x faster and backward passes about 2x faster than with the FLA Triton kernel. These are kernel-level figures, so the end-to-end gain in inference or training time depends on how much of the workload is linear attention. The context parallelism is applied automatically rather than configured by hand.
Key insights
FlashQLA speeds up GDN linear attention for Qwen3.5 and Qwen3.6 by 2-3x on forward passes and 2x on backward passes over the FLA Triton kernel on H200 GPUs, using three kernel-level optimizations.
Principles
- Exploit gate properties for parallelism
- Reformulate algebra for hardware efficiency
- Specialize warps so data movement overlaps with computation
Method
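FlashQLA optimizes GDN Chunked Prefill with three techniques: gate-driven automatic intra-card context parallelism, an algebraic reformulation that is friendlier to the hardware, and fused warp-specialized kernels written in TileLang that overlap data movement with computation.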
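To make the idea of context parallelism concrete, here is a minimal NumPy sketch. It is not FlashQLA's code or API. It assumes the gated delta rule recurrence from the Gated DeltaNet literature, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t, and every function name in it (gdn_step, gdn_sequential, segment_summary, gdn_context_parallel) is hypothetical. It shows only that a sequence can be split into segments whose summaries are computed independently and then stitched together with a short scan. How FlashQLA uses the gate to choose and schedule that split is not covered by this summary.

```python
import numpy as np


def gdn_step(S, q, k, v, alpha, beta):
    """One gated delta rule step (assumed form); S has shape (d_v, d_k)."""
    # Erase the state's component along k, decay by the gate, then write v at k.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q


def gdn_sequential(q, k, v, alpha, beta, S0=None):
    """Token-by-token reference: returns outputs (T, d_v) and the final state."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k)) if S0 is None else S0.copy()
    out = np.empty((T, d_v))
    for t in range(T):
        S, out[t] = gdn_step(S, q[t], k[t], v[t], alpha[t], beta[t])
    return out, S


def segment_summary(k, v, alpha, beta):
    """Summarise a segment as an affine map: S_out = S_in @ P + L."""
    d_k, d_v = k.shape[1], v.shape[1]
    P = np.eye(d_k)            # product of the per-token transition matrices
    L = np.zeros((d_v, d_k))   # state the segment would reach from a zero state
    for t in range(k.shape[0]):
        A = alpha[t] * (np.eye(d_k) - beta[t] * np.outer(k[t], k[t]))
        P = P @ A
        L = L @ A + beta[t] * np.outer(v[t], k[t])
    return P, L


def gdn_context_parallel(q, k, v, alpha, beta, num_segments):
    """Same result as gdn_sequential, organised as parallelisable phases."""
    T, d_k = k.shape
    d_v = v.shape[1]
    bounds = np.linspace(0, T, num_segments + 1, dtype=int)
    spans = list(zip(bounds[:-1], bounds[1:]))

    # Phase 1: per-segment summaries, independent of each other.
    summaries = [segment_summary(k[a:b], v[a:b], alpha[a:b], beta[a:b])
                 for a, b in spans]

    # Phase 2: short sequential scan over segments to get each incoming state.
    S = np.zeros((d_v, d_k))
    incoming = []
    for P, L in summaries:
        incoming.append(S)
        S = S @ P + L

    # Phase 3: per-segment outputs from the incoming state, independent again.
    outs = [gdn_sequential(q[a:b], k[a:b], v[a:b], alpha[a:b], beta[a:b], S0=S_in)[0]
            for (a, b), S_in in zip(spans, incoming)]
    return np.concatenate(outs), S


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_k, d_v = 64, 8, 8
    q = rng.standard_normal((T, d_k))
    k = rng.standard_normal((T, d_k))
    k /= np.linalg.norm(k, axis=1, keepdims=True)  # unit-norm keys keep the update stable
    v = rng.standard_normal((T, d_v))
    alpha = rng.uniform(0.8, 1.0, T)  # gate (decay) values
    beta = rng.uniform(0.0, 1.0, T)   # write strengths

    ref_out, _ = gdn_sequential(q, k, v, alpha, beta)
    par_out, _ = gdn_context_parallel(q, k, v, alpha, beta, num_segments=4)
    print("max abs difference:", np.abs(ref_out - par_out).max())
```

Phases 1 and 3 have no cross-segment dependencies, which is what lets a single device work on several parts of one long sequence at once. Only the short phase 2 scan is sequential. A production kernel would use a chunked matrix formulation rather than per-token loops.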
In practice
- Use FlashQLA for the GDN layers when running Qwen3.5/3.6 inference
- Run it on NVIDIA Hopper GPUs, where the reported speedups were measured
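Since the reported numbers come from Hopper hardware, a deployment script might want to check the GPU generation before expecting them. The helper below is a small sketch, not part of FlashQLA; the name is_hopper is hypothetical. It relies only on the fact that Hopper GPUs such as the H100 and H200 report CUDA compute capability 9.0.

```python
def is_hopper(capability):
    """Return True if a (major, minor) CUDA compute capability is Hopper (sm_90)."""
    major, _minor = capability
    return major == 9


if __name__ == "__main__":
    # Example capabilities: Hopper, Ada, Ampere.
    for cap in [(9, 0), (8, 9), (8, 0)]:
        print(cap, "Hopper" if is_hopper(cap) else "not Hopper")
```

With PyTorch installed and a GPU present, torch.cuda.get_device_capability() returns a tuple in this form and can be passed directly to the helper.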
Topics
- FlashQLA
- Linear Attention
- Kernel Library
- NVIDIA Hopper
- GDN Chunked Prefill
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.