Qwen Team Releases FlashQLA: a High-Performance Linear Attention Kernel Library That Achieves Up to 3× Speedup

· Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

The Qwen Team has released FlashQLA, a new high-performance linear attention kernel library designed to accelerate GDN (Gated Delta Network) Chunked Prefill, the linear attention mechanism used in Qwen3.5 and Qwen3.6 model families. Benchmarked against FLA 0.5.0, Triton 3.5.1, and FlashInfer 0.6.9 on NVIDIA Hopper (H200) GPUs, FlashQLA achieves a 2-3x speedup for forward passes and a 2x speedup for backward passes over the FLA Triton kernel. Its performance gains stem from three key optimizations: gate-driven automatic intra-card context parallelism, hardware-friendly algebraic reformulation to reduce overhead, and TileLang fused warp-specialized kernels that overlap data movement and computation.
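
To make the workload concrete, here is a minimal pure-Python sketch of the gated delta rule as published for Gated DeltaNet, for a single head. It is an assumption that this simplified form matches what Qwen's GDN layers compute (real implementations add normalization and scaling), and it is not FlashQLA code. The names `gdn_step` and `gdn_prefill` are hypothetical. The sketch shows only the property that chunked prefill relies on: the sole thing carried from one chunk to the next is a fixed-size state matrix. It does not show the intra-chunk parallel form that GPU kernels use.

```python
def gdn_step(S, q_t, k_t, v_t, a_t, b_t):
    """One gated delta rule step (simplified, single head).

    S has shape (d_v, d_k). a_t is the decay gate, b_t the write strength.
    Update: S_t = a_t * S_{t-1} + b_t * (v_t - a_t * S_{t-1} k_t) k_t^T
    Output: o_t = S_t q_t
    """
    d_v, d_k = len(S), len(S[0])
    # What the decayed state currently predicts for key k_t.
    pred = [a_t * sum(S[i][j] * k_t[j] for j in range(d_k)) for i in range(d_v)]
    # Decay the old state, then correct the value stored under k_t.
    S = [[a_t * S[i][j] + b_t * (v_t[i] - pred[i]) * k_t[j] for j in range(d_k)]
         for i in range(d_v)]
    o_t = [sum(S[i][j] * q_t[j] for j in range(d_k)) for i in range(d_v)]
    return S, o_t


def gdn_prefill(tokens, d_k, d_v, chunk_size=None):
    """Run the recurrence over all tokens, chunk by chunk.

    tokens is a list of (q_t, k_t, v_t, a_t, b_t) tuples. Only the state S
    crosses a chunk boundary, so its size does not grow with sequence length.
    """
    S = [[0.0] * d_k for _ in range(d_v)]
    chunk_size = chunk_size or max(len(tokens), 1)
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        for q_t, k_t, v_t, a_t, b_t in tokens[start:start + chunk_size]:
            S, o_t = gdn_step(S, q_t, k_t, v_t, a_t, b_t)
            outputs.append(o_t)
    return outputs, S
```

Running `gdn_prefill` with any `chunk_size` yields the same outputs as a single pass, because the chunks differ only in where the loop is cut. A production kernel replaces the inner token loop with matrix operations over the whole chunk, which is where libraries such as FLA and FlashQLA spend their optimization effort.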

Key takeaway

For AI engineers running Qwen3.5 or Qwen3.6 models on NVIDIA Hopper GPUs, FlashQLA is worth evaluating: relative to the FLA Triton kernel, it reports 2-3x faster forward passes and 2x faster backward passes for GDN Chunked Prefill. These are kernel-level figures, so the end-to-end gain in inference or training time depends on how much of the workload is spent in linear attention. Context parallelism is applied automatically, so it requires no manual configuration.

Key insights

FlashQLA speeds up GDN linear attention for Qwen3.5 and Qwen3.6 by 2-3x in the forward pass and 2x in the backward pass over the FLA Triton kernel on H200, using kernels specialized for NVIDIA Hopper.

Method

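FlashQLA combines three optimizations for GDN Chunked Prefill: gate-driven automatic intra-card context parallelism, a hardware-friendly algebraic reformulation that reduces overhead, and fused warp-specialized kernels written in TileLang that overlap data movement with computation.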
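
The source does not spell out which algebraic reformulation FlashQLA applies, so the following is only a generic illustration of the idea, using a well-known identity for the gated delta rule state update. The function names are hypothetical. The explicit form builds a d_k × d_k transition matrix and multiplies the state by it, while the rewritten form reaches the same result with one matrix-vector product and one outer product.

```python
def matmul(A, B):
    """Plain matrix product for small list-of-lists matrices."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def step_explicit(S, k, v, a, b):
    """S_new = S @ (a * (I - b k k^T)) + b v k^T, forming the transition matrix.

    Cost is O(d_v * d_k^2) because of the matrix-matrix product.
    """
    d_k = len(k)
    M = [[a * ((1.0 if i == j else 0.0) - b * k[i] * k[j]) for j in range(d_k)]
         for i in range(d_k)]
    SM = matmul(S, M)
    return [[SM[i][j] + b * v[i] * k[j] for j in range(d_k)] for i in range(len(v))]


def step_rank1(S, k, v, a, b):
    """Same update rewritten as S_new = a * S + b * (v - a * S k) k^T.

    Cost is O(d_v * d_k): one matrix-vector product and one outer product.
    """
    d_k = len(k)
    Sk = [sum(row[j] * k[j] for j in range(d_k)) for row in S]
    return [[a * S[i][j] + b * (v[i] - a * Sk[i]) * k[j] for j in range(d_k)]
            for i in range(len(v))]
```

The two functions agree up to floating-point rounding, since a * S (I - b k k^T) expands to a * S - a * b * (S k) k^T. Rewrites of this kind change the operation count and memory traffic without changing the result, which is the general sense in which a reformulation can be called hardware-friendly.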

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.