TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs
Summary
TileFuse is a close-to-metal, mixed-precision kernel library for AMD XDNA2 NPUs that targets the transformer linear layers in quantized Large Language Model (LLM) inference. Client NPUs often lack native support for common quantization formats such as AWQ, which makes deploying LLMs on them difficult. TileFuse enables AWQ-style W4A16 and W8A16 formats (4-bit or 8-bit weights with 16-bit activations) directly on XDNA2 by co-designing weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. The library fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, supports GEMM dimensions up to 32K through an interleaved pre-tiling layout, and optimizes GEMV dataflow for the 4x8 AIE array. In kernel-level evaluations, TileFuse improves GEMM performance by up to 121.6% and GEMV performance by 281% over full-precision baselines, and delivers over 2x performance and energy-efficiency gains on GEMM compared to iGPU baselines. End-to-end LLM experiments on Ryzen AI laptops show up to 2.0x lower prefilling latency and over 64.6% lower energy consumption.
Key takeaway
For AI engineers deploying quantized LLMs on AMD XDNA2 NPUs, TileFuse significantly improves inference efficiency. AWQ-style W4A16 and W8A16 quantization can be used directly, without reshaping models to fit NPU-specific schemes. The reported result is up to 2.0x lower prefilling latency and over 64.6% lower energy consumption on Ryzen AI laptops, which makes XDNA2 a more viable target for edge LLM deployments. If you deploy on XDNA2, consider a fused kernel library of this kind instead of running unpacking, dequantization, and matrix multiplication as separate steps.
Key insights
TileFuse enables efficient AWQ-style quantized LLM inference on AMD XDNA2 NPUs through fused mixed-precision kernels.
Principles
- NPUs need native support for common quantization formats such as AWQ.
- Co-designing data layout, kernels, and dataflow with the hardware improves efficiency.
- Fusing unpacking, dequantization, and matrix multiplication reduces inference overhead.
Method
TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow, fusing unpacking, dequantization, and GEMM/GEMV execution into a single kernel.
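The fusion idea can be illustrated with a minimal reference sketch in plain Python. This is not TileFuse's API or its actual weight layout: the function names, the low-nibble-first packing order, the per-row group metadata, and the dequantization formula `w = (q - zero) * scale` are all assumptions chosen for illustration, and Python floats stand in for 16-bit activations. The point is only that each 4-bit code is unpacked, dequantized, and accumulated in one pass, so a full-precision copy of the weight matrix is never materialized.

```python
# Hypothetical reference sketch (not the TileFuse API): group-wise W4A16 GEMV
# with unpacking, dequantization, and accumulation fused into a single pass.

GROUP_SIZE = 4  # assumed tiny group for illustration; real checkpoints use larger groups


def pack_int4(values):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first (assumed order)."""
    assert len(values) % 2 == 0
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))


def fused_w4a16_gemv(packed_rows, scales, zeros, x, group_size=GROUP_SIZE):
    """Compute y = W @ x where W is stored as packed 4-bit codes.

    packed_rows[r] : bytes holding the 4-bit codes of row r
    scales[r][g]   : scale of group g in row r
    zeros[r][g]    : zero point of group g in row r
    x              : activation vector (Python floats standing in for 16-bit values)
    """
    y = []
    for r, row in enumerate(packed_rows):
        acc = 0.0
        for byte_idx, byte in enumerate(row):
            # Unpack: each byte carries two weight codes.
            for nib, q in enumerate((byte & 0x0F, byte >> 4)):
                k = 2 * byte_idx + nib
                g = k // group_size
                # Dequantize on the fly, then multiply-accumulate immediately.
                w = (q - zeros[r][g]) * scales[r][g]
                acc += w * x[k]
        y.append(acc)
    return y


# One row of eight weights split into two groups of four.
codes = [1, 2, 3, 4, 5, 6, 7, 8]
packed = [pack_int4(codes)]
scales = [[0.5, 2.0]]
zeros = [[0, 4]]
x = [1.0] * 8

print(fused_w4a16_gemv(packed, scales, zeros, x))  # [25.0]
```

An unfused pipeline would first expand `packed` into a full floating-point matrix and then run a separate GEMV over it. Keeping the three steps in one loop avoids writing and re-reading that intermediate matrix, which is the overhead that kernel fusion is meant to remove.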
In practice
- Deploy AWQ-style W4A16/W8A16 on AMD XDNA2.
- Achieve up to 2.0x lower LLM prefilling latency.
- Reduce LLM energy consumption by over 64.6%.
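The figures reported in the summary mix percentage gains, percentage reductions, and "x" factors, which makes them hard to compare at a glance. The short sketch below converts them to a common multiplicative factor. It assumes that "improves performance by p%" means throughput rises to (1 + p/100) times the baseline and that "p% lower" means the quantity falls to (1 - p/100) times the baseline. The helper names are hypothetical and the results are plain arithmetic, not additional measurements.

```python
def gain_to_factor(percent_gain):
    """Convert 'improves by p%' into a multiplicative factor over the baseline."""
    return 1.0 + percent_gain / 100.0


def reduction_to_factor(percent_reduction):
    """Convert 'p% lower' into an 'N times lower' factor."""
    return 1.0 / (1.0 - percent_reduction / 100.0)


print(round(gain_to_factor(121.6), 3))      # 2.216 (GEMM vs. full-precision baseline)
print(round(gain_to_factor(281.0), 2))      # 3.81  (GEMV vs. full-precision baseline)
print(round(reduction_to_factor(64.6), 2))  # 2.82  (energy: 64.6% lower)
print(round(reduction_to_factor(50.0), 2))  # 2.0   (a 2.0x lower latency is a 50% reduction)
```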
Topics
- TileFuse
- AMD XDNA2 NPUs
- Quantized LLM Inference
- Mixed-Precision Kernels
- AWQ Quantization
- Edge LLMs
Best for: AI Engineer, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.