TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

2026-06-09 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, AI Hardware Optimization · Depth: Expert, quick

Summary

TileFuse is a close-to-metal, mixed-precision kernel library for AMD XDNA2 NPUs that targets the linear layers of transformers in quantized large language model (LLM) inference. It addresses a deployment gap: client NPUs often lack native support for common quantization formats such as AWQ. TileFuse enables AWQ-style W4A16 and W8A16 formats (4-bit or 8-bit weights with 16-bit activations) directly on XDNA2 by co-designing weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. The library fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, supports GEMM dimensions up to 32K with an interleaved pre-tiling layout, and optimizes GEMV dataflow for the 4x8 AIE array. In kernel-level evaluations, TileFuse improves GEMM performance by up to 121.6% and GEMV performance by 281% over full-precision baselines, and delivers over 2x performance and energy-efficiency gains over iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, it achieves up to 2.0x lower prefilling latency and over 64.6% lower energy consumption.

Key takeaway

For AI engineers deploying quantized LLMs on AMD XDNA2 NPUs, TileFuse improves inference efficiency. You can use AWQ-style W4A16 and W8A16 quantization directly, without reshaping models to fit NPU-specific schemes. The reported results are up to 2.0x lower prefilling latency and over 64.6% lower energy consumption on Ryzen AI laptops, which makes XDNA2 a more viable target for edge LLM deployments. Consider integrating fused kernel libraries of this kind to maximize NPU performance.

Key insights

TileFuse enables efficient AWQ-style quantized LLM inference on AMD XDNA2 NPUs through fused mixed-precision kernels.

Principles

NPUs need native support for common quantization formats.
Co-designing software with the hardware improves efficiency.
Fusing operations reduces inference overhead.
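
A rough way to see why fusion reduces overhead is to count the weight bytes moved for one linear layer. The sketch below is a simplified model, not a measurement from the article. It assumes a group size of 128, one fp16 scale and one 4-bit zero point per group, and an unfused path that writes the dequantized fp16 weights to memory and reads them back before the GEMM. All function names are hypothetical.

```python
def quantized_weight_bytes(n, k, bits, group=128, scale_bytes=2, zero_bits=4):
    """Bytes to store an (n, k) group-quantized weight matrix plus metadata.

    group, scale_bytes and zero_bits are assumptions for illustration.
    """
    payload = n * k * bits // 8           # packed low-bit weights
    groups = n * (k // group)             # one scale and zero point per group
    metadata = groups * scale_bytes + groups * zero_bits // 8
    return payload + metadata


def fp16_weight_bytes(n, k):
    """Bytes for the same matrix stored in fp16."""
    return n * k * 2


def traffic_unfused(n, k, bits):
    """Assumed unfused flow: read packed weights, write an fp16 copy, read it back."""
    return quantized_weight_bytes(n, k, bits) + 2 * fp16_weight_bytes(n, k)


def traffic_fused(n, k, bits):
    """Assumed fused flow: packed weights are read once and dequantized on the fly."""
    return quantized_weight_bytes(n, k, bits)


if __name__ == "__main__":
    n = k = 4096
    for bits in (4, 8):
        fused = traffic_fused(n, k, bits)
        unfused = traffic_unfused(n, k, bits)
        print(f"W{bits}A16: fused {fused} B, unfused {unfused} B, "
              f"ratio {unfused / fused:.2f}")
```

Under these assumptions the fused path moves only the packed weights and their metadata, which is the overhead the principle refers to. Real kernels also move activations and outputs, which this model ignores.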

Method

TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow, fusing unpacking, dequantization, and GEMM/GEMV execution into a single kernel.
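
To make the fused flow concrete, here is a minimal NumPy sketch of a W4A16 GEMV that unpacks, dequantizes, and multiplies one K-group at a time, so the full-precision weight matrix is never materialized. It is illustrative only. The packing order (two 4-bit values per byte, low nibble first), the group size of 128, the asymmetric scale and zero-point scheme, and all function names are assumptions. It does not reproduce TileFuse's interleaved pre-tiling layout or its AIE array dataflow.

```python
import numpy as np

GROUP = 128  # assumed quantization group size along K (not stated in the source)


def quantize_w4(w, group=GROUP):
    """Asymmetric 4-bit group-wise quantization of an (N, K) float matrix."""
    n, k = w.shape
    g = w.reshape(n, k // group, group)
    lo = g.min(axis=2, keepdims=True)
    hi = g.max(axis=2, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0
    zero = np.clip(np.round(-lo / scale), 0, 15)
    q = np.clip(np.round(g / scale + zero), 0, 15).astype(np.uint8).reshape(n, k)
    # Pack two 4-bit codes per byte: even column in the low nibble.
    packed = (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)
    return packed, scale.squeeze(2).astype(np.float16), zero.squeeze(2).astype(np.uint8)


def unpack_w4(packed):
    """Inverse of the packing step: (N, K/2) bytes -> (N, K) 4-bit codes."""
    n, kb = packed.shape
    q = np.empty((n, kb * 2), dtype=np.uint8)
    q[:, 0::2] = packed & 0x0F
    q[:, 1::2] = packed >> 4
    return q


def gemv_unfused(packed, scale, zero, x, group=GROUP):
    """Reference: dequantize the whole matrix first, then multiply."""
    q = unpack_w4(packed).astype(np.float32)
    s = np.repeat(scale.astype(np.float32), group, axis=1)
    z = np.repeat(zero.astype(np.float32), group, axis=1)
    w = (q - z) * s  # full dequantized matrix is materialized here
    return w @ x.astype(np.float32)


def gemv_fused(packed, scale, zero, x, group=GROUP):
    """Fused flow: unpack, dequantize, and accumulate one K-group at a time."""
    n, kb = packed.shape
    k = kb * 2
    xf = x.astype(np.float32)
    acc = np.zeros(n, dtype=np.float32)
    for gi in range(k // group):
        k0 = gi * group
        tile = packed[:, k0 // 2:(k0 + group) // 2]   # packed bytes for this group
        q = unpack_w4(tile).astype(np.float32)
        s = scale[:, gi:gi + 1].astype(np.float32)
        z = zero[:, gi:gi + 1].astype(np.float32)
        acc += ((q - z) * s) @ xf[k0:k0 + group]      # dequantize and multiply in place
    return acc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 512)).astype(np.float32)
    x = rng.standard_normal(512).astype(np.float16)   # 16-bit activations
    packed, scale, zero = quantize_w4(w)
    y_ref = gemv_unfused(packed, scale, zero, x)
    y = gemv_fused(packed, scale, zero, x)
    print("max abs difference:", float(np.max(np.abs(y - y_ref))))
```

The two paths compute the same result up to floating-point accumulation order. The difference is that the fused version only ever holds one dequantized group in working memory, which is the property a fused on-device kernel relies on.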

In practice

Deploy AWQ-style W4A16/W8A16 on AMD XDNA2.
Achieve up to 2.0x lower LLM prefilling latency.
Reduce LLM energy consumption by over 64.6%.
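
The reported figures mix percentage improvements, "x lower" factors, and percentage reductions, so it can help to put them on one scale. The helpers below are a small sketch with hypothetical names. They assume that "improves performance by P%" means throughput is (1 + P/100) times the baseline, and that "F x lower latency" means latency is 1/F of the baseline. The source does not define these terms precisely.

```python
def speedup_from_improvement(pct):
    """Throughput factor implied by a P% performance improvement (assumed meaning)."""
    return 1.0 + pct / 100.0


def reduction_from_factor(factor):
    """Fractional reduction implied by an 'F x lower' claim."""
    return 1.0 - 1.0 / factor


def factor_from_reduction(pct):
    """'x lower' factor implied by a P% reduction."""
    return 1.0 / (1.0 - pct / 100.0)


if __name__ == "__main__":
    print("GEMM +121.6%  ->", round(speedup_from_improvement(121.6), 3), "x")
    print("GEMV +281%    ->", round(speedup_from_improvement(281.0), 3), "x")
    print("2.0x latency  ->", round(100 * reduction_from_factor(2.0), 1), "% lower")
    print("-64.6% energy ->", round(factor_from_reduction(64.6), 2), "x lower")
```

Read this way, a 64.6% energy reduction corresponds to roughly 2.8x lower energy, and 2.0x lower prefilling latency corresponds to a 50% reduction.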

Topics

TileFuse
AMD XDNA2 NPUs
Quantized LLM Inference
Mixed-Precision Kernels
AWQ Quantization
Edge LLMs

Best for: AI Engineer, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.