HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

2026-05-31 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

HASTE is a hardware-aware dynamic sparse training method that targets the memory-compute bottleneck in extreme multi-label classification (XMC) models, whose output layers often span millions of labels. It uses group-shared fixed fan-in sparsity, a semi-structured output-layer design in which semantically related labels share one sparse input pattern but keep independent weights. Sharing the pattern increases feature reuse and lets custom CUDA kernels map the layer onto modern accelerator primitives for efficient GPU execution. HASTE also decomposes the output layer into a small dense head for frequent labels and a group-shared sparse tail for the rest, which provides an informative gradient pathway. In microbenchmarks, HASTE achieves up to 4.4× speedup in the forward pass and up to 25× in the backward pass over standard fixed fan-in sparsity, while maintaining precision@k on large-scale XMC benchmarks.

Key takeaway

For machine learning engineers optimizing extreme multi-label classification models, HASTE offers a substantial wall-clock improvement. Consider implementing group-shared fixed fan-in sparsity and a dense head/sparse tail decomposition to reduce the memory-compute bottleneck of the output layer. In the reported microbenchmarks, this yields up to 25× faster backward passes than standard fixed fan-in sparsity while holding precision@k, narrowing the gap to dense models without auxiliary objectives.

Key insights

HASTE optimizes XMC output layers via group-shared sparsity and a dense/sparse decomposition for hardware-aware speedups.

Principles

Group-shared sparsity improves feature reuse.
A dense head plus sparse tail decomposition keeps an informative gradient pathway.
Hardware-aware kernels yield practical speedups.
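
To make the first principle concrete, the sketch below counts what each output-layer layout has to store. It is back-of-the-envelope arithmetic, not a figure from the source: the function name `output_layer_storage` and the sizes (one million labels, 768 input features, fan-in 32, groups of 16) are hypothetical values chosen for illustration. The point it shows is that sharing one index pattern per group leaves the number of weights unchanged but divides the number of stored indices, and the number of distinct input gathers, by the group size.

```python
def output_layer_storage(num_labels, in_dim, fan_in, group_size):
    """Count stored weights and input indices for three output-layer layouts."""
    if num_labels % group_size != 0:
        raise ValueError("this sketch assumes group_size divides num_labels")
    num_groups = num_labels // group_size
    return {
        # Dense: every label connects to every input feature.
        "dense": {"weights": num_labels * in_dim, "indices": 0},
        # Fixed fan-in: each label has its own fan_in indices and weights.
        "fixed_fan_in": {"weights": num_labels * fan_in,
                         "indices": num_labels * fan_in},
        # Group-shared: weights stay per label, indices are stored per group.
        "group_shared": {"weights": num_labels * fan_in,
                         "indices": num_groups * fan_in},
    }


# Hypothetical sizes, not taken from the source.
counts = output_layer_storage(num_labels=1_000_000, in_dim=768,
                              fan_in=32, group_size=16)
for layout, c in counts.items():
    print(f"{layout:>13}: {c['weights']:>11,} weights, {c['indices']:>10,} indices")
```

Under these assumed sizes the group-shared layout stores 16 times fewer indices than per-label fixed fan-in, which is the reuse the principle refers to.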

Method

HASTE employs group-shared fixed fan-in sparsity with independent weights, custom CUDA kernels, and a dense head/sparse tail decomposition for XMC output layers.
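
As a rough illustration of group-shared fixed fan-in sparsity with independent weights, the following pure-Python sketch computes the logits of a toy layer. It is a reference-style sketch under stated assumptions, not the HASTE CUDA kernels: the helper names (`init_group_shared`, `forward`, `to_dense`), the random index patterns, and the toy sizes are all hypothetical, and the dynamic part of training (updating the patterns over time) is omitted. Each group gathers its input features once and reuses them for every label in the group.

```python
import random


def init_group_shared(num_labels, in_dim, fan_in, group_size, seed=0):
    """Build a random group-shared fixed fan-in layer (illustrative layout)."""
    if num_labels % group_size != 0:
        raise ValueError("this sketch assumes group_size divides num_labels")
    rng = random.Random(seed)
    num_groups = num_labels // group_size
    # One pattern of exactly fan_in distinct input features per group.
    indices = [sorted(rng.sample(range(in_dim), fan_in))
               for _ in range(num_groups)]
    # Independent weights: one row of fan_in values per label.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)]
               for _ in range(num_labels)]
    return indices, weights


def forward(x, indices, weights, group_size):
    """Compute one logit per label for a single input vector x."""
    logits = []
    for g, idx in enumerate(indices):
        gathered = [x[j] for j in idx]  # gather once, reuse across the group
        for label in range(g * group_size, (g + 1) * group_size):
            logits.append(sum(w * v for w, v in zip(weights[label], gathered)))
    return logits


def to_dense(indices, weights, in_dim, group_size):
    """Expand to an equivalent dense matrix, used only to check the sketch."""
    dense = [[0.0] * in_dim for _ in weights]
    for label, row in enumerate(weights):
        for j, w in zip(indices[label // group_size], row):
            dense[label][j] = w
    return dense


# Toy sizes, chosen only so the example runs instantly.
in_dim, fan_in, group_size, num_labels = 16, 4, 3, 12
indices, weights = init_group_shared(num_labels, in_dim, fan_in, group_size)
rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]

sparse_logits = forward(x, indices, weights, group_size)
dense_matrix = to_dense(indices, weights, in_dim, group_size)
dense_logits = [sum(w * v for w, v in zip(row, x)) for row in dense_matrix]
max_diff = max(abs(a - b) for a, b in zip(sparse_logits, dense_logits))
print(max_diff < 1e-9)
```

The comparison against the expanded dense matrix only confirms that the sparse layout computes the same logits. Because every label in a group reads the same gathered features, the inner loop is effectively a small dense matrix-vector product per group, which is the kind of regular structure that GPU kernels can exploit.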

In practice

Apply group-shared sparsity to XMC.
Implement custom CUDA kernels.
Decompose output layer by label frequency.
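
The frequency-based decomposition in the steps above can be sketched as follows. This is a minimal example, assuming per-label training counts are available. The function names `split_head_tail` and `chunk_into_groups`, the head size, and the toy counts are hypothetical. In particular, the summary says tail groups contain semantically related labels but does not give the grouping criterion, so the consecutive chunking here is only a placeholder.

```python
def split_head_tail(label_counts, head_size):
    """Return (head_ids, tail_ids): the head_size most frequent labels, then the rest."""
    # Sort by descending count; ties are broken by label id for determinism.
    order = sorted(range(len(label_counts)), key=lambda i: (-label_counts[i], i))
    return order[:head_size], order[head_size:]


def chunk_into_groups(label_ids, group_size):
    """Placeholder grouping into consecutive chunks (not a semantic grouping)."""
    return [label_ids[i:i + group_size]
            for i in range(0, len(label_ids), group_size)]


# Toy training-set counts for 8 labels (made-up numbers).
label_counts = [5, 120, 3, 48, 1, 1, 300, 2]

head_ids, tail_ids = split_head_tail(label_counts, head_size=2)
tail_groups = chunk_into_groups(tail_ids, group_size=3)

print("dense head labels:", head_ids)
print("sparse tail groups:", tail_groups)
```

In a full model, the head labels would be scored by a small dense layer and each tail group by a group-shared sparse layer, with the logits scattered back to their original label ids.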

Topics

Extreme Multi-label Classification
Sparse Training
GPU Acceleration
CUDA Kernels
Output Layer Optimization
Hardware-Aware AI

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.