HSTU From Scratch in PyTorch - A Complete Walkthrough
Summary
This article walks through a from-scratch PyTorch implementation of the Hierarchical Sequential Transduction Unit (HSTU) model. It covers the fused (item, action) input layer, the core HSTU block with SiLU attention and Relative Attention Bias (RAB), and multi-task heads for retrieval and rating prediction. The guide uses the MovieLens-1M dataset, mapping ratings to POSITIVE, NEUTRAL, and NEGATIVE actions, with a leave-one-out train/test split. The custom HSTU implementation is benchmarked against the "rectools" library's reference HSTU and SASRec, reporting HR@10 and NDCG@10 scores. The post also covers the M-FALCON inference cache, for which it reports a 210x speedup, and the model is sized to train on a single GPU such as a Colab T4.
Key takeaway
For Machine Learning Engineers building sequential recommender systems, this HSTU implementation is a solid starting point for modeling user behavior sequences. Consider fusing item and action embeddings and using SiLU attention with Relative Attention Bias to make the model more expressive. The M-FALCON inference cache delivers large inference speedups (210x in the article's demonstration), which helps make HSTU viable for production-scale recommendation engines.
Key insights
HSTU fuses item and action embeddings with time-aware attention for robust sequential recommendation.
Principles
- Fuse item and action embeddings via MLP for richer interaction signals.
- Incorporate relative position and log-spaced time biases in attention.
- Replace softmax with pointwise SiLU in attention so that engagement intensity is preserved rather than normalized away.
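The three principles above can be sketched together in a few lines of PyTorch. This is a minimal single-head illustration, not the article's code: the class names follow those listed under Method, but the constructor arguments, the `silu_attention` helper, the number of time buckets, and the natural-log bucketing are assumptions made for the example. The gating and normalization of a full HSTU block are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedInputEmbedding(nn.Module):
    """Concatenate item and action embeddings, then mix them with a small MLP."""

    def __init__(self, num_items: int, num_actions: int, dim: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.action_emb = nn.Embedding(num_actions, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, items: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.item_emb(items), self.action_emb(actions)], dim=-1)
        return self.mlp(fused)  # (batch, seq_len, dim)


class RelativeAttentionBias(nn.Module):
    """Learned bias per relative position plus a bias per log-spaced time gap."""

    def __init__(self, max_len: int, num_time_buckets: int = 32):  # 32 is assumed
        super().__init__()
        self.max_len = max_len
        self.num_time_buckets = num_time_buckets
        self.pos_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.time_bias = nn.Parameter(torch.zeros(num_time_buckets))

    def forward(self, timestamps: torch.Tensor) -> torch.Tensor:
        # timestamps: (batch, seq_len), e.g. Unix seconds; seq_len <= max_len.
        n = timestamps.size(1)
        idx = torch.arange(n, device=timestamps.device)
        rel_pos = idx[:, None] - idx[None, :] + self.max_len - 1  # (n, n), >= 0
        dt = (timestamps[:, :, None] - timestamps[:, None, :]).abs().float()
        # Log-spaced buckets: bucket index grows with log(1 + gap).
        bucket = torch.log1p(dt).long().clamp(max=self.num_time_buckets - 1)
        return self.pos_bias[rel_pos] + self.time_bias[bucket]  # (batch, n, n)


def silu_attention(q, k, v, bias):
    """Causal attention with pointwise SiLU instead of a row-wise softmax."""
    n = q.size(1)
    scores = F.silu(q @ k.transpose(-1, -2) + bias) / n
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    # With SiLU there is no normalization, so future positions are zeroed
    # after the activation rather than set to -inf before it.
    scores = scores.masked_fill(~causal, 0.0)
    return scores @ v


# Usage with D=64 and three actions (NEGATIVE, NEUTRAL, POSITIVE).
emb = FusedInputEmbedding(num_items=1000, num_actions=3, dim=64)
rab = RelativeAttentionBias(max_len=50)
items = torch.randint(0, 1000, (2, 10))
actions = torch.randint(0, 3, (2, 10))
timestamps = torch.cumsum(torch.randint(1, 3600, (2, 10)), dim=1)
x = emb(items, actions)
out = silu_attention(x, x, x, rab(timestamps))  # shape (2, 10, 64)
```

Because SiLU is applied elementwise, attention weights in a row do not have to sum to one, so a user with many strongly matching past interactions yields a larger aggregated signal than a user with few.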
Method
Process MovieLens-1M ratings into (item, action, time) triples, mapping ratings to POSITIVE/NEUTRAL/NEGATIVE actions. Implement FusedInputEmbedding, RelativeAttentionBias, HSTUBlock, and HSTUEncoder in PyTorch. Train with multi-task retrieval and rating heads.
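The data preparation step can be illustrated with plain Python. The rating thresholds below (4 to 5 as POSITIVE, 3 as NEUTRAL, 1 to 2 as NEGATIVE) are an assumption, since this summary does not state the cut-offs used in the article; the function names and the integer action ids are likewise hypothetical.

```python
from collections import defaultdict

# Assumed action ids; the article's actual encoding may differ.
ACTIONS = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2}


def rating_to_action(rating: int) -> int:
    """Map a 1-5 star rating to an action id (assumed thresholds)."""
    if rating >= 4:
        return ACTIONS["POSITIVE"]
    if rating == 3:
        return ACTIONS["NEUTRAL"]
    return ACTIONS["NEGATIVE"]


def build_sequences(rows):
    """Group (user, item, rating, timestamp) rows into time-ordered triples."""
    by_user = defaultdict(list)
    for user, item, rating, ts in rows:
        by_user[user].append((item, rating_to_action(rating), ts))
    for seq in by_user.values():
        seq.sort(key=lambda triple: triple[2])  # oldest interaction first
    return dict(by_user)


def leave_one_out(sequences):
    """Hold out each user's most recent interaction as the test target."""
    train, test = {}, {}
    for user, seq in sequences.items():
        if len(seq) < 2:
            continue  # nothing left to train on for this user
        train[user] = seq[:-1]
        test[user] = seq[-1]
    return train, test


rows = [(1, 10, 5, 300), (1, 11, 3, 100), (1, 12, 1, 200), (2, 10, 4, 50)]
train, test = leave_one_out(build_sequences(rows))
```

In this toy input, user 2 has a single interaction and is skipped, while user 1's latest interaction (item 10) becomes the held-out target.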
In practice
- Train HSTU on MovieLens-1M with D=64, 2 layers on a Colab T4.
- Benchmark custom HSTU against "rectools" HSTU and SASRec.
- Implement M-FALCON for 210x inference speedup.
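For the benchmark step, HR@10 and NDCG@10 under a leave-one-out split reduce to simple formulas, because each user has exactly one held-out item. The sketch below uses the standard definitions; the function names and the dictionary-based inputs are assumptions for illustration and are not taken from the article or from "rectools".

```python
import math


def hit_rate_at_k(ranked_items, target, k=10):
    """1.0 if the held-out item appears in the top-k, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 2)."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)  # 0-based position in the ranking
    return 1.0 / math.log2(rank + 2)


def evaluate(rankings, targets, k=10):
    """Average both metrics over users; inputs are dicts keyed by user id."""
    users = list(targets)
    hr = sum(hit_rate_at_k(rankings[u], targets[u], k) for u in users) / len(users)
    ndcg = sum(ndcg_at_k(rankings[u], targets[u], k) for u in users) / len(users)
    return hr, ndcg


rankings = {"u1": [5, 3, 9], "u2": [4, 8, 1]}
targets = {"u1": 3, "u2": 7}
hr, ndcg = evaluate(rankings, targets, k=10)
```

A hit at the top position scores 1.0 on both metrics, while a hit lower in the list still counts fully for HR@10 but is discounted logarithmically in NDCG@10.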
Topics
- HSTU
- PyTorch
- Recommender Systems
- Sequential Recommendation
- Attention Mechanisms
- M-FALCON Inference
Best for: Machine Learning Engineer, AI Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by MLWhiz: Recs|ML|GenAI.