HSTU From Scratch in PyTorch - A Complete Walkthrough

2026-05-28 · Source: MLWhiz: Recs|ML|GenAI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, medium

Summary

This article is a PyTorch walkthrough that implements the Hierarchical Sequential Transduction Unit (HSTU) model from scratch. It covers the fused (item, action) input layer, the core HSTU block with SiLU attention and Relative Attention Bias (RAB), and multi-task heads for retrieval and rating prediction. The guide uses the MovieLens-1M dataset, maps ratings to POSITIVE, NEUTRAL, and NEGATIVE actions, and applies a leave-one-out train/test split. The custom HSTU implementation is benchmarked against the "rectools" library's reference HSTU and SASRec, with results reported as HR@10 and NDCG@10. The post also covers the M-FALCON inference cache, for which it reports a 210x speedup, and the model is sized to train on a single GPU such as a Colab T4.

Key takeaway

For Machine Learning Engineers building sequential recommender systems, this HSTU implementation is a working starting point for modeling user behavior sequences. Consider fusing item and action embeddings and using SiLU attention with Relative Attention Bias to make the model more expressive. The M-FALCON inference cache matters for serving: the article reports a 210x speedup with it, which makes HSTU a viable option for production-scale recommendation engines.

Key insights

HSTU fuses item and action embeddings with time-aware attention for robust sequential recommendation.

Principles

Fuse item and action embeddings via MLP for richer interaction signals.
Incorporate relative position and log-spaced time biases in attention.
Replace softmax with SiLU in attention to preserve engagement intensity.
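
The three principles above can be sketched together in a few lines of PyTorch. This is a minimal illustration, not the article's code: the names FusedEmbeddingSketch, log_time_bucket, and silu_attention, the natural-log bucketing, the division by sequence length, and all sizes are assumptions made for the example. A full block would also project the input into separate query, key, and value tensors; here the fused embedding is reused directly to keep the sketch short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedEmbeddingSketch(nn.Module):
    """Hypothetical fused input layer: concat item and action embeddings, then an MLP."""

    def __init__(self, num_items: int, num_actions: int, dim: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        self.action_emb = nn.Embedding(num_actions, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, items, actions):
        fused = torch.cat([self.item_emb(items), self.action_emb(actions)], dim=-1)
        return self.mlp(fused)


def log_time_bucket(delta_seconds, num_buckets):
    """Map non-negative time gaps to log-spaced integer buckets (assumed scheme)."""
    buckets = torch.log(delta_seconds.clamp(min=1).float()).long()
    return buckets.clamp(max=num_buckets - 1)


def silu_attention(q, k, v, bias):
    """Causal pointwise attention: SiLU(Q K^T + bias) V, with no softmax.

    q, k, v: (batch, seq_len, dim); bias: (batch, seq_len, seq_len).
    """
    n = q.size(1)
    scores = torch.bmm(q, k.transpose(1, 2)) + bias
    weights = F.silu(scores) / n  # scale by length; rows are not forced to sum to 1
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    weights = weights.masked_fill(~causal, 0.0)  # future positions contribute nothing
    return torch.bmm(weights, v), weights


torch.manual_seed(0)
B, N, D, NUM_BUCKETS = 2, 5, 8, 16
items = torch.randint(0, 100, (B, N))
actions = torch.randint(0, 3, (B, N))
timestamps = torch.arange(N).repeat(B, 1) * 3600  # toy data: one event per hour

x = FusedEmbeddingSketch(100, 3, D)(items, actions)

# Time bias: one learned scalar per log-spaced bucket of |t_i - t_j|.
delta = (timestamps[:, :, None] - timestamps[:, None, :]).abs()
time_emb = nn.Embedding(NUM_BUCKETS, 1)
time_bias = time_emb(log_time_bucket(delta, NUM_BUCKETS)).squeeze(-1)

# Position bias: one learned scalar per relative offset i - j, shifted to be >= 0.
pos = torch.arange(N)
rel_pos = pos[:, None] - pos[None, :] + (N - 1)
pos_emb = nn.Embedding(2 * N - 1, 1)
bias = time_bias + pos_emb(rel_pos).squeeze(-1)  # (N, N) broadcasts over the batch

out, weights = silu_attention(x, x, x, bias)
print(out.shape)  # torch.Size([2, 5, 8])
```

Because SiLU is applied elementwise, a history with many strong matches produces a larger output than one with few, which softmax would normalize away. That is the sense in which engagement intensity is preserved.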

Method

Process MovieLens-1M ratings into (item, action, time) triples, mapping ratings to POSITIVE/NEUTRAL/NEGATIVE actions. Implement FusedInputEmbedding, RelativeAttentionBias, HSTUBlock, and HSTUEncoder in PyTorch. Train with multi-task retrieval and rating heads.
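
The preprocessing step can be sketched in plain Python. The rating thresholds (4 and above as POSITIVE, 3 as NEUTRAL, 2 and below as NEGATIVE), the integer action ids, and the helper names are assumptions for illustration; the summary does not state the article's exact mapping. The input rows mirror the (user, item, rating, timestamp) fields of the MovieLens-1M ratings file.

```python
from collections import defaultdict

POSITIVE, NEUTRAL, NEGATIVE = 0, 1, 2  # hypothetical action ids


def rating_to_action(rating):
    """Assumed mapping from a 1-5 star rating to a coarse action."""
    if rating >= 4:
        return POSITIVE
    if rating == 3:
        return NEUTRAL
    return NEGATIVE


def build_sequences(rows):
    """Group (user, item, rating, timestamp) rows into time-ordered (item, action, time) triples."""
    by_user = defaultdict(list)
    for user, item, rating, ts in rows:
        by_user[user].append((ts, item, rating_to_action(rating)))
    return {
        user: [(item, action, ts) for ts, item, action in sorted(events)]
        for user, events in by_user.items()
    }


def leave_one_out(sequences):
    """Hold out each user's last interaction for testing; train on the rest."""
    train, test = {}, {}
    for user, seq in sequences.items():
        if len(seq) < 2:  # nothing left to train on, so skip this user
            continue
        train[user], test[user] = seq[:-1], seq[-1]
    return train, test


rows = [
    (1, 10, 5, 300),
    (1, 11, 3, 100),
    (1, 12, 1, 200),
    (2, 10, 4, 50),
]
sequences = build_sequences(rows)
train, test = leave_one_out(sequences)
print(train[1])  # [(11, 1, 100), (12, 2, 200)]
print(test[1])   # (10, 0, 300)
```

Sorting by timestamp before splitting is what makes the held-out item the chronologically last one, so the test never leaks future interactions into training.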

In practice

Train HSTU on MovieLens-1M with D=64, 2 layers on a Colab T4.
Benchmark custom HSTU against "rectools" HSTU and SASRec.
Implement M-FALCON for 210x inference speedup.
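
For the benchmark step, HR@10 and NDCG@10 under a leave-one-out split reduce to simple per-user formulas, since each user has exactly one held-out item. The sketch below uses the standard definitions; the function names and the toy ranking are illustrative and not taken from the article or from "rectools".

```python
import math


def hit_rate_at_k(ranked_items, target, k=10):
    """1.0 if the held-out item appears in the top k, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG with a single relevant item: 1 / log2(rank + 2) for a 0-based rank.

    The ideal DCG is 1.0 (target ranked first), so no further normalization is needed.
    """
    top = ranked_items[:k]
    if target not in top:
        return 0.0
    return 1.0 / math.log2(top.index(target) + 2)


def mean_metric(metric, rankings, targets, k=10):
    """Average a per-user metric over all users that have a held-out target."""
    return sum(metric(rankings[u], targets[u], k) for u in targets) / len(targets)


# Toy example: user "a" has the target at rank 3, user "b" misses the top 10.
rankings = {"a": [7, 3, 42, 9], "b": list(range(100, 120))}
targets = {"a": 42, "b": 5}
print(mean_metric(hit_rate_at_k, rankings, targets))  # 0.5
print(mean_metric(ndcg_at_k, rankings, targets))      # 0.25
```

HR@10 only records whether the item was found, while NDCG@10 also rewards ranking it higher, which is why the two can diverge when comparing models.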

Topics

HSTU
PyTorch
Recommender Systems
Sequential Recommendation
Attention Mechanisms
M-FALCON Inference

Code references

MobileTeleSystems/RecTools

Best for: Machine Learning Engineer, AI Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by MLWhiz: Recs|ML|GenAI.