TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds
Summary
TokenFormer is a recommendation architecture designed to unify multi-field categorical features and sequential user behavior dynamics. It targets "Sequential Collapse Propagation" (SCP), an issue in which non-sequence fields degrade sequence features. The model introduces a "Bottom-Full-Top-Sliding" (BFTS) attention scheme that uses full self-attention in lower layers and shrinking-window sliding attention in upper layers. It also incorporates a "Non-Linear Interaction Representation" (NLIR) that applies one-sided non-linear multiplicative transformations to hidden states. In experiments on public benchmarks such as KuaiRand-27K, and on Tencent's advertising platform, TokenFormer achieves state-of-the-art performance: the tiny version outperforms the Transformer baseline by 5.00‰ AUC and HSTU-Ultra by 2.05‰. In online A/B tests in the WeChat Channels advertising system from January to February 2026, it achieves a 4.03% uplift in GMV.
Key takeaway
For AI engineers and research scientists building large-scale recommender systems, TokenFormer offers a blueprint for unified modeling. Its BFTS attention and NLIR mechanisms mitigate "Sequential Collapse Propagation," improving both accuracy and dimensional robustness. Consider adopting this architecture to improve performance and efficiency, especially in data-rich industrial environments, where it demonstrates sustained scaling benefits.
Key insights
TokenFormer unifies multi-field and sequential recommendation by mitigating "Sequential Collapse Propagation" through novel attention and non-linear interaction mechanisms.
Principles
- Unified modeling of all interaction types is crucial.
- Hierarchical attention scopes improve efficiency and robustness.
- Non-linear multiplicative interactions enhance representation discriminability.
Method
TokenFormer unifies static fields, behavior tokens, and target attributes into a single stream, processed by stacked Unified Interaction Blocks (UIBs) with BFTS attention and NLIR for multiplicative feature interaction.
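The BFTS layer schedule described above can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `bfts_masks`, its parameters, the causal form of the sliding window, and the halving rule for shrinking the window are all hypothetical choices made for illustration. The summary only states that lower layers use full self-attention and upper layers use sliding attention with a shrinking window.

```python
import numpy as np

def bfts_masks(seq_len, num_layers, num_full_layers, base_window, shrink=2):
    """Return one boolean attention mask per layer (True = may attend).

    Assumptions (not from the source): the sliding window is causal, and
    the window is divided by `shrink` after each sliding layer.
    """
    idx = np.arange(seq_len)
    # distance[i, j] = i - j; non-negative when token j is at or before token i
    distance = idx[:, None] - idx[None, :]
    masks = []
    window = base_window
    for layer in range(num_layers):
        if layer < num_full_layers:
            # Bottom layers: every token attends to every other token.
            mask = np.ones((seq_len, seq_len), dtype=bool)
        else:
            # Top layers: each token attends to itself and the previous
            # window - 1 tokens only.
            mask = (distance >= 0) & (distance < window)
            window = max(1, window // shrink)
        masks.append(mask)
    return masks

masks = bfts_masks(seq_len=8, num_layers=4, num_full_layers=2, base_window=4)
for layer, mask in enumerate(masks):
    print(layer, int(mask[-1].sum()))  # keys visible to the last token
```

Each mask can be passed to any attention implementation that accepts a boolean mask. The upper layers touch fewer key positions per query, which is where the efficiency gain for long sequences would come from in a scheme like this.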
In practice
- Use BFTS attention for efficient long sequence modeling.
- Implement NLIR to prevent representation collapse.
- Consider a decoupled serving strategy for efficiency.
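The NLIR item above refers to a one-sided non-linear multiplicative transformation of hidden states. The sketch below shows one plausible reading of that phrase: two linear projections of the same hidden state are multiplied elementwise, with a non-linearity applied to only one of them. The function name `nlir`, the weight names, and the choice of SiLU as the non-linearity are assumptions for illustration. The summary does not give the exact formula.

```python
import numpy as np

def nlir(hidden, w_linear, w_gate):
    """One-sided multiplicative interaction (illustrative assumption).

    hidden:   (num_tokens, d_model) hidden states
    w_linear: (d_model, d_out) projection left linear
    w_gate:   (d_model, d_out) projection passed through a non-linearity
    """
    linear_branch = hidden @ w_linear
    gate_branch = hidden @ w_gate
    # SiLU (x * sigmoid(x)) applied to one branch only: the "one-sided" part.
    nonlinear_branch = gate_branch / (1.0 + np.exp(-gate_branch))
    # Elementwise product gives the multiplicative feature interaction.
    return linear_branch * nonlinear_branch

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
out = nlir(hidden, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
print(out.shape)  # (5, 8)
```

Because the output is a product of two projections rather than a sum, it is not a linear function of the hidden state, which is the property the summary associates with more discriminative representations.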
Topics
- TokenFormer
- Unified Recommendation
- Sequential Collapse Propagation
- BFTS Attention
- Non-Linear Interaction Representation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.