CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction
Summary
CTR-Sink is a framework that addresses semantic fragmentation in Language Model (LM)-based Click-Through Rate (CTR) prediction over user behavior sequences. Because user actions are discrete and the separators between them carry no semantic content, LM attention scatters across the sequence. CTR-Sink introduces behavior-level attention sinks: special tokens, fused with recommendation-specific signals such as temporal distance, inserted between consecutive behaviors. A two-stage training strategy guides LM attention toward these sink tokens, and a sink-specific attention mechanism amplifies dependencies between sinks. Experiments on one industrial dataset and two open-source datasets (MovieLens, KuaiRec) show consistent AUC improvements. For RoBERTa, the gains were 0.46% (industrial), 0.36% (MovieLens), and 0.59% (KuaiRec); for Qwen, they were 0.34%, 0.26%, and 0.42% respectively, supporting its effectiveness across encoder and decoder architectures.
Key takeaway
Machine Learning Engineers who use Language Models for Click-Through Rate prediction, particularly over long user behavior sequences, should consider the CTR-Sink framework. Inserting recommendation-specific sink tokens and training in two stages directly addresses semantic fragmentation, improving attention focus and the modeling of correlations between behaviors. In the reported experiments, AUC improved by 0.26% to 0.59% across three datasets and two LM architectures (RoBERTa and Qwen), which makes the approach a reasonable candidate for improving recommendation accuracy on extended user histories.
Key insights
CTR-Sink mitigates semantic fragmentation in LM-based CTR prediction by introducing behavior-level attention sinks for discrete user sequences.
Principles
- User behavior sequences require explicit attention sinks for LMs.
- Recommendation-specific signals improve attention sink efficacy.
- Two-stage training enhances decoder LM attention to sink tokens.
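The two-stage idea can be illustrated with a small attention-mask sketch. This is only one possible reading of "predict using only sink tokens", not the paper's implementation: the helper names `key_mask` and `masked_softmax` are hypothetical, and restricting the attention keys to sink positions in stage 1 is an assumption.

```python
import math

def key_mask(kinds, stage):
    """Hypothetical helper: decide which positions may serve as attention keys.

    Assumption: stage 1 exposes only sink positions, stage 2 exposes everything.
    """
    if stage == 1:
        return [kind == "sink" for kind in kinds]
    if stage == 2:
        return [True] * len(kinds)
    raise ValueError("stage must be 1 or 2")

def masked_softmax(scores, mask):
    """Softmax over the positions where mask is True; masked positions get weight 0."""
    if not any(mask):
        raise ValueError("mask must keep at least one position")
    top = max(s for s, keep in zip(scores, mask) if keep)
    exps = [math.exp(s - top) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sequence: behaviors separated by sink tokens, with uniform raw scores.
kinds = ["behavior", "sink", "behavior", "sink"]
scores = [0.0, 0.0, 0.0, 0.0]

stage1_weights = masked_softmax(scores, key_mask(kinds, stage=1))
stage2_weights = masked_softmax(scores, key_mask(kinds, stage=2))
```

In this toy setup, stage 1 forces all attention mass onto the sink positions, while stage 2 lets the model attend to the full sequence again.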
Method
Insert [SINK] tokens between behaviors, with embeddings produced by an MLP that incorporates temporal distance. Train in two stages: first predict using only the [SINK] tokens, then optimize on the full sequence. Strengthen inter-sink attention with a bias matrix.
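A minimal sketch of the sink-insertion step, in plain Python, may help make it concrete. Everything here is illustrative rather than the paper's code: the function names, the one-hidden-layer MLP, the `log1p` feature transform, and the choice to measure temporal distance as the gap between the two behaviors a sink separates are all assumptions.

```python
import math
import random

def init_mlp(hidden_dim, out_dim, seed=0):
    """Randomly initialise a tiny scalar-input MLP (hypothetical stand-in for a learned one)."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    b2 = [0.0] * out_dim
    return w1, b1, w2, b2

def sink_embedding(gap_seconds, mlp):
    """Map a non-negative time gap to a sink embedding vector."""
    w1, b1, w2, b2 = mlp
    x = math.log1p(gap_seconds)  # assumed transform to compress long gaps
    hidden = [max(0.0, w * x + b) for w, b in zip(w1, b1)]  # ReLU layer
    return [sum(h * w for h, w in zip(hidden, row)) + b for row, b in zip(w2, b2)]

def interleave_sinks(behaviors, timestamps, mlp):
    """Place one sink between each pair of consecutive behaviors."""
    seq = []
    for i, text in enumerate(behaviors):
        seq.append(("behavior", text))
        if i + 1 < len(behaviors):
            gap = timestamps[i + 1] - timestamps[i]
            seq.append(("sink", sink_embedding(gap, mlp)))
    return seq

mlp = init_mlp(hidden_dim=4, out_dim=3)
sequence = interleave_sinks(
    ["clicked item A", "clicked item B", "clicked item C"],
    [0, 60, 3660],  # timestamps in seconds
    mlp,
)
```

In a real model the sink vectors would replace the embedding of a placeholder token at those positions, and the MLP weights would be trained jointly with the LM.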
In practice
- Integrate temporal distance or semantic similarity into custom sink tokens.
- Implement a two-stage training approach for decoder-based LMs.
- Apply a sink-specific attention mechanism to model inter-behavior associations.
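As a rough illustration of the last point, the sketch below adds a bias to attention logits wherever both the query and the key are sink tokens. It is a simplified, single-head version under stated assumptions: the function names are hypothetical, the bias is a fixed scalar rather than a learned quantity, and a sink's attention to itself is left unbiased.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sink_bias_matrix(is_sink, bias):
    """Bias is applied only where query i and key j are two different sink positions."""
    n = len(is_sink)
    return [
        [bias if is_sink[i] and is_sink[j] and i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]

def biased_attention(scores, is_sink, bias=1.0):
    """Add the sink bias to raw attention scores, then normalise each row."""
    bias_rows = sink_bias_matrix(is_sink, bias)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias_rows)
    ]

# Toy example: positions 1 and 3 are sinks, and all raw scores are equal.
is_sink = [False, True, False, True]
scores = [[0.0] * 4 for _ in range(4)]
attn = biased_attention(scores, is_sink, bias=1.0)
```

With equal raw scores, rows for ordinary behavior tokens stay uniform, while each sink row shifts weight toward the other sink, which is the inter-sink amplification the bullet describes.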
Topics
- Click-Through Rate Prediction
- Language Models
- Attention Sink
- Recommendation Systems
- User Behavior Modeling
- RoBERTa
- Qwen
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.