Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation
Summary
Taiji is an LLM-as-Enhancer framework for industrial recommender systems that addresses two obstacles to scaling such systems with large language models. First, chain-of-thought (CoT) quality in open-domain recommendation is hard to measure and improve during supervised fine-tuning (SFT); Taiji handles this with reverse-engineered reasoning and open-ended rejection sampling. Second, reinforcement learning (RL) alignment must reconcile LLM semantic knowledge with collaborative ID features; Taiji proposes Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights to balance the two. Deployed on Kuaishou's advertising platform since May 2026, Taiji serves over 400 million users daily and is reported to generate significant commercial revenue while scaling robustly.
Key takeaway
For machine learning engineers building LLM-enhanced recommender systems, Taiji offers an approach to two common alignment problems. Consider adaptive reward optimization such as Pareto Optimal Policy Optimization (POPO) to balance LLM semantic knowledge against collaborative ID signals of user preference. Also explore reverse-engineered reasoning and rejection sampling to produce higher-quality chain-of-thought data for supervised fine-tuning, which may improve recommendation performance and scalability in web-scale environments.
Key insights
Taiji optimizes LLM-enhanced recommendations by balancing semantic and ID-based rewards and improving CoT quality.
Principles
- Aligning LLM semantics with recommender IDs is crucial.
- CoT quality is a bottleneck in open-domain SFT.
- Adaptive reward weighting optimizes cross-domain trade-offs.
Method
Taiji uses reverse-engineered reasoning and open-ended rejection sampling for CoT data, and Pareto Optimal Policy Optimization (POPO) for adaptive cross-domain reward weighting.
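The CoT data step can be illustrated with a minimal sketch. The summary does not specify Taiji's prompts, acceptance criteria, or what makes its rejection sampling "open-ended", so everything here is an assumption: the `Candidate` type, the `rejection_sample` function, and the toy generator are hypothetical names, and the acceptance rule (keep a sample only if its predicted item matches the item the user actually engaged with) is one plausible reading of reverse-engineering reasoning from a known outcome.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """One sampled chain of thought and the item it ends up recommending."""
    reasoning: str
    predicted_item: str


def rejection_sample(
    generate: Callable[[str], Candidate],
    prompt: str,
    target_item: str,
    num_samples: int = 8,
) -> List[Candidate]:
    """Sample num_samples candidates and keep those that reach the known target.

    Assumption: a candidate is accepted when its prediction equals the
    observed target and its reasoning text is non-empty.
    """
    kept: List[Candidate] = []
    for _ in range(num_samples):
        cand = generate(prompt)
        if cand.predicted_item == target_item and cand.reasoning.strip():
            kept.append(cand)
    return kept


def make_toy_generator(seed: int) -> Callable[[str], Candidate]:
    """Hypothetical stand-in for an LLM call; picks an item at random."""
    rng = random.Random(seed)
    items = ["running shoes", "tennis racket", "yoga mat"]

    def generate(prompt: str) -> Candidate:
        item = rng.choice(items)
        return Candidate(
            reasoning=f"The user's history suggests interest in {item}.",
            predicted_item=item,
        )

    return generate


if __name__ == "__main__":
    # The prompt carries the user context; the target is the observed outcome.
    prompt = "User watched three yoga videos. Explain what ad they would click."
    accepted = rejection_sample(make_toy_generator(0), prompt, "yoga mat", 20)
    for cand in accepted:
        print(cand.reasoning)
```

In a real pipeline the toy generator would be replaced by a model call, and the accepted candidates would become SFT training examples. A stricter filter, such as a separate quality scorer, could be layered on top of the match check.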
In practice
- Generate CoT data via reverse-engineered reasoning.
- Employ rejection sampling for high-quality CoT.
- Use POPO for adaptive reward balancing.
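The adaptive reward balancing in the last item can be sketched in a generic form. The summary does not give POPO's update rule, so this is not the authors' algorithm. As a stand-in, it uses the closed-form min-norm weighting for two objectives known from multiple-gradient descent: choose the convex combination of the two objective gradients with the smallest norm, then reuse those weights to mix the rewards. The names `min_norm_weights` and `combined_reward`, and the idea of representing each reward by a gradient vector, are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def min_norm_weights(
    g_sem: Sequence[float], g_id: Sequence[float]
) -> Tuple[float, float]:
    """Return (w_sem, w_id) minimizing ||w_sem * g_sem + w_id * g_id||.

    The weights are non-negative and sum to 1. For two objectives the
    minimizer has a closed form, clipped to the interval [0, 1].
    """
    diff = [a - b for a, b in zip(g_sem, g_id)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: any split is optimal, so use an even one.
        return 0.5, 0.5
    # Unconstrained minimizer: alpha = g_id . (g_id - g_sem) / ||g_sem - g_id||^2
    alpha = sum((b - a) * b for a, b in zip(g_sem, g_id)) / denom
    alpha = min(1.0, max(0.0, alpha))
    return alpha, 1.0 - alpha


def combined_reward(r_sem: float, r_id: float, w_sem: float, w_id: float) -> float:
    """Weighted sum of the semantic reward and the ID-based reward."""
    return w_sem * r_sem + w_id * r_id


if __name__ == "__main__":
    # Orthogonal gradients of equal size yield an even split.
    w_sem, w_id = min_norm_weights([1.0, 0.0], [0.0, 1.0])
    print(w_sem, w_id)  # 0.5 0.5
    print(combined_reward(r_sem=1.0, r_id=3.0, w_sem=w_sem, w_id=w_id))  # 2.0
```

Recomputing the weights at every training step is what makes the balance adaptive: when one objective's gradient grows much larger than the other's, the min-norm rule shifts weight toward the smaller one so that neither domain dominates the update.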
Topics
- Large Language Models
- Recommender Systems
- Pareto Optimization
- Chain-of-Thought
- Reinforcement Learning
- Kuaishou Advertising
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.