Deploying a Multistage Multimodal Recommender System on Amazon Elastic Kubernetes Service
Summary
The article details the design and deployment of a production-ready multistage, multimodal recommender system on Amazon Elastic Kubernetes Service (EKS). The system is built for an e-commerce platform with 300,000 users and 2,465 items, and processes 13 million interactions over 14 days. It has a four-stage architecture: a Two-Tower model for candidate generation, a Bloom filter that excludes items the user has already seen, a DLRM ranker that combines tabular features with CLIP image embeddings and Sentence-BERT text embeddings, and a final reranking stage. The MLOps pipeline uses Kubeflow for full training and daily fine-tuning, NVIDIA Merlin for GPU-accelerated tasks, FAISS for ANN indexing, and Feast for feature management. Two optimizations stand out: in-memory item feature caching, which reduced "feast_item_lookup" latency by 99.7% and end-to-end latency by 54%, and autoscaling with Kubernetes HPA and Karpenter.
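The four stages can be pictured as a chain of functions. The sketch below is a minimal illustration under stated assumptions, not the article's implementation: `recommend` and its callable parameters are hypothetical names, and the toy lambdas stand in for the Two-Tower retriever, the Bloom filter, the DLRM ranker, and the reranker.

```python
from typing import Callable


def recommend(
    user_id: str,
    retrieve: Callable[[str, int], list[str]],
    seen: Callable[[str, str], bool],
    score: Callable[[str, str], float],
    rerank: Callable[[list[tuple[str, float]]], list[str]],
    n_candidates: int = 100,
    top_k: int = 10,
) -> list[str]:
    """Run the four stages in order and return the final top_k item ids."""
    # Stage 1: candidate generation (Two-Tower + ANN search in the article).
    candidates = retrieve(user_id, n_candidates)
    # Stage 2: drop items the user has already seen (a Bloom filter in the article).
    unseen = [item for item in candidates if not seen(user_id, item)]
    # Stage 3: score each remaining candidate (DLRM in the article), best first.
    scored = sorted(
        ((item, score(user_id, item)) for item in unseen),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Stage 4: final reranking, then truncate to the requested list length.
    return rerank(scored)[:top_k]


# Toy stand-ins (hypothetical data) so the sketch runs end to end.
catalog = ["a", "b", "c", "d", "e"]
toy_scores = {"a": 0.9, "b": 0.1, "c": 0.7, "d": 0.4, "e": 0.8}
seen_pairs = {("u1", "a")}

result = recommend(
    "u1",
    retrieve=lambda user, n: catalog[:n],
    seen=lambda user, item: (user, item) in seen_pairs,
    score=lambda user, item: toy_scores[item],
    rerank=lambda scored: [item for item, _ in scored],
    top_k=3,
)
print(result)  # ['e', 'c', 'd']
```

The point of the shape is that only the cheap first stage touches the whole catalog; the expensive ranker sees at most `n_candidates` items per request.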
Key takeaway
MLOps engineers deploying large-scale recommender systems should prioritize a multistage architecture to manage catalog growth, and add in-memory feature caching to cut inference latency (the article reports a 54% reduction in end-to-end latency). Implement Kubeflow pipelines for both initial deployment and daily fine-tuning, so that models adapt to new user interactions without full retraining. Use Kubernetes HPA with custom metrics to autoscale dynamically as request load varies.
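To make the autoscaling advice concrete, here is a small sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The function name, the replica bounds, and the "requests per pod" metric are illustrative assumptions; the real controller also applies stabilization windows and handles unready pods, which this sketch ignores.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA scaling rule for a single metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical values here).
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical custom metric: in-flight requests per pod, target of 100.
print(desired_replicas(4, current_metric=150, target_metric=100))  # 6
print(desired_replicas(4, current_metric=105, target_metric=100))  # 4
print(desired_replicas(4, current_metric=40, target_metric=100))   # 2
```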
Key insights
A robust multistage, multimodal recommender system can be deployed on EKS, integrating diverse models and MLOps for scalability and real-time adaptation.
Principles
- Multistage design scales recommendations for large catalogs.
- In-memory caching drastically cuts feature lookup latency.
- Daily fine-tuning updates models without full rebuilds.
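The caching principle can be sketched in a few lines. This is a minimal illustration, assuming a slow per-item lookup function; `CachedItemFeatures`, `slow_lookup`, and the TTL value are hypothetical and do not reflect how the article wires its cache into the serving path.

```python
import time
from typing import Callable


class CachedItemFeatures:
    """Keep item features in process memory, refreshing each entry after a TTL."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float = 3600.0,
    ) -> None:
        self._fetch = fetch          # slow path, e.g. an online feature store call
        self._ttl = ttl_seconds
        self._cache: dict[str, tuple[float, dict]] = {}

    def get(self, item_id: str) -> dict:
        now = time.monotonic()
        entry = self._cache.get(item_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fast path: served from memory
        features = self._fetch(item_id)
        self._cache[item_id] = (now, features)
        return features


# Hypothetical slow lookup that records how often it is actually called.
calls: list[str] = []


def slow_lookup(item_id: str) -> dict:
    calls.append(item_id)
    return {"item_id": item_id, "price": 9.99}


features = CachedItemFeatures(slow_lookup)
first = features.get("item_42")
second = features.get("item_42")   # served from the cache, no second lookup
print(len(calls))  # 1
```

A catalog of 2,465 items is small enough to hold entirely in memory, which is why this trade is cheap here; the TTL bounds how stale a cached feature can get between refreshes.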
Method
The system uses Kubeflow pipelines for full training and deployment as well as for daily fine-tuning. NVIDIA Triton Inference Server hosts 14 models in an ensemble DAG, with FAISS handling retrieval and Feast serving as the feature store. Autoscaling is managed by Kubernetes HPA and Karpenter.
In practice
- Implement Bloom filters to exclude previously seen items.
- Cache item features in-memory to reduce lookup latency.
- Use feature masking during training to address cold-start.
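As an illustration of the first practice, the sketch below implements a small Bloom filter with the standard library and uses it to drop seen items from a candidate list. The class, its sizing parameters, and the item ids are hypothetical; the article's own filter and its configuration are not shown in this summary.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Compact set membership: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# One filter per user in this sketch; record what the user has interacted with.
seen = BloomFilter()
for item_id in ["item_1", "item_7"]:
    seen.add(item_id)

candidates = ["item_1", "item_3", "item_7", "item_9"]
unseen = [item for item in candidates if item not in seen]
```

Because a Bloom filter never reports a stored key as absent, a seen item is never recommended again. The cost is that an unseen item is occasionally filtered out by a false positive, which is usually an acceptable trade for the memory saved.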
Topics
- Multistage Recommenders
- Multimodal Embeddings
- Amazon EKS
- Kubeflow MLOps
- Triton Inference Server
- Feature Caching
Code references
- facebookresearch/faiss
- aws-samples/retail-demo-store
- MustaphaU/multistage-recommender-system-on-kubernetes
- NVIDIA-Merlin/dataloader
Best for: Machine Learning Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.