Deploying a Multistage Multimodal Recommender System on Amazon Elastic Kubernetes Service

2026-05-19 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, extended

Summary

The article details the design and deployment of a production-ready multistage, multimodal recommender system on Amazon Elastic Kubernetes Service (EKS). The system is built for an e-commerce platform with 300,000 users and 2,465 items, and processes 13 million interactions over 14 days. Its architecture has four stages: a Two-Tower model for candidate generation, a Bloom filter that excludes items the user has already seen, a DLRM ranker that combines tabular features with CLIP image embeddings and Sentence-BERT text embeddings, and a final reranking stage. The MLOps pipeline uses Kubeflow for full training and daily fine-tuning, NVIDIA Merlin for GPU-accelerated tasks, FAISS for approximate nearest neighbor (ANN) indexing, and Feast for feature management. The key latency optimization is in-memory item feature caching, which reduced "feast_item_lookup" latency by 99.7% and end-to-end latency by 54%. Capacity is managed by autoscaling with Kubernetes HPA and Karpenter.

Key takeaway

MLOps engineers deploying large-scale recommender systems should prioritize a multistage architecture to manage catalog growth, and cache item features in memory to cut inference latency (54% end to end in the article's measurements). Use Kubeflow pipelines for both the initial deployment and daily fine-tuning, so that models adapt to new user interactions without full retraining. Use Kubernetes HPA with custom metrics to scale dynamically as request load varies.
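The caching idea in the takeaway can be sketched as follows. This is a minimal illustration, not the article's implementation: the class name `ItemFeatureCache` and the loader `fake_feature_store` are hypothetical, and the loader stands in for a bulk read from the online feature store. It assumes the whole catalog fits in process memory, which is plausible for a catalog of 2,465 items.

```python
from typing import Callable, Dict, Iterable, List

ItemFeatures = Dict[str, float]


class ItemFeatureCache:
    """Keeps every item's features in process memory (hypothetical helper)."""

    def __init__(self, load_all: Callable[[], Dict[int, ItemFeatures]]):
        self._load_all = load_all
        self._features: Dict[int, ItemFeatures] = {}
        self.refresh()

    def refresh(self) -> None:
        # Build the new snapshot first, then swap the reference, so that
        # lookups never observe a half-loaded cache.
        self._features = dict(self._load_all())

    def lookup(self, item_ids: Iterable[int]) -> List[ItemFeatures]:
        # A dictionary read replaces a per-request feature store call.
        # Unknown items get an empty feature dict.
        return [self._features.get(item_id, {}) for item_id in item_ids]


def fake_feature_store() -> Dict[int, ItemFeatures]:
    # Assumption: stands in for one bulk read of all item features.
    return {1: {"price": 9.99}, 2: {"price": 24.5}}


cache = ItemFeatureCache(fake_feature_store)
features = cache.lookup([2, 1, 404])
```

Calling `refresh()` after each daily fine-tuning run would be one way to keep the snapshot current; the refresh schedule here is an assumption.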

Key insights

A multistage, multimodal recommender system can be deployed on EKS by combining retrieval, filtering, ranking, and reranking models with an MLOps pipeline that provides scalability and daily adaptation to new interactions.

Principles

Multistage design scales recommendations for large catalogs.
In-memory caching cut item feature lookup latency by 99.7%.
Daily fine-tuning updates models without full rebuilds.
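The multistage principle can be sketched as a funnel in which each stage narrows the candidate set before the next, more expensive stage runs. This is a toy sketch under stated assumptions: brute-force dot products stand in for the Two-Tower model with a FAISS index, an exact set stands in for the Bloom filter, a score table stands in for the DLRM ranker, and the one-item-per-category rerank rule is purely illustrative (the summary does not say what the reranking stage does). All function names and data are hypothetical.

```python
from typing import Dict, List, Sequence, Set

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec: Vector, item_vecs: Dict[int, Vector], k: int) -> List[int]:
    # Stage 1: top-k items by embedding similarity (brute force here).
    ranked = sorted(item_vecs, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    return ranked[:k]


def drop_seen(candidates: List[int], seen: Set[int]) -> List[int]:
    # Stage 2: exclude items the user has already interacted with.
    return [i for i in candidates if i not in seen]


def rank(candidates: List[int], score: Dict[int, float]) -> List[int]:
    # Stage 3: order the survivors by a (placeholder) ranking score.
    return sorted(candidates, key=lambda i: score.get(i, 0.0), reverse=True)


def rerank(ranked: List[int], category: Dict[int, str], n: int) -> List[int]:
    # Stage 4: illustrative business rule, at most one item per category.
    picked: List[int] = []
    used: Set[str] = set()
    for item in ranked:
        cat = category.get(item, "unknown")
        if cat in used:
            continue
        picked.append(item)
        used.add(cat)
        if len(picked) == n:
            break
    return picked


def recommend(user_vec, item_vecs, seen, score, category, k, n) -> List[int]:
    return rerank(rank(drop_seen(retrieve(user_vec, item_vecs, k), seen), score), category, n)


item_vecs = {1: (1.0, 0.0), 2: (0.9, 0.1), 3: (0.0, 1.0), 4: (0.8, 0.2)}
result = recommend(
    user_vec=(1.0, 0.0),
    item_vecs=item_vecs,
    seen={1},
    score={2: 0.3, 3: 0.5, 4: 0.7},
    category={2: "shoes", 3: "bags", 4: "bags"},
    k=4,
    n=2,
)
print(result)  # [4, 2]
```

Only the first stage touches the full catalog, which is why the design scales as the catalog grows: the ranker's cost depends on `k`, not on the number of items.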

Method

The system uses Kubeflow pipelines for full training and deployment and for daily fine-tuning. NVIDIA Triton Inference Server hosts 14 models in an ensemble DAG, with FAISS handling retrieval and Feast serving as the feature store. Autoscaling is managed by Kubernetes HPA and Karpenter.
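To make the autoscaling step concrete, the sketch below reproduces the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas equal the ceiling of current replicas times the ratio of the observed metric to its target, with no change when the ratio is within a tolerance (0.1 by default). The metric values and replica bounds are made up for illustration; the summary does not say which custom metric the system scales on.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """HPA-style replica calculation (illustrative values, not the article's)."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(2, current_metric=150.0, target_metric=100.0)
steady = desired_replicas(4, current_metric=105.0, target_metric=100.0)
scale_down = desired_replicas(4, current_metric=20.0, target_metric=100.0)
capped = desired_replicas(2, current_metric=1000.0, target_metric=100.0)
print(scale_up, steady, scale_down, capped)  # 3 4 1 10
```

HPA only changes the number of pods. When new pods cannot be scheduled on existing nodes, a node provisioner such as Karpenter is what adds capacity for them.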

In practice

Implement Bloom filters to exclude previously seen items.
Cache item features in-memory to reduce lookup latency.
Use feature masking during training to address cold-start.
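Two of the practices above can be sketched in a few lines. The Bloom filter below is a generic textbook version built on salted SHA-256 hashes, not the article's implementation; its sizes (`num_bits`, `num_hashes`) are arbitrary. The masking function assumes that "feature masking" means randomly replacing selected features with a default value during training so the model learns to score items that lack them; the summary does not give the exact scheme, and the feature names are hypothetical.

```python
import hashlib
import random
from typing import Dict, List


class BloomFilter:
    """Set membership with no false negatives and rare false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> List[int]:
        # Derive several bit positions by salting one hash function.
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def mask_features(
    row: Dict[str, float], maskable: List[str], p: float, rng: random.Random
) -> Dict[str, float]:
    # Assumed scheme: each maskable feature is zeroed with probability p.
    return {
        name: (0.0 if name in maskable and rng.random() < p else value)
        for name, value in row.items()
    }


seen = BloomFilter()
for item_id in ["item_10", "item_42"]:
    seen.add(item_id)

candidates = ["item_7", "item_10", "item_99", "item_42"]
# Seen items are always removed; an unseen item is dropped only on a
# (rare) false positive.
unseen = [c for c in candidates if c not in seen]

row = {"price": 19.0, "item_click_count": 57.0}
masked = mask_features(row, maskable=["item_click_count"], p=1.0, rng=random.Random(0))
```

A false positive in the Bloom filter only hides an item the user has not seen, which is usually an acceptable trade for the compact, fixed-size representation.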

Topics

Multistage Recommenders
Multimodal Embeddings
Amazon EKS
Kubeflow MLOps
Triton Inference Server
Feature Caching

Best for: Machine Learning Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.