Serving CTR Recommendation Models with Triton Inference Server using the ONNX Runtime Backend
Summary
The latest ROCm Triton Inference Server release, aligned with upstream version 25.12, introduces ONNX Runtime and Python backend support for AMD GPUs. Users can now serve ONNX models through ONNX Runtime and implement custom inference logic directly in Python for preprocessing, post-processing, and model orchestration. Benchmarks using the FinalNet click-through rate (CTR) prediction model show the AMD Instinct MI355X GPU delivering higher throughput than the NVIDIA B200 Tensor Core GPU: 175.1% higher at concurrency 7, 128.7% higher at concurrency 23, and 122.8% higher at concurrency 47. The article presents these results as evidence that the platform is ready for production AI workloads.
Key takeaway
For AI Engineers and CTOs evaluating inference serving platforms, the updated ROCm Triton Inference Server with ONNX Runtime and Python backend support on AMD Instinct MI355X GPUs is a high-performance option worth considering. Teams can now deploy a broader range of models, including custom Python pipelines, and the FinalNet benchmarks show higher throughput than the NVIDIA B200 at every tested concurrency level. Consider this stack for scalable production inference, especially for demanding workloads such as CTR prediction.
Key insights
ROCm Triton Inference Server now supports ONNX Runtime and Python backends, enhancing AMD GPU deployment flexibility and performance.
Principles
- A unified inference interface is crucial for diverse AI deployments.
- Optimized runtimes expand model serving capabilities.
- Custom logic integration improves pipeline flexibility.
Method
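Deploy an ONNX model by placing the model file in a numbered version subdirectory of its model directory, with a `config.pbtxt` file at the model-directory level. For custom pipelines, implement the inference logic in a `model.py` file served by the Python backend, again with its own `config.pbtxt`.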
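The repository layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the configuration used in the original article: the model names (`finalnet_onnx`, `ctr_preprocess`), the tensor names, the feature width of 39, and the batch size of 64 are all hypothetical placeholders. The `model.py` skeleton only runs inside Triton, because `triton_python_backend_utils` is provided by the server.

```python
from pathlib import Path
import tempfile

# All names, dims, and batch sizes below are hypothetical placeholders,
# not values from the original article.
ONNX_CONFIG = """\
name: "finalnet_onnx"
backend: "onnxruntime"
max_batch_size: 64
input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 39 ]
  }
]
output [
  {
    name: "ctr"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]
"""

PYTHON_CONFIG = """\
name: "ctr_preprocess"
backend: "python"
max_batch_size: 64
input [
  {
    name: "raw"
    data_type: TYPE_FP32
    dims: [ 39 ]
  }
]
output [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 39 ]
  }
]
"""

# Source of a Python-backend model; it is written to disk, not executed here.
MODEL_PY = '''\
import numpy as np
import triton_python_backend_utils as pb_utils  # available inside Triton only


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "raw").as_numpy()
            # Placeholder preprocessing: clip values to a fixed range.
            features = np.clip(raw, -10.0, 10.0).astype(np.float32)
            out = pb_utils.Tensor("features", features)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
'''


def build_repository(root: Path) -> Path:
    """Create <model>/config.pbtxt and <model>/<version>/ for both models."""
    onnx_dir = root / "finalnet_onnx"
    (onnx_dir / "1").mkdir(parents=True)
    (onnx_dir / "config.pbtxt").write_text(ONNX_CONFIG)
    # An exported model would be copied to finalnet_onnx/1/model.onnx.

    py_dir = root / "ctr_preprocess"
    (py_dir / "1").mkdir(parents=True)
    (py_dir / "config.pbtxt").write_text(PYTHON_CONFIG)
    (py_dir / "1" / "model.py").write_text(MODEL_PY)
    return root


if __name__ == "__main__":
    repo = build_repository(Path(tempfile.mkdtemp()) / "model_repository")
    for path in sorted(repo.rglob("*")):
        print(path.relative_to(repo))
```

Once a real `model.onnx` is in place, a repository like this would typically be served with `tritonserver --model-repository=<path>`. The input and output names in each `config.pbtxt` must match the tensors the model actually exposes.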
In practice
- Export PyTorch/TensorFlow models to ONNX for serving.
- Implement pre- and post-processing in the Python backend.
- Use dynamic batching for recommendation models.
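Dynamic batching is enabled per model through a `dynamic_batching` block in `config.pbtxt`. The helper below is a small sketch that renders such a block and checks it against the model's `max_batch_size`. The helper name and the specific batch sizes and queue delay are assumptions chosen for illustration; suitable values depend on the model and traffic and are not given in the original article.

```python
def dynamic_batching_block(preferred_batch_sizes, max_queue_delay_us, max_batch_size):
    """Render a dynamic_batching stanza for a Triton config.pbtxt.

    Hypothetical helper: validates that every preferred size fits within
    max_batch_size before producing the text.
    """
    if max_queue_delay_us < 0:
        raise ValueError("max_queue_delay_us must be non-negative")
    for size in preferred_batch_sizes:
        if not 0 < size <= max_batch_size:
            raise ValueError(
                f"preferred batch size {size} must be in 1..{max_batch_size}"
            )
    sizes = ", ".join(str(size) for size in preferred_batch_sizes)
    return (
        "dynamic_batching {\n"
        f"  preferred_batch_size: [ {sizes} ]\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
    )


if __name__ == "__main__":
    # Illustrative values only: batch up to 16 or 32 requests, waiting at
    # most 200 microseconds for a batch to fill.
    print(dynamic_batching_block([16, 32], 200, max_batch_size=64))
```

Appending the rendered block to a model's `config.pbtxt` lets the server combine individual requests into larger batches. A longer queue delay tends to raise throughput at the cost of per-request latency, so the value is worth tuning against the concurrency levels the service expects.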
Topics
- Triton Inference Server
- ONNX Runtime Backend
- Python Backend
- AMD Instinct MI355X
- CTR Recommendation Models
Best for: AI Engineer, CTO, VP of Engineering/Data, Machine Learning Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.