Fast & Asynchronous: Drift Your AI, Not Your GPU Bill // Artem Yushkovskiy
Summary
Delivery Hero's AI Ops team developed and open-sourced the Asia framework, a system for managing and scaling complex AI-powered pipelines, particularly near real-time image enhancement. Traditional API-based solutions ran into rate limits, and self-hosted synchronous solutions ran into cost constraints, so the team moved to an asynchronous, actor-based architecture. The Asia framework decouples pipeline steps into independent "actors" that process messages containing a route and a payload. This allows dynamic routing and keeps GPU utilization efficient, because an actor is scaled up only while it has work to process. By self-hosting models on GPUs within a Kubernetes cluster, the team raised achievable throughput, removed rate limits, reduced costs, and avoided vendor lock-in.
Key takeaway
For AI Architects designing scalable and cost-effective AI inference systems, the Asia framework offers a compelling alternative to traditional batch or synchronous API-based pipelines. You should consider adopting an asynchronous, actor-based approach to decouple complex AI workflows, optimize GPU utilization, and mitigate rate limiting issues. Evaluate Asia for near real-time applications requiring dynamic routing and high throughput, especially in environments with large-scale, fluctuating workloads.
Key insights
Asynchronous actor-based architectures can significantly improve scalability and cost-efficiency for AI inference pipelines.
Principles
- Decouple pipeline steps into independent actors.
- Process messages with dynamic routing and enrichment.
- Scale compute resources based on actual workload.
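The third principle can be illustrated with a small scaling rule. This is a minimal sketch, not the framework's actual autoscaling logic: the function name desired_replicas and its parameters (queue_depth, per_replica, max_replicas) are hypothetical, and the rule simply assumes that the number of actor replicas follows the number of messages waiting in that actor's queue, dropping to zero when the queue is empty.

```python
import math


def desired_replicas(queue_depth: int, per_replica: int, max_replicas: int) -> int:
    """Return how many replicas an actor should run for a given backlog.

    All parameter names are illustrative assumptions:
      queue_depth  - messages currently waiting for this actor
      per_replica  - backlog one replica is expected to absorb
      max_replicas - hard upper bound (for example, available GPUs)
    """
    if per_replica <= 0 or max_replicas < 0:
        raise ValueError("per_replica must be positive and max_replicas non-negative")
    if queue_depth <= 0:
        # No pending work: scale to zero so no GPU sits idle.
        return 0
    # Round up so a partial batch still gets a replica, then cap at the limit.
    return min(max_replicas, math.ceil(queue_depth / per_replica))
```

With per_replica=10 and max_replicas=5, a backlog of 25 messages asks for 3 replicas, while an empty queue asks for none.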
Method
The Asia framework uses messages with a defined route and payload, processed by independent, self-scaling actors. Actors enrich the payload and pass it to the next step, with routers dynamically altering the message's future path.
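The message flow described above might look roughly like the following. This is an illustrative sketch only and does not reproduce the framework's real API: the Route and Message classes, the actor names (check, deblur, upscale), and the run loop are all hypothetical. In a real deployment each actor would read from its own queue; here an in-process loop stands in for the queues so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Route:
    steps: List[str]   # ordered actor names still to be applied
    current: int = 0   # index of the step being processed now


@dataclass
class Message:
    route: Route
    payload: dict = field(default_factory=dict)


def check(msg: Message) -> Message:
    """Router: inspects the payload and rewrites only the future path."""
    if msg.payload.get("blurry"):
        nxt = msg.route.current + 1
        msg.route.steps[nxt:nxt] = ["deblur"]  # insert an extra step after this one
    return msg


def deblur(msg: Message) -> Message:
    """Actor: enriches the payload with its result."""
    msg.payload["deblurred"] = True
    return msg


def upscale(msg: Message) -> Message:
    """Actor: enriches the payload with its result."""
    msg.payload["upscaled"] = True
    return msg


ACTORS: Dict[str, Callable[[Message], Message]] = {
    "check": check,
    "deblur": deblur,
    "upscale": upscale,
}


def run(msg: Message, actors: Dict[str, Callable[[Message], Message]]) -> Message:
    """Deliver the message to each actor named in its route, in order."""
    while msg.route.current < len(msg.route.steps):
        actor = actors[msg.route.steps[msg.route.current]]
        msg = actor(msg)
        msg.route.current += 1
    return msg
```

A message that starts with the route ["check", "upscale"] and a payload marked blurry ends with the route ["check", "deblur", "upscale"], because the router changed the path while the message was in flight. The actors themselves never need to know about each other.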
In practice
- Implement asynchronous processing for AI inference.
- Utilize Kubernetes for actor deployment and autoscaling.
- Employ a stateful HTTP gateway for sync/async integration.
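The last point, a stateful gateway bridging synchronous callers and an asynchronous pipeline, can be sketched as follows. This is a hedged illustration, not the framework's gateway: the Gateway class, its handle_request and on_result methods, and the submit callback are hypothetical names, and HTTP handling is left out entirely. The idea shown is only the state the gateway must keep: a table of pending requests, each resolved when the pipeline's final result arrives.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict


class Gateway:
    """Keeps one pending future per in-flight request (the gateway's state)."""

    def __init__(self) -> None:
        self._pending: Dict[str, asyncio.Future] = {}

    async def handle_request(
        self,
        payload: dict,
        submit: Callable[[str, dict], Awaitable[None]],
        timeout: float = 5.0,
    ) -> dict:
        """Called by the synchronous side: enqueue work, then wait for its result."""
        request_id = uuid.uuid4().hex
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        try:
            await submit(request_id, payload)  # hand off to the async pipeline
            return await asyncio.wait_for(future, timeout)
        finally:
            # Always forget the request, whether it succeeded or timed out.
            self._pending.pop(request_id, None)

    def on_result(self, request_id: str, result: dict) -> None:
        """Called when the pipeline's last actor reports a finished message."""
        future = self._pending.get(request_id)
        if future is not None and not future.done():
            future.set_result(result)


async def demo() -> dict:
    gateway = Gateway()

    async def submit(request_id: str, payload: dict) -> None:
        # Stand-in for the pipeline: deliver an enriched payload a moment later.
        loop = asyncio.get_running_loop()
        loop.call_later(0.01, gateway.on_result, request_id, {**payload, "enhanced": True})

    return await gateway.handle_request({"image": "a.jpg"}, submit)
```

Running asyncio.run(demo()) returns the original payload with the extra "enhanced" key, while the caller only ever sees an ordinary request and response.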
Topics
- Asynchronous Actors
- Distributed AI Systems
- MLOps
- GPU Optimization
- Kubernetes Autoscaling
Best for: AI Architect, Machine Learning Engineer, MLOps Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by MLOps.community.