Fast & Asynchronous: Drift Your AI, Not Your GPU Bill // Artem Yushkovskiy

2026-02-19 · Source: MLOps.community · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, extended

Summary

Delivery Hero's AI Ops team developed and open-sourced the Asia framework, a system for managing and scaling complex AI-powered pipelines, particularly near real-time image enhancement. After running into rate limits and cost constraints with both API-based and self-hosted synchronous AI solutions, the team moved to an asynchronous, actor-based architecture. The Asia framework decouples pipeline steps into independent "actors" that process messages containing a route and a payload. This allows dynamic routing and efficient GPU utilization, because actors scale up only when there is work to process. By self-hosting models on local GPUs within a Kubernetes cluster, the team pushed throughput limits higher, eliminated rate limits, reduced costs, and moved away from vendor lock-in.

Key takeaway

For AI architects designing scalable, cost-effective AI inference systems, the Asia framework offers an alternative to traditional batch or synchronous API-based pipelines. Consider an asynchronous, actor-based approach to decouple complex AI workflows, improve GPU utilization, and mitigate rate limiting. Evaluate Asia for near real-time applications that need dynamic routing and high throughput, especially under large, fluctuating workloads.

Key insights

Asynchronous actor-based architectures can significantly improve scalability and cost-efficiency for AI inference pipelines.

Principles

Decouple pipeline steps into independent actors.
Process messages with dynamic routing and enrichment.
Scale compute resources based on actual workload.
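
The third principle can be illustrated with a queue-depth scaling rule. This is a minimal sketch of a generic policy, not a confirmed description of how Asia scales its actors: the function name desired_replicas and the parameters msgs_per_replica and max_replicas are hypothetical. It shows the key property that an actor with an empty queue needs no replicas, so idle GPUs can be released.

```python
import math


def desired_replicas(queue_depth: int, msgs_per_replica: int, max_replicas: int) -> int:
    """Return how many replicas an actor should run for its current backlog.

    Hypothetical policy: one replica per `msgs_per_replica` queued messages,
    capped at `max_replicas`, and zero replicas when the queue is empty.
    """
    if msgs_per_replica <= 0:
        raise ValueError("msgs_per_replica must be positive")
    if queue_depth <= 0:
        return 0  # no pending work, so no GPU needs to be held
    return min(max_replicas, math.ceil(queue_depth / msgs_per_replica))
```

With a cap of 4 replicas and 10 messages per replica, a backlog of 25 messages asks for 3 replicas, and a backlog of 1000 is clamped to 4.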

Method

The Asia framework uses messages with a defined route and payload, processed by independent, self-scaling actors. Each actor enriches the payload and passes the message to the next step on its route, while routers can rewrite the remaining route at run time.
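
The message flow described here can be sketched in a few lines. This is an illustration under stated assumptions, not Asia's actual API: the Message class, the dispatch loop, the ACTORS registry, and the actor names (quality_router, deblur, upscale) are all hypothetical, and a single in-process loop stands in for the per-actor queues a real deployment would use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    route: list[str]                      # actors still to visit, in order
    payload: dict = field(default_factory=dict)


def quality_router(msg: Message) -> Message:
    # Router: inspects the payload and rewrites the remaining route.
    if msg.payload.get("blurry"):
        msg.route = ["deblur"] + msg.route
    return msg


def deblur(msg: Message) -> Message:
    # Ordinary actor: enriches the payload and leaves the route alone.
    msg.payload["blurry"] = False
    msg.payload["deblurred"] = True
    return msg


def upscale(msg: Message) -> Message:
    msg.payload["upscaled"] = True
    return msg


# Hypothetical registry mapping actor names to handlers.
ACTORS: dict[str, Callable[[Message], Message]] = {
    "quality_router": quality_router,
    "deblur": deblur,
    "upscale": upscale,
}


def dispatch(msg: Message) -> tuple[Message, list[str]]:
    """Deliver the message to each actor on its route until the route is empty."""
    visited = []
    while msg.route:
        name, *rest = msg.route
        msg.route = rest                  # the actor sees only the steps after itself
        msg = ACTORS[name](msg)
        visited.append(name)
    return msg, visited
```

Sending Message(route=["quality_router", "upscale"], payload={"blurry": True}) through dispatch visits quality_router, deblur, and upscale in that order, because the router inserts the extra step. A sharp image skips deblur entirely.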

In practice

Implement asynchronous processing for AI inference.
Utilize Kubernetes for actor deployment and autoscaling.
Employ a stateful HTTP gateway for sync/async integration.
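
The stateful gateway in the last point can be sketched with asyncio. This is a hedged illustration of the general pattern, not Asia's gateway: the Gateway class, its handle_request and on_result methods, and the demo worker are hypothetical names. The state in question is the table of pending requests, which lets a synchronous caller wait while the asynchronous pipeline does the work.

```python
import asyncio
import uuid


class Gateway:
    """Holds one pending future per in-flight request (the gateway's state)."""

    def __init__(self, submit):
        self._submit = submit             # coroutine function that enqueues a message
        self._pending: dict[str, asyncio.Future] = {}

    async def handle_request(self, payload: dict, timeout: float = 5.0) -> dict:
        request_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        try:
            await self._submit({"id": request_id, "payload": payload})
            # The caller blocks here; the pipeline itself stays asynchronous.
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(request_id, None)

    def on_result(self, request_id: str, result: dict) -> None:
        # Called when the final actor reports back for this request.
        future = self._pending.get(request_id)
        if future is not None and not future.done():
            future.set_result(result)


async def demo() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    gateway = Gateway(queue.put)

    async def worker() -> None:
        # Stand-in for the actor pipeline: take one message, enrich it, reply.
        msg = await queue.get()
        await asyncio.sleep(0.01)
        gateway.on_result(msg["id"], {**msg["payload"], "enhanced": True})

    task = asyncio.create_task(worker())
    result = await gateway.handle_request({"image": "pizza.jpg"})
    await task
    return result
```

Running asyncio.run(demo()) returns the enriched payload to the waiting caller. In a real HTTP service, handle_request would sit behind a request handler and on_result behind a callback or result queue consumer.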

Topics

Asynchronous Actors
Distributed AI Systems
MLOps
GPU Optimization
Kubernetes Autoscaling

Best for: AI Architect, Machine Learning Engineer, MLOps Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by MLOps.community.