M*: A Modular, Extensible, Serving System for Multimodal Models
Summary
M* is a serving system for deploying composite multimodal AI models, which combine components such as vision encoders, language backbones, and audio codecs. Existing frameworks are built on narrow assumptions about model structure. M* instead represents each model as a dataflow graph and processes each request as a traversal over that graph. This modular abstraction, called the Walk Graph, supports arbitrary component composition, flexible placement on physical clusters, and model-agnostic optimizations. In the reported benchmarks, M* achieves on average 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, delivers up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech on Qwen3-Omni, and outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x.
Key takeaway
For MLOps engineers deploying composite multimodal models, M* addresses the limitations of serving frameworks designed for traditional LLMs. Its dataflow graph abstraction and flexible runtime yield lower latency and higher throughput in the reported benchmarks, across tasks such as text-to-image generation, text-to-speech, and robotic planning. It is worth evaluating when a model combines several heterogeneous components and inference efficiency matters.
Key insights
M* serves composite multimodal models efficiently by abstracting them as dataflow graphs with request-specific traversals.
Principles
- Every multimodal model is a dataflow graph of heterogeneous components.
- Inference for composite models no longer reduces to a single autoregressive loop.
- Decoupling model architecture from runtime enables flexible placement and optimizations.
Method
M* defines models as computation graphs with named "Walks" using Sequential, Parallel, Loop, and DynamicLoop primitives. It employs streaming edges with ChunkPolicies and a distributed runtime for execution.
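The idea of a request as a traversal can be illustrated with a toy interpreter. This is a minimal sketch, not M*'s actual API: only the names Sequential, Parallel, Loop, and DynamicLoop come from the summary above, while `Node`, `run`, `WALKS`, the field names, and the semantics of each primitive are assumptions made for illustration. Parallel branches run one after another here, whereas a real runtime would presumably execute them concurrently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Union

State = Dict[str, Any]


@dataclass
class Node:
    """A single model component (hypothetical wrapper)."""
    name: str
    fn: Callable[[State], State]


@dataclass
class Sequential:
    steps: List["Step"]


@dataclass
class Parallel:
    branches: List["Step"]


@dataclass
class Loop:
    body: "Step"
    count: int  # fixed number of iterations (assumed semantics)


@dataclass
class DynamicLoop:
    body: "Step"
    should_continue: Callable[[State], bool]  # data-dependent stop condition
    max_iters: int = 1000  # safety bound, an assumption of this sketch


Step = Union[Node, Sequential, Parallel, Loop, DynamicLoop]


def run(step: Step, state: State) -> State:
    """Traverse a walk and return the resulting request state."""
    if isinstance(step, Node):
        return step.fn(state)
    if isinstance(step, Sequential):
        for child in step.steps:
            state = run(child, state)
        return state
    if isinstance(step, Parallel):
        merged = dict(state)
        for branch in step.branches:
            # Each branch sees the same input; outputs are merged by key.
            merged.update(run(branch, dict(state)))
        return merged
    if isinstance(step, Loop):
        for _ in range(step.count):
            state = run(step.body, state)
        return state
    if isinstance(step, DynamicLoop):
        iters = 0
        while iters < step.max_iters and step.should_continue(state):
            state = run(step.body, state)
            iters += 1
        return state
    raise TypeError(f"unknown step type: {type(step).__name__}")


# Toy components standing in for real encoders and decoders.
encode_text = Node("encode_text", lambda s: {**s, "text_emb": len(s["prompt"])})
encode_image = Node("encode_image", lambda s: {**s, "img_emb": 3})
decode_token = Node(
    "decode_token", lambda s: {**s, "tokens": s["tokens"] + [len(s["tokens"])]}
)
refine = Node("refine", lambda s: {**s, "refine_passes": s.get("refine_passes", 0) + 1})
vocoder = Node("vocoder", lambda s: {**s, "audio_frames": len(s["tokens"]) * 2})

# Named walks over the same set of components (walk names are hypothetical).
WALKS: Dict[str, Step] = {
    "speech": Sequential([
        Parallel([encode_text, encode_image]),
        DynamicLoop(decode_token, lambda s: len(s["tokens"]) < 4),
        Loop(refine, count=2),
        vocoder,
    ]),
    "encode_only": Parallel([encode_text, encode_image]),
}

result = run(WALKS["speech"], {"prompt": "hello", "tokens": []})
```

Keeping the walk as plain data, separate from `run`, mirrors the principle that model architecture is decoupled from the runtime: the same components can be reused by several named walks without changing the executor.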
In practice
- Define complex multimodal model architectures using M*'s Walk Graph primitives.
- Specify flexible GPU placements for components to maximize hardware utilization.
- Implement streaming data transfer between components using customizable ChunkPolicies.
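The placement and streaming steps above can be sketched as follows. This is an illustrative assumption, not the system's real interface: the summary names ChunkPolicies but does not describe them, so `FixedSizeChunkPolicy`, its `chunks` method, the `PLACEMENT` mapping, and the component names are all hypothetical. The sketch shows one plausible policy, which buffers a producer's output and forwards it downstream in fixed-size chunks.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class FixedSizeChunkPolicy:
    """Hypothetical policy: forward items downstream in groups of chunk_size."""
    chunk_size: int

    def chunks(self, items: Iterable[int]) -> Iterator[List[int]]:
        buf: List[int] = []
        for item in items:
            buf.append(item)
            if len(buf) == self.chunk_size:
                yield buf
                buf = []
        if buf:
            # Flush the final partial chunk so no items are dropped.
            yield buf


# Hypothetical placement: component name -> GPU indices it runs on.
PLACEMENT: Dict[str, Tuple[int, ...]] = {
    "vision_encoder": (0,),
    "language_backbone": (1, 2),
    "audio_codec": (3,),
}


def gpus_in_use(placement: Dict[str, Tuple[int, ...]]) -> List[int]:
    """Return the sorted set of GPU indices referenced by a placement."""
    return sorted({gpu for gpus in placement.values() for gpu in gpus})


def token_stream(n: int) -> Iterator[int]:
    """Stand-in for a component that emits tokens one at a time."""
    for i in range(n):
        yield i


policy = FixedSizeChunkPolicy(chunk_size=4)
streamed = list(policy.chunks(token_stream(10)))
```

A smaller chunk size lets the downstream component start earlier at the cost of more transfers, while a larger one amortizes transfer overhead but delays the first output; that trade-off is the reason a policy of this kind would be made configurable per edge.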
Topics
- Multimodal Models
- Model Serving Systems
- Dataflow Graphs
- LLM Inference
- GPU Optimization
- Real-time AI
Best for: AI Architect, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.