M*: A Modular, Extensible, Serving System for Multimodal Models
Summary
M* is a universal serving system for the efficient deployment of composite AI models, built to address the limitations of existing frameworks that make narrow assumptions about model structure. Composite models integrate diverse components such as vision encoders, language backbones, and diffusion heads, and they underpin unified multimodal models, speech-language models, and robotic planning policies. M* represents a model as a dataflow graph and processes each request as a traversal over that graph. Its core innovation is a modular abstraction called the Walk Graph, which enables arbitrary composition of model components, flexible placement on physical clusters, and model-agnostic optimizations. In the reported benchmarks, M* achieves 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech on Qwen3-Omni, and up to 12.5x better performance than V-JEPA 2-AC for robotic planning.
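The benchmark figures mix two kinds of comparison: a percentage reduction (20% lower latency) and multiplicative ratios (2.9x lower real-time factor). The sketch below shows how each is computed, using the standard definition of real-time factor as processing time divided by the duration of the generated audio. The function names and every number in it are hypothetical placeholders chosen for illustration; none are measurements from the M* evaluation.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def ratio_improvement(baseline: float, candidate: float) -> float:
    """How many times smaller the candidate value is, e.g. 2.9 means '2.9x lower'."""
    return baseline / candidate


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction versus the baseline, e.g. 0.2 means '20% lower'."""
    return (baseline - candidate) / baseline


# Hypothetical numbers, NOT taken from the article:
# both systems synthesize the same 10-second audio clip.
baseline_rtf = real_time_factor(processing_seconds=2.9, audio_seconds=10.0)
candidate_rtf = real_time_factor(processing_seconds=1.0, audio_seconds=10.0)
rtf_ratio = ratio_improvement(baseline_rtf, candidate_rtf)  # roughly 2.9

# Hypothetical end-to-end latencies in seconds for one text-to-image request.
latency_cut = relative_reduction(baseline=10.0, candidate=8.0)  # 0.2, i.e. 20% lower
```

Note that a 20% reduction corresponds to a ratio of only 1.25x, so the percentage and "x" figures in the summary are not directly comparable.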
Key takeaway
For AI architects designing infrastructure for composite multimodal models, M* reports a significant performance advantage over existing serving frameworks. Evaluate M* if you need lower end-to-end latency for text-to-image tasks, higher throughput for text-to-speech, or better robotic planning performance, and want to streamline deployment of complex architectures.
Key insights
M* efficiently serves diverse composite multimodal AI models using a modular, graph-based system for flexible component composition and optimization.
Principles
- Compose model components arbitrarily.
- Place components flexibly on physical clusters.
- Keep optimizations model-agnostic.
Method
M* represents composite models as dataflow graphs, processing requests via graph traversals. Its Walk Graph abstraction enables arbitrary component composition, flexible cluster placement, and model-agnostic optimizations.
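To make the idea concrete, here is a minimal sketch of a composite model expressed as a dataflow graph, with a request served by walking only the nodes that the requested output depends on. It is an illustration of the general technique, not M*'s Walk Graph API: the `Node` and `ComponentGraph` classes, the `device` placement label, and the toy string-based components are all assumptions made for this example, and the graph is assumed to be acyclic.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Node:
    """One model component, e.g. an encoder, a backbone, or an output head."""
    name: str
    fn: Callable[[Dict[str, Any]], Any]  # receives upstream outputs keyed by name
    inputs: List[str] = field(default_factory=list)
    device: str = "cpu"  # hypothetical placement label, unused by this toy walker


class ComponentGraph:
    """A toy dataflow graph of components (assumed to be a DAG)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, node: Node) -> None:
        self.nodes[node.name] = node

    def walk(self, target: str, request: Any) -> Tuple[Any, List[str]]:
        """Serve one request by visiting only the nodes `target` depends on."""
        cache: Dict[str, Any] = {"request": request}  # raw input is a pseudo-node
        visited: List[str] = []

        def visit(name: str) -> Any:
            if name in cache:  # each component runs at most once per request
                return cache[name]
            node = self.nodes[name]
            upstream = {dep: visit(dep) for dep in node.inputs}
            cache[name] = node.fn(upstream)
            visited.append(name)
            return cache[name]

        return visit(target), visited


# Stand-in components: real ones would be neural networks, not string functions.
graph = ComponentGraph()
graph.add(Node("text_encoder", lambda up: up["request"].split(), ["request"]))
graph.add(Node("backbone", lambda up: [t.upper() for t in up["text_encoder"]],
               ["text_encoder"], device="gpu:0"))
graph.add(Node("image_head", lambda up: "image(" + " ".join(up["backbone"]) + ")",
               ["backbone"], device="gpu:1"))
graph.add(Node("speech_head", lambda up: "audio(" + " ".join(up["backbone"]) + ")",
               ["backbone"], device="gpu:2"))

# A text-to-image request touches the encoder, backbone, and image head only.
output, path = graph.walk("image_head", "a red cube")
```

Because the same graph holds both output heads, a text-to-speech request would reuse the encoder and backbone definitions and simply walk to `speech_head` instead, which is the kind of component reuse that a graph representation makes straightforward.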
In practice
- Deploy M* for text-to-image workloads (benchmarked on BAGEL).
- Serve text-to-speech workloads (benchmarked on Qwen3-Omni).
- Enhance robotic planning performance (compared against V-JEPA 2-AC).
Topics
- Multimodal Models
- Model Serving Systems
- Composite AI Architectures
- Dataflow Graphs
- Performance Optimization
- Robotic Planning
Best for: MLOps Engineer, NLP Engineer, Computer Vision Engineer, AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.