M*: A Modular, Extensible, Serving System for Multimodal Models

2026-03-20 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

M* is a serving system for composite multimodal AI models, which combine heterogeneous components such as vision encoders, language backbones, and audio codecs. Where existing frameworks assume inference reduces to a single autoregressive loop, M* represents each model as a dataflow graph and processes each request as a traversal over that graph. This abstraction, called the Walk Graph, supports arbitrary component composition, flexible placement on physical clusters, and model-agnostic optimizations. In the reported benchmarks, M* achieves 20% lower end-to-end latency on average than vLLM-Omni for text-to-image workloads on BAGEL, delivers up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech on Qwen3-Omni, and outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x.

Key takeaway

For MLOps engineers deploying composite multimodal models, M* addresses a gap left by traditional LLM serving frameworks, which are built around a single autoregressive loop. Its dataflow graph abstraction and flexible runtime yield lower latency and higher throughput in the reported benchmarks, across tasks including text-to-image generation, text-to-speech, and robotic planning. Consider M* when a model's components need independent placement, streaming between stages, or request-specific execution paths.

Key insights

M* serves composite multimodal models efficiently by abstracting them as dataflow graphs with request-specific traversals.

Principles

Every multimodal model is a dataflow graph of heterogeneous components.
Inference for composite models no longer reduces to a single autoregressive loop.
Decoupling model architecture from runtime enables flexible placement and optimizations.

Method

M* defines each model as a computation graph and each request type as a named "Walk" over that graph, composed from Sequential, Parallel, Loop, and DynamicLoop primitives. Components exchange data over streaming edges governed by ChunkPolicies, and a distributed runtime executes the walks.
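To make the four primitives concrete, here is a minimal sketch of how walks could be composed and executed. It is not M*'s API, which the summary does not show. Only the primitive names (Sequential, Parallel, Loop, DynamicLoop) come from the source; the `Component` class, the `run` interpreter, the field names, and the toy lambdas are all hypothetical, and the semantics assigned to each primitive are an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Union


# Hypothetical stand-ins: primitive names follow the source, everything else is assumed.
@dataclass
class Component:
    name: str
    fn: Callable[[Any], Any]  # stand-in for a model component (encoder, decoder, ...)


@dataclass
class Sequential:
    steps: List["Node"]  # run one after another, piping output to input


@dataclass
class Parallel:
    branches: List["Node"]  # every branch receives the same input


@dataclass
class Loop:
    body: "Node"
    times: int  # iteration count known up front (e.g. fixed denoising steps)


@dataclass
class DynamicLoop:
    body: "Node"
    should_stop: Callable[[Any], bool]  # data-dependent exit (e.g. end-of-sequence)
    max_iters: int = 1000  # safety bound for this sketch


Node = Union[Component, Sequential, Parallel, Loop, DynamicLoop]


def run(node: Node, x: Any) -> Any:
    """Single-process reference interpreter; a real runtime would distribute this."""
    if isinstance(node, Component):
        return node.fn(x)
    if isinstance(node, Sequential):
        for step in node.steps:
            x = run(step, x)
        return x
    if isinstance(node, Parallel):
        return [run(branch, x) for branch in node.branches]
    if isinstance(node, Loop):
        for _ in range(node.times):
            x = run(node.body, x)
        return x
    if isinstance(node, DynamicLoop):
        for _ in range(node.max_iters):
            x = run(node.body, x)
            if node.should_stop(x):
                break
        return x
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Two named walks over toy components.
walks = {
    # Fixed-length loop: encode, iterate a set number of steps, decode.
    "text_to_image": Sequential([
        Component("text_encoder", lambda prompt: len(prompt)),
        Loop(Component("denoise_step", lambda v: v + 1), times=4),
        Component("image_decoder", lambda v: f"image<{v}>"),
    ]),
    # Data-dependent loop: keep generating until a stop token (here, 3) appears.
    "text_generation": DynamicLoop(
        Component("lm_step", lambda toks: toks + [len(toks)]),
        should_stop=lambda toks: toks[-1] >= 3,
    ),
}
```

The split between `Loop` and `DynamicLoop` mirrors the difference between a fixed number of diffusion steps and autoregressive decoding that ends on a stop condition, which is one reason a single autoregressive loop is not enough for composite models.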

In practice

Define complex multimodal model architectures using M*'s Walk Graph primitives.
Specify flexible GPU placements for components to maximize hardware utilization.
Implement streaming data transfer between components using customizable ChunkPolicies.
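The placement and streaming steps above can be sketched as follows. This is an illustration under stated assumptions, not M*'s interface: `FixedSizeChunkPolicy`, `stream_edge`, the `placement` dictionary, and the component names `speech_lm` and `audio_codec` are all hypothetical. The idea shown is that a chunk policy decides how many upstream items accumulate before the downstream component is allowed to start work.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class FixedSizeChunkPolicy:
    """Hypothetical policy: forward items downstream in groups of `size`."""
    size: int
    flush_remainder: bool = True  # emit a final short chunk when the stream ends


def stream_edge(items: Iterable[int], policy: FixedSizeChunkPolicy) -> Iterator[List[int]]:
    """Buffer upstream items and yield them as chunks according to the policy."""
    buffer: List[int] = []
    for item in items:
        buffer.append(item)
        if len(buffer) == policy.size:
            yield buffer
            buffer = []
    if buffer and policy.flush_remainder:
        yield buffer


# Hypothetical placement: component name -> GPU indices it runs on.
placement: Dict[str, List[int]] = {
    "speech_lm": [0, 1],   # larger component spread over two devices
    "audio_codec": [2],    # lightweight decoder on its own device
}


def gpus_in_use(plan: Dict[str, List[int]]) -> List[int]:
    """Return the sorted set of GPU indices referenced by a placement plan."""
    return sorted({gpu for gpus in plan.values() for gpu in gpus})


def speech_lm(num_tokens: int) -> Iterator[int]:
    """Toy producer: emits codec token ids one at a time."""
    yield from range(num_tokens)


def audio_codec(chunks: Iterable[List[int]]) -> Iterator[int]:
    """Toy consumer: 'decodes' each chunk as soon as it arrives."""
    for chunk in chunks:
        yield sum(chunk)  # stand-in for turning a token chunk into audio samples


# The consumer starts after 3 tokens instead of waiting for the whole sequence.
decoded = list(audio_codec(stream_edge(speech_lm(7), FixedSizeChunkPolicy(size=3))))
```

A smaller chunk size lets the downstream component start sooner at the cost of more, smaller transfers; a larger one amortizes transfer overhead but delays the first output. Which trade-off M* makes by default is not stated in the summary.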

Topics

Multimodal Models
Model Serving Systems
Dataflow Graphs
LLM Inference
GPU Optimization
Real-time AI

Best for: AI Architect, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.