Decentralized Autoregressive Generation

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

A theoretical analysis introduces Decentralized Autoregressive Generation and defines the Decentralized Discrete Flow Matching objective, in which the probability-generating velocity is a linear combination of expert flows. The work extends Discrete Flow Matching to the discrete time domain and shows that autoregressive generation is a special, degenerate case. Experiments validate the equivalence between decentralized and centralized training for multimodal large language models (MLLMs) using LLaVA-1.5-7B and InternVL 2.5-1B. Expert ensembles reach near-parity with compute-matched dense baselines, though results vary by benchmark. For LLaVA, VQAv2 rose by 1.49 while MME fell by 33.32. For InternVL, visual grounding improved (RefCOCO up 6 to 8 points) but MME fell by 55.92. Ablation studies examine the effect of expert count, vision encoder choice, and clustering algorithm on performance.

Key takeaway

For AI architects and machine learning engineers facing MLLM training scalability or infrastructure limits, decentralized autoregressive generation offers a viable alternative to centralized training. In the LLaVA and InternVL experiments, expert ensembles came close to compute-matched dense baselines overall, with gains on VQAv2 and RefCOCO but notable drops on MME, while independent expert training mitigates single-node failure risks. Consider expert-based training with feature-based data partitioning to reduce communication overhead and make collaborative model development more accessible.

Key insights

Decentralized autoregressive generation with expert flows reaches near-parity with centralized MLLM training, with gains on some benchmarks (VQAv2, RefCOCO) and losses on others (MME).

Principles

Decentralized training can come close to matching centralized MLLM performance.
Expert specialization can enhance specific task performance.
Data partitioning impacts expert training load and outcomes.

Method

The approach extends Discrete Flow Matching to discrete time and defines the probability-generating velocity as a linear combination of expert flows. It partitions image-text datasets using spherical balanced k-means on CLIP features, then trains one expert independently on each partition. At inference, a top-k routing strategy based on cosine similarity selects which experts to use.
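
As a rough illustration of the partitioning step, here is a minimal sketch of a balanced spherical k-means in Python with NumPy. It is not the paper's implementation: the greedy capacity-capped assignment, the function name balanced_spherical_kmeans, and all parameters are assumptions made for this example, and the input is any array of nonzero feature vectors standing in for CLIP image embeddings.

```python
import numpy as np

def balanced_spherical_kmeans(features, n_clusters, n_iters=10, seed=0):
    """Greedy, capacity-capped spherical k-means (illustrative sketch only).

    features: (n, d) array of nonzero vectors, e.g. precomputed image embeddings.
    Returns (labels, centroids); every cluster holds at most ceil(n / n_clusters) points.
    """
    rng = np.random.default_rng(seed)
    # Spherical k-means works on unit vectors, so similarity is cosine similarity.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    n = x.shape[0]
    capacity = int(np.ceil(n / n_clusters))
    centroids = x[rng.choice(n, size=n_clusters, replace=False)]
    labels = np.full(n, -1)
    for _ in range(n_iters):
        sims = x @ centroids.T  # (n, n_clusters) cosine similarities
        labels = np.full(n, -1)
        counts = np.zeros(n_clusters, dtype=int)
        # Visit (point, cluster) pairs from most to least similar and assign
        # each point to the best cluster that still has room.
        for flat in np.argsort(-sims, axis=None):
            i, c = divmod(int(flat), n_clusters)
            if labels[i] == -1 and counts[c] < capacity:
                labels[i] = c
                counts[c] += 1
        # Recompute each centroid as the renormalized sum of its members.
        for c in range(n_clusters):
            total = x[labels == c].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                centroids[c] = total / norm
    return labels, centroids
```

Each resulting label identifies the disjoint shard on which one expert would be trained. The greedy pass sorts all point-cluster pairs, which is fine for a demonstration but would need a more scalable assignment step for large datasets.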

In practice

Partition MLLM training data into disjoint clusters.
Use pretrained vision encoders for image feature clustering.
Employ top-k routing for efficient inference with experts.
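
The routing step in the list above can be sketched as follows. This is a hedged example rather than the authors' code: it assumes each expert is represented by a centroid in the same feature space as the input, and that the selected experts' next-token distributions are mixed with softmax weights over their cosine similarities. The names top_k_route and mix_expert_distributions, the temperature parameter, and the softmax weighting are all assumptions introduced here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def top_k_route(feature, centroids, k=2, temperature=1.0):
    """Pick the k experts whose centroids are most cosine-similar to the input.

    feature: (d,) input embedding; centroids: (n_experts, d).
    Returns (expert indices, mixing weights that sum to 1).
    """
    sims = l2_normalize(centroids) @ l2_normalize(feature)
    top = np.argsort(-sims)[:k]
    # Assumption: softmax over the selected similarities gives the weights.
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return top, weights

def mix_expert_distributions(expert_probs, indices, weights):
    """Weighted (convex) combination of the selected experts' token distributions.

    expert_probs: (n_experts, vocab_size), each row a probability distribution.
    """
    return np.einsum("e,ev->v", weights, expert_probs[indices])
```

Because the weights are nonnegative and sum to one, the mixed output is itself a valid probability distribution. Setting k to 1 reduces the mixture to the single nearest expert.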

Topics

Decentralized AI Training
Autoregressive Generation
Multimodal Large Language Models
Discrete Flow Matching
Expert Models
Vision Language Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.