Decentralized Autoregressive Generation
Summary
A theoretical analysis introduces Decentralized Autoregressive Generation and defines the Decentralized Discrete Flow Matching objective, in which the probability-generating velocity is a linear combination of expert flows. The work extends Discrete Flow Matching to the discrete time domain and shows that autoregressive generation is a special, degenerate case. Experiments test the equivalence between decentralized and centralized training for multimodal large language models (MLLMs) using LLaVA-1.5-7B and InternVL 2.5-1B. Expert ensembles reach near-parity with compute-matched dense baselines, with mixed results across benchmarks. For LLaVA, VQAv2 improved by 1.49 points while MME dropped by 33.32. For InternVL, visual grounding improved (RefCOCO +6 to +8 points) but MME dropped by 55.92. Ablation studies examine the impact of expert count, vision encoder choice, and clustering algorithm on performance.
Key takeaway
For AI architects and machine learning engineers facing MLLM training scalability or infrastructure limits, decentralized autoregressive generation is a viable alternative to centralized training. In the LLaVA and InternVL experiments, expert ensembles came close to compute-matched dense baselines, with gains on some benchmarks (VQAv2, RefCOCO) and losses on others (MME), while mitigating single-node failure risks. Consider expert-based training with careful data partitioning to reduce communication overhead and enable more accessible, collaborative model development.
Key insights
Decentralized autoregressive generation, built on expert flows, reaches near-parity with centralized MLLM training, although the gap varies by benchmark.
Principles
- Decentralized training can match centralized MLLM performance.
- Expert specialization can enhance specific task performance.
- Data partitioning impacts expert training load and outcomes.
Method
The approach extends Discrete Flow Matching to discrete time and defines the probability-generating velocity as a linear combination of expert flows. It partitions image-text datasets using spherical balanced k-means on CLIP features and trains the experts independently. At inference, a top-k routing strategy selects experts by cosine similarity.
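To make the routing step concrete, here is a minimal NumPy sketch of top-k routing by cosine similarity, followed by a linear mix of the selected experts' next-token distributions. The function names, the softmax weighting, and the `temperature` parameter are assumptions made for illustration; the summary does not specify how the paper weights the selected experts.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def route_top_k(query, centroids, k=2, temperature=0.1):
    """Pick the k experts whose centroids are most similar to `query`.

    Returns (expert indices, mixing weights). The softmax weighting is an
    assumption of this sketch, not something stated in the source.
    """
    q = l2_normalize(np.asarray(query, dtype=float))
    c = l2_normalize(np.asarray(centroids, dtype=float))
    sims = c @ q                          # cosine similarity per expert
    top = np.argsort(-sims)[:k]           # indices of the k best experts
    logits = sims[top] / temperature
    logits = logits - logits.max()        # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()     # nonnegative, sums to 1
    return top, weights


def mix_expert_distributions(expert_probs, top, weights):
    """Linear combination of the selected experts' next-token distributions.

    expert_probs has shape (num_experts, vocab_size); each row sums to 1.
    """
    return weights @ expert_probs[top]


# Toy example: three experts in a 2-D feature space, vocabulary of four tokens.
centroids = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
expert_probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
])
top, weights = route_top_k([0.9, 0.1], centroids, k=2)
mixed = mix_expert_distributions(expert_probs, top, weights)
```

Because the weights are nonnegative and sum to one, the mixed output is itself a valid probability distribution over tokens, which mirrors the linear-combination structure described above.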
In practice
- Partition MLLM training data into disjoint clusters.
- Use pretrained vision encoders for image feature clustering.
- Employ top-k routing for efficient inference with experts.
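The partitioning step in the list above can be sketched as a balanced spherical k-means over image embeddings. This is a simplified illustration, not the paper's algorithm: the greedy capacity-limited assignment, the function name, and the random stand-in features (used here in place of real CLIP embeddings) are all assumptions.

```python
import numpy as np


def balanced_spherical_kmeans(features, num_clusters, num_iters=10, seed=0):
    """Partition rows of `features` into disjoint, size-capped clusters.

    Spherical: vectors and centroids are unit-normalized and compared by
    cosine similarity. Balanced: each cluster holds at most ceil(n / k)
    points, enforced by a greedy assignment (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    x = features / np.maximum(norms, 1e-12)
    n = x.shape[0]
    capacity = -(-n // num_clusters)  # ceil(n / num_clusters)
    centroids = x[rng.choice(n, size=num_clusters, replace=False)].copy()
    labels = np.full(n, -1)

    for _ in range(num_iters):
        sims = x @ centroids.T                 # (n, k) cosine similarities
        order = np.argsort(-sims, axis=None)   # flattened pairs, best first
        labels = np.full(n, -1)
        counts = np.zeros(num_clusters, dtype=int)
        for flat in order:
            i, c = divmod(int(flat), num_clusters)
            # Take the best remaining (point, cluster) pair that has room.
            if labels[i] == -1 and counts[c] < capacity:
                labels[i] = c
                counts[c] += 1
        for c in range(num_clusters):
            members = x[labels == c]
            if len(members):
                mean = members.mean(axis=0)
                centroids[c] = mean / max(np.linalg.norm(mean), 1e-12)
    return labels, centroids


# Random vectors stand in for image embeddings from a pretrained encoder.
features = np.random.default_rng(1).normal(size=(12, 8))
labels, centroids = balanced_spherical_kmeans(features, num_clusters=3)
shards = [np.flatnonzero(labels == c) for c in range(3)]  # one shard per expert
```

Each shard holds the sample indices that one expert would train on, and the returned centroids can later serve as routing keys at inference. The greedy assignment sorts all point-cluster pairs on every iteration, so it is suitable for illustration rather than for large datasets.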
Topics
- Decentralized AI Training
- Autoregressive Generation
- Multimodal Large Language Models
- Discrete Flow Matching
- Expert Models
- Vision Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.