Xiaomi just open-sourced a 1T-parameter model and almost nobody noticed

2026-04-29 · Source: AIModels.fyi - Aimodels.substack.com · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

Xiaomi has released MiMo-V2.5-Pro, a Mixture-of-Experts (MoE) large language model, under an MIT license. The model has 1.02 trillion total parameters, of which 42 billion are active per token, and scores 54 on the Artificial Analysis Intelligence Index, which places it in frontier territory. User-run benchmarks posted on r/LocalLLaMA suggest it outperforms Opus 4.6 in coding reasoning, agentic work, and decision-making. Its architecture uses hybrid attention: across 70 layers, sliding-window attention (128-token window) and global attention are combined in a 6:1 ratio, which reduces KV-cache storage by approximately 7x and enables a 1M-token context window. The model also integrates three natively trained multi-token prediction (MTP) modules, which Xiaomi reports triple inference output speed.
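
The roughly 7x KV-cache figure can be sanity-checked by counting cached token positions. The sketch below is an illustration, not Xiaomi's code. It assumes the 70 layers split evenly into 60 sliding-window and 10 global layers, that every layer stores the same number of bytes per cached token, and that a 1M context means 1,000,000 tokens. The function name and its parameters are hypothetical.

```python
def kv_cache_slots(num_layers, context_len, window, swa_per_global):
    """Count cached token positions, summed over layers, for full vs. hybrid attention.

    Assumption: layers come in groups of (swa_per_global + 1) with one global
    layer per group, and every layer costs the same bytes per cached token.
    """
    num_global = num_layers // (swa_per_global + 1)
    num_swa = num_layers - num_global
    # Full attention: every layer caches every token in the context.
    full = num_layers * context_len
    # Hybrid: global layers cache everything, sliding-window layers only the window.
    hybrid = num_global * context_len + num_swa * min(window, context_len)
    return full, hybrid


full, hybrid = kv_cache_slots(
    num_layers=70, context_len=1_000_000, window=128, swa_per_global=6
)
reduction = full / hybrid
print(f"full={full:,} hybrid={hybrid:,} reduction={reduction:.2f}x")
```

Under these assumptions the reduction comes out just under 7x at 1M tokens, consistent with the figure above. The ratio shrinks toward 1x as the context length approaches the window size, so the saving matters most for long contexts.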

Key takeaway

For AI architects evaluating large language models for long-context or high-throughput applications, MiMo-V2.5-Pro warrants close examination. Its hybrid attention reduces KV-cache storage by roughly 7x, which supports context windows of up to 1M tokens, and Xiaomi reports that its multi-token prediction modules triple inference output speed. Both could lower operating costs and improve responsiveness in your deployments. Consider benchmarking MiMo-V2.5-Pro against your existing solutions on specific coding or agentic tasks.

Key insights

Xiaomi's MiMo-V2.5-Pro MoE model pairs frontier-level benchmark scores with efficiency gains from hybrid attention and multi-token prediction.

Principles

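Hybrid attention shrinks the KV cache: sliding-window layers keep only their most recent tokens, so only the global layers store the full context.
Multi-token prediction raises inference speed by proposing more than one token per decoding step.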

Method

MiMo-V2.5-Pro uses a 6:1 ratio of sliding-window to global attention layers (128-token window, 70 layers) to support long context, and integrates three natively trained multi-token prediction modules for faster inference.
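
A minimal sketch of what a 6:1 hybrid layout could look like, for illustration only. The source does not state how the layers are ordered, so the pattern here (six sliding-window layers followed by one global layer, repeated) is an assumption, as is the convention that a 128-token window includes the current token. `layer_schedule` and `can_attend` are hypothetical helper names.

```python
def layer_schedule(num_layers=70, swa_per_global=6):
    """Label each layer 'swa' or 'global'.

    Assumption: every group of (swa_per_global + 1) layers ends with one
    global layer. The real ordering in MiMo-V2.5-Pro is not given in the source.
    """
    group = swa_per_global + 1
    return ["global" if (i + 1) % group == 0 else "swa" for i in range(num_layers)]


def can_attend(query_pos, key_pos, kind, window=128):
    """Causal visibility rule for a single layer of the given kind."""
    if key_pos > query_pos:
        return False  # causal: never look at future tokens
    if kind == "global":
        return True  # global layers see the whole prefix
    # Sliding-window layers see only the last `window` positions,
    # counting the current token (an assumed convention).
    return query_pos - key_pos < window


schedule = layer_schedule()
print(schedule.count("swa"), schedule.count("global"))
print(can_attend(200, 73, "swa"), can_attend(200, 72, "swa"), can_attend(200, 0, "global"))
```

Because a sliding-window layer never looks further back than its window, it can drop older keys and values from its cache, which is where the storage saving comes from.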

In practice

Use hybrid attention for long context windows.
Integrate MTP for inference speedups.
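
To get a feel for how three MTP modules could approach a 3x speedup, the sketch below estimates tokens emitted per main-model step. It rests on assumptions the source does not confirm: that the MTP modules act as draft heads whose proposals the main model verifies, that drafts are accepted in order until the first rejection, and that each draft is accepted independently with the same probability. The function name is hypothetical and the cost of running the MTP modules is ignored.

```python
def expected_tokens_per_step(num_draft_tokens, accept_prob):
    """Expected tokens emitted per main-model forward pass.

    Assumption: the main model always yields one token, draft tokens are
    accepted in order until the first rejection, and each draft is accepted
    independently with probability accept_prob.
    """
    # The i-th draft is emitted only if all i drafts up to it were accepted,
    # which happens with probability accept_prob ** i; i = 0 is the sure token.
    return sum(accept_prob ** i for i in range(num_draft_tokens + 1))


for p in (0.5, 0.8, 0.9):
    print(p, round(expected_tokens_per_step(3, p), 3))
```

With three draft tokens the ceiling is four tokens per step. Under these simplifying assumptions, averaging three tokens per step needs a per-token acceptance rate of roughly 0.81, so real-world gains will depend on how predictable the workload is.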

Topics

MiMo-V2.5-Pro
Mixture-of-Experts
Hybrid Attention
Multi-token Prediction
Large Language Model Architecture

Best for: NLP Engineer, AI Architect, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AIModels.fyi - Aimodels.substack.com.