JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines
Summary
JetBrains has open-sourced Mellum2, a 12B Mixture-of-Experts (MoE) model designed for fast, specialized tasks within multi-model AI pipelines. Of its 12B total parameters, only 2.5B are active per token, so its per-token compute is roughly that of a 2.5B dense model. This "focal model" philosophy positions Mellum2 for high-frequency, latency-sensitive roles such as routing, summarization, and validation, where it complements larger frontier models rather than replacing them. The architecture includes a Multi-Token Prediction head for speculative decoding and supports a 131,072-token (128K) context window. Training covered ~10.6 trillion tokens across three phases, using the Muon optimizer under FP8 hybrid precision, with the context extended to 128K via layer-selective YaRN. Mellum2 is released under an Apache 2.0 license with six checkpoints, including SFT and RL-tuned variants, and supports vLLM with tool-calling. Benchmarks show strong results on EvalPlus (78.4) and BFCL v3 (66.3) against models up to 14B, which fits its intended role as a specialized pipeline component rather than a general-purpose leader.
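The parameter figures above can be turned into a quick back-of-the-envelope calculation. The sketch below uses two assumptions that are not from the article: the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass (ignoring attention cost), and 2 bytes per parameter for serving weights. Treat the results as rough estimates, not measured numbers.

```python
# Figures stated in the summary.
TOTAL_PARAMS = 12e9      # all experts combined
ACTIVE_PARAMS = 2.5e9    # parameters used per token

# Share of the model that does work on any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# ASSUMPTION: ~2 FLOPs per active parameter per token (forward pass only,
# attention cost ignored). A rule of thumb, not a figure from the article.
flops_per_token_moe = 2 * ACTIVE_PARAMS
flops_per_token_dense_12b = 2 * TOTAL_PARAMS
compute_ratio = flops_per_token_dense_12b / flops_per_token_moe

# ASSUMPTION: 2 bytes per parameter (16-bit weights) at serving time.
# All experts must stay resident, so memory follows the 12B total.
weights_gb = TOTAL_PARAMS * 2 / 1e9

# 128K context expressed in tokens.
context_tokens = 128 * 1024

print(f"active fraction: {active_fraction:.1%}")
print(f"dense-12B vs MoE compute per token: {compute_ratio:.1f}x")
print(f"approx. weight memory: {weights_gb:.0f} GB")
print(f"context window: {context_tokens} tokens")
```

The point of the comparison is that the compute saving applies per token, while the memory footprint still tracks the full parameter count.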
Key takeaway
For AI engineers designing multi-model systems, Mellum2 is an option for improving pipeline efficiency and cost. If latency or resource consumption is a problem for high-frequency tasks such as routing or summarization, consider integrating this 12B MoE model. Its "focal model" design lets you offload specialized work from larger, more expensive frontier models, which can lower inference cost and latency for those steps. The Apache 2.0-licensed checkpoints are available for deployment or fine-tuning.
Key insights
Mellum2 demonstrates a "focal model" approach, optimizing smaller MoE models for specific, fast tasks in AI pipelines.
Principles
- Not all AI pipeline steps need frontier models.
- Specialized MoE models enhance efficiency for specific tasks.
- Speculative decoding can be built into the architecture, here via a Multi-Token Prediction head.
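As a rough illustration of the speculative decoding principle, the toy sketch below has a cheap draft function propose several tokens and a target function verify them, keeping the longest agreeing prefix. It is not Mellum2's implementation: the `target` and `draft` functions are hypothetical stand-ins, verification is done sequentially here rather than in one batched forward pass, and the bonus token that real systems emit after a fully accepted draft is omitted.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: output always matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # The draft proposes up to k tokens, conditioning on its own guesses.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # The target checks each proposed token in order.
        for tok in proposal:
            expected = target(out)
            out.append(expected)   # equals tok when the draft was right
            if expected != tok:    # first mismatch: keep the correction, drop the rest
                break
    return out


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except right after the token 4.
def target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


result = speculative_decode(target, draft, [0], max_new=8)
print(result)
```

Because every emitted token is confirmed or corrected by the target, the draft only affects speed, not the final output.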
Method
Mellum2 was trained with a three-phase curriculum over ~10.6 trillion tokens, using the Muon optimizer under FP8 hybrid precision, followed by post-training with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR).
In practice
- Deploy Mellum2 for routing or summarization.
- Integrate into multi-model AI pipelines.
- Fine-tune for custom specialized tasks.
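The first two items can be sketched as a simple dispatch rule that sends high-frequency, narrow steps to a focal model and everything else to a frontier model. The model labels, the `Step` type, and the task names below are hypothetical placeholders for illustration; they are not identifiers from Mellum2 or any library.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Task types the summary names as good fits for a focal model.
FOCAL_TASKS = {"routing", "summarization", "validation"}


@dataclass
class Step:
    task: str     # kind of work this pipeline step performs
    prompt: str   # input text for the step


def pick_model(step: Step,
               focal: str = "focal-model",          # hypothetical label
               frontier: str = "frontier-model"     # hypothetical label
               ) -> str:
    """Send narrow, latency-sensitive steps to the focal model."""
    return focal if step.task in FOCAL_TASKS else frontier


def plan(pipeline: List[Step]) -> List[Tuple[str, str]]:
    """Return (task, chosen model) for each step, in order."""
    return [(step.task, pick_model(step)) for step in pipeline]


pipeline = [
    Step("routing", "Which tool should handle this request?"),
    Step("planning", "Design a migration strategy for the service."),
    Step("summarization", "Summarize the build log."),
    Step("validation", "Does the output match the expected schema?"),
]

for task, model in plan(pipeline):
    print(f"{task:>14} -> {model}")
```

In a real system the dispatch rule would likely also consider input length, latency budgets, and fallback to the larger model when the smaller one fails a validation check.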
Topics
- Mixture-of-Experts
- AI Pipelines
- Specialized Models
- Model Optimization
- Open-Source AI
- vLLM
Best for: AI Architect, MLOps Engineer, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.