JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

2026-06-02 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, quick

Summary

JetBrains has open-sourced Mellum2, a 12B Mixture-of-Experts (MoE) model designed for fast, specialized tasks within multi-model AI pipelines. While it has 12B total parameters, only 2.5B are active per token, making its per-token compute equivalent to a 2.5B dense model. This "focal model" philosophy positions Mellum2 for high-frequency, latency-sensitive roles like routing, summarization, and validation, complementing larger frontier models. Its architecture includes a Multi-Token Prediction head for speculative decoding and supports a 131,072 token context window. Training involved ~10.6 trillion tokens across three phases, utilizing the Muon optimizer under FP8 hybrid precision, with context extended to 128K via layer-selective YaRN. Released under an Apache 2.0 license, Mellum2 offers six checkpoints, including SFT and RL-tuned variants, and supports vLLM with tool-calling. Benchmarks show strong performance on EvalPlus (78.4) and BFCL v3 (66.3) against models up to 14B, aligning with its specialized component role rather than general-purpose leadership.

Key takeaway

For AI Engineers designing multi-model systems, Mellum2 offers a compelling option to optimize pipeline efficiency and cost. If you are struggling with latency or resource consumption for high-frequency tasks like routing or summarization, consider integrating this 12B MoE model. Its "focal model" design allows you to offload specialized work from larger, more expensive frontier models, improving overall system performance and reducing inference costs. Explore its Apache 2.0 licensed checkpoints for immediate deployment or fine-tuning.

Key insights

Mellum2 demonstrates a "focal model" approach, optimizing smaller MoE models for specific, fast tasks in AI pipelines.

Principles

Not all AI pipeline steps need frontier models.
Specialized MoE models enhance efficiency for specific tasks.
Speculative decoding can be integrated via architecture.

Method

Mellum2's training involved a three-phase curriculum over ~10.6 trillion tokens, using Muon optimizer with FP8 precision, followed by SFT and RLVR post-training.

In practice

Deploy Mellum2 for routing or summarization.
Integrate into multi-model AI pipelines.
Fine-tune for custom specialized tasks.

Topics

Mixture-of-Experts
AI Pipelines
Specialized Models
Model Optimization
Open-Source AI
vLLM

Best for: AI Architect, MLOps Engineer, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Scientist

Related on AIssential

See Counsel's argued verdicts on the open AI decisions leaders are weighing →

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.