2026 Predictions: Much Faster Inference, Pre-Training with RL, and FP4 Everywhere
Summary
The article reviews key developments in large language models (LLMs) during 2025, highlighting DeepSeek's impact and the widespread adoption of FP4 quantization. DeepSeek-V3, released in late 2024, introduced an "efficiently big" sparse Mixture-of-Experts (MoE) architecture with 671B total parameters (685B in the released checkpoint, which also counts the multi-token prediction module) and 37B activated per token, trained on 14.8T tokens, influencing models like GLM and MiniMax. DeepSeek-R1, released in January 2025 under an MIT license, popularized Reinforcement Learning with Verifiable Rewards (RLVR) and GRPO, fostering an ecosystem of "small reasoners" built through distillation and synthetic data. Concurrently, FP4 quantization, particularly NVIDIA's NVFP4, became standard on Blackwell-era hardware, enabling nearly 2x faster inference without significant accuracy loss. OpenAI also released gpt-oss-120b and gpt-oss-20b, models "natively" quantized to MXFP4, emphasizing efficient deployment. In contrast, Meta's Llama 4 was deemed a disappointment due to poor performance and a messy rollout, despite ambitious engineering choices like massive context lengths.
Key takeaway
For MLOps engineers optimizing LLM deployment, prioritize hardware supporting native FP4 quantization like NVIDIA's Blackwell-era GPUs to achieve significant inference speedups. While DeepSeek-R1's RLVR/GRPO approach offers powerful reasoning capabilities, be aware of its current brittleness and infrastructure sensitivity. Your focus should be on leveraging efficient architectures and quantization for cost-effective, high-performance LLM operations, rather than investing in models like Llama 4 that have demonstrated poor practical performance.
Key insights
DeepSeek's MoE and RLVR advancements, alongside FP4 quantization, defined 2025 LLM progress, while Llama 4 disappointed.
Principles
- Sparse MoE enables efficiently large models.
- RLVR/GRPO can teach models to "reason."
- FP4 quantization significantly boosts inference speed.
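To make the first principle concrete, here is a minimal sketch of sparse top-k expert routing in plain Python. It is a generic softmax router written for illustration, not DeepSeek-V3's actual gating scheme; the names `top_k_route` and `moe_forward`, the scalar "experts", and the choice of k are all assumptions.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    m = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = 0.0
    for idx, weight in top_k_route(router_logits, k):
        out += weight * experts[idx](x)  # unselected experts are never called
    return out

# Toy usage: four scalar "experts" (hypothetical), two active per token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 4 * x, lambda x: 0.0]
routes = top_k_route([0.0, 2.0, 2.0, -1.0], k=2)
y = moe_forward(1.0, experts, [0.0, 2.0, 2.0, -1.0], k=2)
```

Only the selected experts do any work for a given token, which is why a model can hold hundreds of billions of parameters while activating only a small fraction of them per token (roughly 5% with the figures quoted in the summary).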
Method
DeepSeek-R1's post-training approach uses Reinforcement Learning with Verifiable Rewards (RLVR), optimized with GRPO, to teach models to "reason." Distillation and synthetic reasoning traces then transfer that behavior to smaller models, with tooling support from Hugging Face TRL and Unsloth.
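The core idea behind GRPO with a verifiable reward can be sketched in a few lines: sample a group of completions for one prompt, score each with a programmatic check, and normalize the scores within the group. The sketch below covers only that scoring step, not the policy update. The exact-match reward on the last line, the use of the population standard deviation, and the function names are assumptions made for illustration, not the DeepSeek-R1 recipe or the TRL API.

```python
from statistics import mean, pstdev

def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the completion's last line equals the reference answer."""
    lines = completion.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy usage: four sampled completions (hypothetical) for one prompt.
completions = ["Let me think.\n42", "41", "Step by step.\n 42 ", "no idea"]
rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so no separate learned value model is needed. When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.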
In practice
- Use NVFP4 on Blackwell-era NVIDIA GPUs for nearly 2x faster LLM inference.
- Explore RLVR/GRPO for reasoning capabilities.
- Consider small models for specific translation tasks.
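For the FP4 item in the list above, a toy round trip shows what block-scaled 4-bit quantization does to a handful of weights. The eight magnitudes are the values representable in the E2M1 format that FP4 schemes use; everything else is a simplification. In particular, the scale is kept as an ordinary Python float here, whereas real formats store it in a compact type, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then snap to the grid."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # simplification: unquantized scale
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))  # nearest grid value
        codes.append(-mag if x < 0 else mag)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from the shared scale and the 4-bit codes."""
    return [scale * c for c in codes]

# Toy usage on a single small block of weights.
scale, codes = quantize_block([0.0, 0.6, -1.2, 3.0])
restored = dequantize_block(scale, codes)
```

Each weight is stored in 4 bits plus a share of one scale per block, so the rounding error stays bounded relative to the largest value in that block. NVFP4 and MXFP4 differ mainly in block size and in how the shared scale is encoded.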
Topics
- DeepSeek Models
- LLM Quantization
- Reinforcement Learning
- Test-time Scaling
- Machine Translation
Best for: MLOps Engineer, NLP Engineer, CTO, Machine Learning Engineer, AI Engineer, AI Product Manager
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.