2026 Predictions: Much Faster Inference, Pre-Training with RL, and FP4 Everywhere

2025-07-07 · Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

The article reviews key developments in large language models (LLMs) during 2025, highlighting DeepSeek's impact and the widespread adoption of FP4 quantization. DeepSeek-V3, released in late 2024, introduced an "efficiently big" sparse Mixture-of-Experts (MoE) architecture with 671B total parameters (685B on Hugging Face once the multi-token prediction module is counted), of which 37B are activated per token. It was trained on 14.8T tokens and influenced models like GLM and MiniMax. DeepSeek-R1, released in January 2025 under an MIT license, popularized Reinforcement Learning with Verifiable Rewards (RLVR) and GRPO, fostering an ecosystem of "small reasoners" built through distillation and synthetic data. Concurrently, FP4 quantization, particularly NVIDIA's NVFP4, became standard on Blackwell-era hardware, enabling nearly 2x faster inference without significant accuracy loss. OpenAI also released gpt-oss-120b and gpt-oss-20b, two models shipped "natively" quantized to MXFP4, emphasizing efficient deployment. In contrast, Meta's Llama 4 was deemed a disappointment because of poor performance and a messy rollout, despite ambitious engineering choices such as massive context lengths.

Key takeaway

For MLOps engineers optimizing LLM deployment: prioritize hardware with native FP4 support, such as NVIDIA's Blackwell-era GPUs, to get the inference speedups that FP4 quantization offers. DeepSeek-R1's RLVR/GRPO approach yields strong reasoning capabilities, but it is currently brittle and sensitive to infrastructure. Focus on efficient architectures and quantization for cost-effective, high-performance LLM operations rather than on models like Llama 4 that have shown poor practical performance.

Key insights

DeepSeek's MoE and RLVR advancements, alongside FP4 quantization, defined 2025 LLM progress, while Llama 4 disappointed.

Principles

Sparse MoE enables efficiently large models.
RLVR/GRPO can teach models to "reason."
FP4 quantization significantly boosts inference speed.
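
To make the first and third principles concrete, here is a rough back-of-the-envelope sketch in Python. It uses the 685B total and 37B activated figures quoted in the summary; the helper names are hypothetical, and the 4.5 bits per weight assumes an NVFP4-style layout (4-bit values plus one 8-bit scale per 16 values), ignoring any layers kept in higher precision.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used for each token in a sparse MoE model."""
    return active_params_b / total_params_b


def weight_memory_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), params in billions."""
    return params_b * bits_per_param / 8


TOTAL_B = 685.0   # total parameters in billions, as quoted in the summary
ACTIVE_B = 37.0   # parameters activated per token, in billions

frac = active_fraction(TOTAL_B, ACTIVE_B)
bf16_gb = weight_memory_gb(TOTAL_B, 16)
# Assumption: 4-bit values + one 8-bit scale shared by 16 values = 4.5 bits/weight.
fp4_gb = weight_memory_gb(TOTAL_B, 4.5)

print(f"active fraction per token: {frac:.1%}")
print(f"BF16 weights: {bf16_gb:.0f} GB, FP4 weights: {fp4_gb:.0f} GB")
```

Note that sparsity cuts the compute per token, not the storage: every expert still has to be held in memory, which is part of why low-bit formats matter for models of this size.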

Method

DeepSeek-R1's approach uses GRPO to optimize against verifiable rewards (RLVR) during post-training, which teaches the model to "reason." That behavior can then be transferred to smaller models through distillation on synthetic reasoning traces. Hugging Face TRL and Unsloth provide tooling support.
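
The core idea behind GRPO with a verifiable reward can be sketched in a few lines of Python. This is a minimal illustration, not DeepSeek's implementation or the TRL API: the `Answer:` marker, the exact-match checker, and both function names are hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail). Several completions are sampled for one prompt, each is scored by a programmatic check, and each score is normalized against its own group.

```python
import statistics


def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical checker: 1.0 if the text after the last 'Answer:' matches."""
    marker = "Answer:"
    if marker not in completion:
        return 0.0
    predicted = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt ("What is 2 + 2?"), four sampled completions.
completions = [
    "2 + 2 is 4. Answer: 4",
    "I think it is five. Answer: 5",
    "Answer: 4",
    "No idea.",
]
rewards = [verifiable_reward(c, "4") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, advantage in zip(completions, advantages):
    print(f"{advantage:+.3f}  {completion}")
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, with no learned value model involved. When every completion in a group earns the same reward, all advantages are zero and the group contributes no learning signal.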

In practice

Use NVFP4 on Blackwell-era GPUs for nearly 2x faster LLM inference.
Explore RLVR/GRPO for reasoning capabilities.
Consider small models for specific translation tasks.
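
For readers who want to see what block-scaled FP4 does to weights, here is a simplified Python sketch. It rounds each block of 16 values to the E2M1 grid (the eight magnitudes a 4-bit float with 2 exponent bits and 1 mantissa bit can represent) using one shared scale per block. The function names are hypothetical, and the scale is kept as an ordinary Python float as a simplifying assumption; real NVFP4 stores it in FP8 alongside a per-tensor scale, and MXFP4 uses 32-value blocks with a power-of-two scale.

```python
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes
BLOCK_SIZE = 16  # NVFP4-style block length


def quantize_block(block):
    """Return (scale, codes): codes are signed E2M1 grid values."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        # Nearest grid point; ties go to the smaller magnitude in this sketch.
        magnitude = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-magnitude if x < 0 else magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its codes."""
    return [scale * c for c in codes]


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Quantize then dequantize, block by block, to inspect rounding error."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(dequantize_block(scale, codes))
    return out


weights = [0.1 * i - 1.0 for i in range(32)]
restored = fake_quantize(weights)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"worst-case rounding error over {len(weights)} weights: {worst:.4f}")
```

This only simulates the numerics. The inference speedup comes from hardware that multiplies FP4 values natively, which a Python round trip cannot show.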

Topics

DeepSeek Models
LLM Quantization
Reinforcement Learning
Test-time Scaling
Machine Translation

Best for: MLOps Engineer, NLP Engineer, CTO, Machine Learning Engineer, AI Engineer, AI Product Manager

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.