Mercury 2: AI that's 10x faster than ChatGPT & Claude
Summary
Inception Labs has launched Mercury 2, a reasoning model built on diffusion rather than the autoregressive approach used by models like ChatGPT, Claude, and Gemini. Where traditional LLMs generate text one token at a time, Mercury 2 produces an entire answer at once and then refines it. This parallel processing enables a throughput of 1,000 tokens per second on NVIDIA Blackwell GPUs, approximately 10 times faster than Claude 4.5 Haiku and GPT 5.2 Mini, while maintaining comparable quality. Mercury 2 is also priced significantly lower, at $0.25 per million input tokens and $0.75 per million output tokens, and offers a 128K context window, full tool use, and JSON output support. Together with energy-based models from Logical Intelligence, this development signals a shift away from the memory-bound, one-token-at-a-time bottleneck prevalent in current large language models.
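To make the pricing and throughput figures concrete, here is a rough back-of-the-envelope sketch in Python. The prices and the 1,000 tokens per second figure come from the summary above. The baseline of 100 tokens per second is only an assumption derived from the "approximately 10 times faster" claim, and the function names and the example workload are hypothetical.

```python
# Prices quoted in the summary, in US dollars per million tokens.
INPUT_PRICE_PER_MILLION = 0.25
OUTPUT_PRICE_PER_MILLION = 0.75

# Throughput quoted in the summary (tokens per second).
MERCURY_TOKENS_PER_SECOND = 1000
# Assumption: "approximately 10 times faster" implies a baseline near 100 tokens/s.
BASELINE_TOKENS_PER_SECOND = MERCURY_TOKENS_PER_SECOND / 10


def estimate_cost(input_tokens, output_tokens):
    """Return the estimated cost in dollars for one workload."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


def generation_seconds(output_tokens, tokens_per_second):
    """Return the time to emit output_tokens at a steady throughput."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical workload: 2,000 prompt tokens and 500 generated tokens per request.
    cost = estimate_cost(2_000, 500)
    fast = generation_seconds(500, MERCURY_TOKENS_PER_SECOND)
    slow = generation_seconds(500, BASELINE_TOKENS_PER_SECOND)
    print(f"cost per request: ${cost:.6f}")
    print(f"generation time: {fast:.2f}s vs {slow:.2f}s at the assumed baseline")
```

This estimate covers steady-state generation only. It ignores network latency, time to first token, and any batching effects, so treat the result as an upper-level comparison rather than a benchmark.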
Key takeaway
For AI/ML Directors evaluating LLM infrastructure, Mercury 2 presents a compelling alternative to traditional autoregressive models. Its diffusion-based architecture offers a 10x increase in throughput and significantly lower costs, making it ideal for latency-sensitive applications like coding IDEs, voice agents, and customer support. You should investigate Mercury 2's performance and cost benefits for your specific use cases, especially where speed and efficiency are critical, and consider its potential to reduce operational expenses.
Key insights
By refining tokens in parallel rather than generating them one at a time, Mercury 2's diffusion approach delivers roughly 10x the throughput of comparable autoregressive LLMs at a lower price per token.
Principles
- Autoregressive models are memory-bound.
- Diffusion models process tokens in parallel.
- Energy-based models enhance reasoning accuracy.
Method
Mercury 2 generates a complete text answer and then refines it, contrasting with the sequential, one-token-at-a-time generation of traditional autoregressive LLMs.
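The contrast between the two decoding styles can be illustrated with a toy step counter. This is a simplified sketch, not Mercury 2's actual algorithm: the `target` list stands in for a model's predictions, and `tokens_per_step` is a hypothetical parameter for how many positions get filled in one parallel pass. Real diffusion language models may also revise positions they have already filled, which this toy does not do.

```python
MASK = "<mask>"


def autoregressive_decode(target):
    """Emit one token per step, so the step count equals the sequence length."""
    output = []
    steps = 0
    for token in target:
        output.append(token)  # stand-in for one forward pass of a model
        steps += 1
    return output, steps


def parallel_refine_decode(target, tokens_per_step=4):
    """Start from a fully masked draft and fill several positions per step."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    draft = [MASK] * len(target)
    pending = list(range(len(target)))  # positions that are still masked
    steps = 0
    while pending:
        batch, pending = pending[:tokens_per_step], pending[tokens_per_step:]
        for i in batch:
            draft[i] = target[i]  # stand-in for the model's parallel prediction
        steps += 1
    return draft, steps


if __name__ == "__main__":
    answer = "the quick brown fox jumps over the lazy dog today".split()
    _, sequential_steps = autoregressive_decode(answer)
    _, parallel_steps = parallel_refine_decode(answer, tokens_per_step=4)
    print(f"sequential steps: {sequential_steps}, parallel steps: {parallel_steps}")
```

Both functions produce the same final text. What differs is the number of passes needed, which is the intuition behind the throughput gain described above.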
In practice
- Use Mercury 2 for high-throughput text generation.
- Explore diffusion models for cost-efficient inference.
- Consider energy-based models for verifiable reasoning tasks.
Topics
- Diffusion Models
- Large Language Models
- AI Performance
- Energy-Based Models
- AI Reasoning
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by The Neuron.