😺 🎙️ Mercury 2: AI that's 10x faster than ChatGPT & Claude

2026-02-24 · Source: The Neuron · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Intermediate, long

Summary

Inception Labs has launched Mercury 2, a new reasoning model built on diffusion, an approach distinct from the autoregressive generation used by ChatGPT, Claude, and Gemini. Where traditional LLMs generate text one token at a time, Mercury 2 drafts an entire answer at once and then refines it. This parallel generation yields a throughput of about 1,000 tokens per second on NVIDIA Blackwell GPUs, roughly 10 times faster than Claude 4.5 Haiku and GPT 5.2 Mini at comparable quality. Mercury 2 also has significantly lower pricing, at $0.25 per million input tokens and $0.75 per million output tokens, and supports a 128K context window, full tool use, and JSON output. Together with energy-based models from Logical Intelligence, it signals a shift away from the memory-bound, one-token-at-a-time bottleneck of current large language models.
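The listed prices and throughput translate directly into per-request estimates. The sketch below uses only the figures quoted above ($0.25 and $0.75 per million input and output tokens, 1,000 tokens per second); the request sizes are hypothetical, and the timing ignores network latency, queuing, and time to first token.

```python
# Mercury 2 list prices (USD per million tokens), as quoted in the summary.
INPUT_PRICE_PER_M = 0.25
OUTPUT_PRICE_PER_M = 0.75
# Reported throughput on NVIDIA Blackwell GPUs; real-world speed will vary.
TOKENS_PER_SECOND = 1_000


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the listed per-million-token prices."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def generation_seconds(output_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Idealized generation time: output length divided by throughput."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # Hypothetical workload: a 2,000-token prompt and a 500-token answer.
    cost = request_cost(2_000, 500)
    print(f"cost per request:     ${cost:.6f}")                 # $0.000875
    print(f"generation time:      {generation_seconds(500):.2f} s")  # 0.50 s
    print(f"cost per 1M requests: ${cost * 1_000_000:,.2f}")     # $875.00
```

Swapping in your own token counts (or another provider's prices) gives a like-for-like comparison before running any benchmarks.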

Key takeaway

For AI/ML Directors evaluating LLM infrastructure, Mercury 2 is a compelling alternative to traditional autoregressive models. Its diffusion-based architecture delivers roughly 10x the throughput of comparable small autoregressive models at significantly lower cost, which makes it well suited to latency-sensitive applications such as coding IDEs, voice agents, and customer support. Benchmark Mercury 2's speed, quality, and cost on your own workloads, especially where latency and efficiency are critical, and estimate how much it could reduce operational expenses.

Key insights

Diffusion models generate tokens in parallel rather than one at a time; in Mercury 2's case this yields roughly a 10x speedup and lower costs compared with autoregressive LLMs.

Principles

Autoregressive models are memory-bound.
Diffusion models process tokens in parallel.
Energy-based models enhance reasoning accuracy.
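The "memory-bound" principle can be made concrete with a back-of-envelope estimate. In single-sequence decoding, each forward pass must stream essentially all model weights from GPU memory, so a model that emits one token per pass is capped near memory bandwidth divided by weight size. A model that refines a block of tokens in parallel still reads the weights once per pass, but commits many tokens over a small number of passes. The values below (an 8-billion-parameter model in 16-bit weights, 8 TB/s of bandwidth, a 256-token block refined in 32 passes) are hypothetical, not Mercury 2's actual figures, which the source does not give; the bound also ignores compute limits, KV-cache traffic, and batching.

```python
def autoregressive_bound(params: float, bytes_per_param: float, bandwidth: float) -> float:
    """Tokens/s ceiling when every generated token costs one full read of the weights."""
    weight_bytes = params * bytes_per_param
    return bandwidth / weight_bytes


def parallel_refinement_bound(
    params: float, bytes_per_param: float, bandwidth: float, block_len: int, passes: int
) -> float:
    """Tokens/s ceiling when `passes` forward passes (one weight read each) together
    produce `block_len` tokens. Assumes every pass is still memory-bound."""
    passes_per_second = bandwidth / (params * bytes_per_param)
    return passes_per_second * block_len / passes


if __name__ == "__main__":
    # Hypothetical model and hardware settings, for illustration only.
    PARAMS = 8e9         # 8B parameters
    BYTES_PER_PARAM = 2  # 16-bit weights
    BANDWIDTH = 8e12     # 8 TB/s of memory bandwidth

    ar = autoregressive_bound(PARAMS, BYTES_PER_PARAM, BANDWIDTH)
    par = parallel_refinement_bound(PARAMS, BYTES_PER_PARAM, BANDWIDTH, block_len=256, passes=32)
    print(f"autoregressive ceiling:      {ar:,.0f} tokens/s")   # 500
    print(f"parallel-refinement ceiling: {par:,.0f} tokens/s")  # 4,000
```

The ratio between the two ceilings is simply `block_len / passes`, so the number of refinement passes matters as much as the parallelism itself.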

Method

Mercury 2 generates a complete text answer and then refines it, contrasting with the sequential, one-token-at-a-time generation of traditional autoregressive LLMs.
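The difference in generation strategy shows up in the number of model calls. The toy sketch below contrasts an autoregressive loop, which makes one call per token, with a masked-diffusion-style loop that starts from an all-masked draft, predicts every position in each call, and commits the most confident predictions a few at a time. Both "models" are hypothetical stand-ins that read from a fixed target sentence, and the unmasking schedule is one common discrete-diffusion scheme, not Mercury 2's own algorithm, which the source does not describe.

```python
import math
from typing import List, Optional, Tuple

# Toy "ground truth" answer that both decoders reproduce.
TARGET = "the quick brown fox jumps over the lazy dog".split()


def toy_denoiser(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
    """Hypothetical stand-in for a diffusion LM: ONE call proposes a (token, confidence)
    pair for every position. It peeks at TARGET and uses a fixed, position-dependent
    confidence so the example is deterministic."""
    n = len(seq)
    return [(TARGET[i], ((i * 7) % n) / n) for i in range(n)]


def toy_next_token(prefix: List[str]) -> str:
    """Hypothetical stand-in for an autoregressive LM: one call yields one next token."""
    return TARGET[len(prefix)]


def autoregressive_decode(length: int) -> Tuple[List[str], int]:
    out: List[str] = []
    calls = 0
    while len(out) < length:
        out.append(toy_next_token(out))  # strictly one token per forward pass
        calls += 1
    return out, calls


def parallel_refine_decode(length: int, steps: int) -> Tuple[List[str], int]:
    seq: List[Optional[str]] = [None] * length  # start from an all-masked draft
    per_step = math.ceil(length / steps)        # positions committed per pass
    calls = 0
    while any(tok is None for tok in seq):
        proposals = toy_denoiser(seq)           # every position predicted at once
        calls += 1
        masked = [i for i, tok in enumerate(seq) if tok is None]
        # Commit the most confident masked positions; the rest stay masked
        # and are predicted again on the next pass.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = proposals[i][0]
    return [tok for tok in seq if tok is not None], calls


if __name__ == "__main__":
    ar_text, ar_calls = autoregressive_decode(len(TARGET))
    pr_text, pr_calls = parallel_refine_decode(len(TARGET), steps=3)
    assert ar_text == pr_text == TARGET
    print(ar_calls, pr_calls)  # 9 3
```

Both loops produce the same nine words, but the parallel loop needs three model calls instead of nine; in a memory-bound regime, fewer calls means fewer full passes over the weights.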

In practice

Use Mercury 2 for high-throughput text generation.
Explore diffusion models for cost-efficient inference.
Consider energy-based models for verifiable reasoning tasks.
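Energy-based models score a whole candidate answer with a single scalar energy (lower means more consistent) and search for the lowest-energy candidate, which is what makes their outputs checkable: a zero-energy answer satisfies every constraint the energy encodes. The toy below hand-writes the energy for a two-equation system and minimizes it by exhaustive search. A real energy-based model would learn the energy function and use a smarter search; this is a generic illustration, not a description of Logical Intelligence's models.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

Candidate = Tuple[int, int]


def energy(c: Candidate) -> int:
    """Hand-written energy for the toy system x + y = 10, x - y = 2.
    0 means every constraint holds; larger values mean larger violations.
    A learned energy-based model would replace this function."""
    x, y = c
    return abs(x + y - 10) + abs(x - y - 2)


def minimize_energy(
    candidates: Iterable[Candidate], energy_fn: Callable[[Candidate], int]
) -> Tuple[Candidate, int]:
    """Exhaustive search for the lowest-energy candidate (adequate for a tiny space)."""
    best = min(candidates, key=energy_fn)
    return best, energy_fn(best)


if __name__ == "__main__":
    best, e = minimize_energy(product(range(11), repeat=2), energy)
    print(best, e)  # (6, 4) 0
```

Because the returned energy is 0, the answer can be verified against the constraints directly rather than trusted on the model's word.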

Topics

Diffusion Models
Large Language Models
AI Performance
Energy-Based Models
AI Reasoning

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by The Neuron.