😸 Diffusion models are coming for text at ~$0.80 per MILLION flat

· Source: The Neuron · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Emerging Technologies & Innovation) · Depth: Intermediate, long

Summary

Inception Labs has launched Mercury 2, a diffusion LLM that produces text by refining a full draft in parallel rather than generating it token by token, similar to how AI image generators work. It runs at 1,196 tokens per second, more than ten times the throughput of its closest competitors, Claude 4.5 Haiku (~89 tokens/sec) and GPT-5 Mini (~73 tokens/sec). Mercury 2 is priced at $0.25 per million input tokens and $0.75 per million output tokens, making its output cheaper than GPT-5 Mini's. It ranks 18th out of 134 models on Artificial Analysis's intelligence index, excels at agentic coding and instruction following, and supports tool use, a 128K context window, and OpenAI-compatible structured outputs.

The article also highlights other AI developments: DeepSeek's alleged use of banned NVIDIA Blackwell chips and of knowledge distilled from major AI models, Anthropic's new enterprise plugins for Cowork, significant investments in AI chip startups such as MatX ($500M), and a $100B deal between Meta and AMD for Instinct compute.
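
For budgeting, the Mercury 2 prices quoted in the summary can be turned into a per-request cost. The sketch below is only arithmetic on those two quoted rates; the function name and the example token counts are hypothetical, chosen for illustration.

```python
# Prices per million tokens, as quoted in the summary.
MERCURY2_INPUT_USD_PER_M = 0.25
MERCURY2_OUTPUT_USD_PER_M = 0.75


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the quoted Mercury 2 rates."""
    total = (input_tokens * MERCURY2_INPUT_USD_PER_M
             + output_tokens * MERCURY2_OUTPUT_USD_PER_M)
    return total / 1_000_000


# Hypothetical workload: 2,000 prompt tokens and 500 completion tokens per call.
cost = request_cost_usd(2_000, 500)
print(f"${cost:.6f} per call, ${cost * 1_000_000:,.2f} per million calls")
```

At those assumed sizes a call costs $0.000875, so output pricing dominates only when completions are long relative to prompts.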

Key takeaway

For Machine Learning Engineers building agent-based systems, Mercury 2's roughly 10x speed advantage matters most in chained AI calls, where each step waits on the previous one and per-call latency adds up. Consider evaluating diffusion LLMs like Mercury 2 in your workflows for more natural voice assistants, faster code agents, and background automations that finish sooner, especially given its competitive pricing and OpenAI-compatible stack.
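
To see how throughput compounds across chained calls, here is a rough estimate using the tokens-per-second figures quoted in the summary. The agent loop (8 sequential calls of 400 output tokens each) is a hypothetical workload, and the estimate counts output generation time only.

```python
# Output throughput in tokens per second, as quoted in the summary.
THROUGHPUT_TPS = {
    "Mercury 2": 1196,
    "Claude 4.5 Haiku": 89,
    "GPT-5 Mini": 73,
}


def chain_decode_seconds(tokens_per_step: int, steps: int, tps: float) -> float:
    """Time spent generating output across sequential calls.

    Ignores network latency, time to first token, and tool execution,
    so real end-to-end times will be higher.
    """
    return steps * tokens_per_step / tps


# Hypothetical agent loop: 8 sequential calls of 400 output tokens each.
for model, tps in THROUGHPUT_TPS.items():
    print(f"{model}: {chain_decode_seconds(400, 8, tps):.1f} s")
```

Under these assumptions the loop spends about 2.7 seconds generating on Mercury 2, against roughly 36 and 44 seconds on the other two models.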

Key insights

Diffusion models can significantly accelerate text processing by editing full drafts in parallel, rather than generating word-by-word.

Principles

Method

Mercury 2 employs a diffusion LLM approach, starting with a rough sketch of an entire answer and then refining it all at once, mimicking an editor's parallel revision process for text and reasoning tasks.
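
The difference in sequential work can be illustrated with a toy comparison. This is not Mercury 2's algorithm and involves no model at all: the known `TARGET` sentence stands in for the model's predictions, and `tokens_per_pass` is an arbitrary assumption. It only counts how many sequential passes each style needs to finish the same text.

```python
TARGET = "diffusion models refine a whole draft at once".split()
MASK = "_"


def autoregressive(target):
    """One token per sequential pass, left to right."""
    draft, passes = [], 0
    for token in target:
        draft.append(token)  # stand-in for one model forward pass
        passes += 1
    return draft, passes


def parallel_refine(target, tokens_per_pass=3):
    """Start from a fully masked draft; each pass fills several positions."""
    draft = [MASK] * len(target)
    passes = 0
    while MASK in draft:
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        # A real model would update these positions together in one pass.
        for i in masked[:tokens_per_pass]:
            draft[i] = target[i]
        passes += 1
    return draft, passes


_, ar_passes = autoregressive(TARGET)
_, pr_passes = parallel_refine(TARGET)
print(f"autoregressive: {ar_passes} passes, parallel refinement: {pr_passes} passes")
```

For this 8-token sentence the toy needs 8 sequential passes token by token and 3 when filling three positions per pass; a real diffusion model may also revise positions it has already filled.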

Best for: CTO, Machine Learning Engineer, NLP Engineer, AI Engineer, AI Product Manager, Investor

Editorial summary, takeaway, and curation by AIssential. Original article published by The Neuron.