Diffusion Models Hit 1009 Tokens/Sec
💡 Diffusion at 1009 tokens/sec beats autoregressive generation; with Nvidia and Microsoft among the investors, a paradigm shift may be incoming.
⚡ 30-Second TL;DR
What Changed
Inception Labs' Mercury 2, a diffusion-based LLM, generates 1009 tokens per second.
Why It Matters
This breakthrough could end reliance on slow autoregressive generation, enabling real-time complex reasoning in LLMs. The investments signal an industry shift toward diffusion for scalable inference.
What To Do Next
Benchmark diffusion-based generation (for example, Diffusion-LM and related arXiv work) against autoregressive baselines on your own workloads.
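As a starting point, here is a minimal sketch of how such a throughput comparison might be timed in Python. The `generate` callable and the `dummy_generate` stand-in are illustrative assumptions, not any model's actual API.

```python
import time
from typing import Callable, Iterable

def measure_tokens_per_second(generate: Callable[[str], Iterable[str]], prompt: str) -> float:
    """Time a streaming generation call and report tokens per second.

    `generate` is a hypothetical callable that yields tokens one at a time;
    swap in a diffusion or autoregressive client of your choice.
    """
    start = time.perf_counter()
    token_count = sum(1 for _ in generate(prompt))
    elapsed = time.perf_counter() - start
    return token_count / elapsed if elapsed > 0 else float("inf")

# Stand-in generator for demonstration only; replace with a real model client.
def dummy_generate(prompt: str):
    for token in prompt.split():
        yield token

print(f"{measure_tokens_per_second(dummy_generate, 'a short test prompt'):.1f} tokens/sec")
```

Swapping a real streaming client in for `dummy_generate` lets the same harness compare diffusion and autoregressive backends on identical prompts.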
🧠 Deep Insight
Web-grounded analysis with 8 cited sources.
🔑 Enhanced Key Takeaways
- Inception Labs, founded by Stanford professor Stefano Ermon, who contributed to the research behind Stable Diffusion and DALL-E, released Mercury 2 on February 24, 2026, backed by a $50M seed round from Menlo Ventures.[1][5]
- Mercury 2 features a 128,000-token context window, native tool use, tunable inference, and schema-compliant JSON output (see the sketch after this list), targeting latency-sensitive applications like voice AI, code editors, and agentic loops.[3][4][5]
- The model achieves end-to-end latency of 1.7 seconds, undercutting competitors, while scoring 91.1 on AIME 2025, 73.6 on GPQA, and 67.3 on LiveCodeBench.[6][7]
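The schema-compliant JSON output mentioned above is easier to picture with a small sketch. The request shape, model name, and `response_format` field below are assumptions for illustration, not Inception's documented API; the local `jsonschema` check is simply a defensive pattern around whatever the model returns.

```python
import json
from jsonschema import validate  # third-party: pip install jsonschema

# Hypothetical response schema the model is asked to conform to.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["summary", "priority"],
}

# Illustrative request body; field names are assumptions, not Mercury 2's actual API.
request_body = {
    "model": "mercury-2",
    "messages": [{"role": "user", "content": "Summarize this bug report as a ticket."}],
    "response_format": {"type": "json_schema", "schema": TICKET_SCHEMA},
}

# Stand-in reply; validate it against the schema locally as a safety net.
reply = '{"summary": "Login fails on Safari", "priority": "high"}'
validate(instance=json.loads(reply), schema=TICKET_SCHEMA)
print("reply conforms to schema")
```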
📊 Competitor Analysis
| Model | Speed (tokens/sec) | Input Price ($/M) | Output Price ($/M) | Latency (sec) | Context Length |
|---|---|---|---|---|---|
| Mercury 2 | 1009 | 0.25 | 0.75 | 1.7 | 128K |
| Claude Haiku 4.5 | ~89 | 1.00 | 5.00 | 23.4 | N/A |
| GPT-5 Mini | ~71 | N/A | N/A | N/A | N/A |
| Gemini 3 Flash | N/A | 0.50 | 3.00 | 14.4 | N/A |
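To translate the per-million-token prices in the table into something tangible, the snippet below computes an approximate per-request cost for an assumed workload of 2,000 input and 500 output tokens; the workload size is an illustrative assumption.

```python
# Per-million-token prices from the table above (USD): (input, output).
PRICES = {
    "Mercury 2": (0.25, 0.75),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Gemini 3 Flash": (0.50, 3.00),
}

# Assumed workload: 2,000 input tokens and 500 output tokens per request.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 500

for model, (in_price, out_price) in PRICES.items():
    cost = (INPUT_TOKENS * in_price + OUTPUT_TOKENS * out_price) / 1_000_000
    print(f"{model}: ${cost:.6f} per request")
```

At that workload, Mercury 2 works out to roughly $0.0009 per request versus about $0.0045 for Claude Haiku 4.5, consistent with the "dramatically lower inference cost" framing in the Businesswire release.[7]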
🛠️ Technical Deep Dive
- Replaces autoregressive sequential decoding with diffusion-based parallel refinement: starts from a noisy sketch of the response and denoises it over a few steps into coherent text (a toy sketch follows this list).[1][5][6]
- Processes entire sequences simultaneously using bidirectional attention, emitting multiple tokens at once while keeping recomputation costs per step low.[1]
- Runs on NVIDIA Blackwell GPUs; supports 128K context, native tool use, tunable reasoning depth, and a schema-aligned JSON mode.[3][4][5]
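To make the parallel-refinement idea concrete, here is a deliberately simplified toy in Python: several masked positions can be committed within a single step, rather than one token at a time. It illustrates the general masked-diffusion decoding loop only; the vocabulary, fill schedule, and random "predictions" are assumptions, not Mercury 2's actual algorithm.

```python
import random

VOCAB = ["the", "quick", "brown", "fox", "jumps"]
MASK = "<mask>"

def toy_denoise_step(sequence, fill_fraction=0.5):
    """One 'denoising' step: fill a fraction of remaining masked positions in parallel.

    A real diffusion LM would score every position with a bidirectional model;
    here we sample from a toy vocabulary just to show the control flow.
    """
    masked_positions = [i for i, tok in enumerate(sequence) if tok == MASK]
    if not masked_positions:
        return sequence
    # Commit a subset of masked positions this step, all at once.
    k = max(1, int(len(masked_positions) * fill_fraction))
    for i in random.sample(masked_positions, k):
        sequence[i] = random.choice(VOCAB)  # stand-in for the model's prediction
    return sequence

# Start from a fully masked "noisy sketch" and refine over a few steps.
seq = [MASK] * 8
step = 0
while MASK in seq:
    seq = toy_denoise_step(seq)
    step += 1
    print(f"step {step}: {' '.join(seq)}")
```

A real diffusion LM replaces the random choices with scores from a bidirectional transformer and typically commits the highest-confidence positions first, which is what allows many tokens to be resolved per forward pass.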
🔮 Future Implications
AI analysis grounded in cited sources
⏳ Timeline
📎 Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- implicator.ai — Inception Ships Mercury 2, a Diffusion LLM That Hits 1,009 Tokens Per Second
- techmeme.com — P22
- gigazine.net — Inception Mercury 2 (February 25, 2026)
- deeplearning.ai — Anthropic, U.S. Square Off Over AI Safeguards
- inceptionlabs.ai — Introducing Mercury 2
- the-decoder.com — Inception Launches Mercury 2, the First Diffusion-Based Language Reasoning Model
- businesswire.com — Inception Launches Mercury 2, the Fastest Reasoning LLM, 5x Faster Than Leading Speed-Optimized LLMs with Dramatically Lower Inference Cost
- perplexity.ai — Inception Launches Mercury 2
Original source: 量子位 (QbitAI)
