
Diffusion Models Hit 1009 Tokens/Sec

⚛️ Read the original on 量子位

💡 Diffusion at 1009 tokens/sec outpaces autoregressive generation; with Nvidia and Microsoft investing, a paradigm shift may be underway.

⚡ 30-Second TL;DR

What Changed

Diffusion models generate 1009 tokens per second

Why It Matters

This breakthrough could end reliance on slow autoregressive generation, enabling real-time complex reasoning in LLMs. Investments signal industry shift toward diffusion for scalable inference.

What To Do Next

Implement diffusion-based generation following arXiv papers such as Diffusion-LM and benchmark it against autoregressive baselines.
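To compare the two paradigms fairly, measure decode throughput the same way for both. A minimal timing harness is sketched below; `dummy_generate` is a stand-in for illustration, and you would swap in your diffusion model and autoregressive baseline behind the same `generate` interface.

```python
import time

def tokens_per_second(generate, prompt, n_runs=3):
    """Time a text-generation callable and report average decode throughput.

    `generate` is any function mapping a prompt to a list of tokens; pass
    a diffusion decoder and an autoregressive baseline to compare them.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)

# Stand-in "model": emits 100 tokens with a tiny per-token delay.
def dummy_generate(prompt):
    out = []
    for i in range(100):
        time.sleep(0.0001)  # simulate per-token work
        out.append(f"tok{i}")
    return out

print(f"{tokens_per_second(dummy_generate, 'hello'):.0f} tokens/sec")
```

Run several warm iterations and average, since GPU clocks and caches make single-shot timings noisy.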

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • Inception Labs, founded by Stanford professor Stefano Ermon who contributed to Stable Diffusion and DALL-E, released Mercury 2 on February 24, 2026, backed by a $50M seed round from Menlo Ventures.[1][5]
  • Mercury 2 features a 128,000 token context window, native tool use, tunable inference, and schema-compliant JSON output, targeting latency-sensitive applications like voice AI, code editors, and agentic loops.[3][4][5]
  • The model achieves end-to-end latency of 1.7 seconds, well below competitors', while posting strong scores on benchmarks such as AIME 2025 (91.1), GPQA (73.6), and LiveCodeBench (67.3).[6][7]
📊 Competitor Analysis

| Model | Speed (tokens/sec) | Input Price ($/M) | Output Price ($/M) | Latency (sec) | Context Length |
|---|---|---|---|---|---|
| Mercury 2 | 1009 | 0.25 | 0.75 | 1.7 | 128K |
| Claude Haiku 4.5 | ~89 | 1.00 | 5.00 | 23.4 | N/A |
| GPT-5 Mini | ~71 | N/A | N/A | N/A | N/A |
| Gemini 3 Flash | N/A | 0.50 | 3.00 | 14.4 | N/A |

🛠️ Technical Deep Dive

  • Replaces autoregressive sequential decoding with diffusion-based parallel refinement: generation starts from a noisy sketch of the response and denoises it over a few steps into coherent text.[1][5][6]
  • Processes entire sequences simultaneously with bidirectional attention, generating multiple tokens at once while optimized to reduce recomputation costs per step.[1]
  • Runs on NVIDIA Blackwell GPUs; supports 128K context, native tools, tunable reasoning depth, and schema-aligned JSON mode.[3][4][5]
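The parallel-refinement idea above can be sketched with a toy decoder. This is an illustration of the scheme, not Mercury's actual code: every position starts as noise, and each step "denoises" a batch of positions anywhere in the sequence, so the whole response is produced in a few passes rather than one token at a time. The denoiser here is an oracle that knows the target; a real model predicts it.

```python
import math
import random

random.seed(0)

def diffusion_decode(target, vocab, steps=4):
    """Toy parallel-refinement decoder (illustrative only)."""
    n = len(target)
    seq = [random.choice(vocab) for _ in range(n)]   # fully noisy start
    order = list(range(n))
    random.shuffle(order)                            # commit positions in random order
    per_step = math.ceil(n / steps)
    for step in range(steps):
        # One refinement pass: a batch of positions is "denoised" in parallel.
        for i in order[step * per_step:(step + 1) * per_step]:
            seq[i] = target[i]
    return "".join(seq)

print(diffusion_decode(list("diffusion decodes in parallel"),
                       list("abcdefghijklmnopqrstuvwxyz ")))
# prints "diffusion decodes in parallel" after the final refinement step
```

Note the contrast with an autoregressive loop: here the number of passes is a fixed `steps`, independent of sequence length, which is where the throughput advantage comes from.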

🔮 Future Implications
AI analysis grounded in cited sources.

Diffusion LLMs will capture >30% of latency-critical inference market by 2027
Mercury 2's 5x speed and 2-4x cost advantages over autoregressive models enable viable multi-step agentic workflows in real-time apps like voice and code tools.[1][4][6]
NVIDIA Blackwell adoption surges 50% for text diffusion models in 2026
Mercury 2 demonstrates 1009 tokens/sec exclusively on Blackwell GPUs, incentivizing inference providers to upgrade for parallel text generation efficiency.[1][3][5]
Reasoning quality matches top models at <20% inference cost
Benchmarks show Mercury 2 competitive with Claude Haiku 4.5 and GPT-5 Mini on AIME/GPQA while priced at $0.25/$0.75 per million tokens.[5][7]
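The "<20% inference cost" claim can be checked with the list prices from the comparison table. The workload mix below (2K prompt tokens, 1K completion tokens per request) is an assumed example, not a figure from the source.

```python
# Published list prices ($ per million tokens) from the comparison table.
prices = {
    "Mercury 2":        {"in": 0.25, "out": 0.75},
    "Claude Haiku 4.5": {"in": 1.00, "out": 5.00},
}

def request_cost(model, in_toks, out_toks):
    """Dollar cost of one request at the model's per-million-token prices."""
    p = prices[model]
    return (in_toks * p["in"] + out_toks * p["out"]) / 1_000_000

# Assumed example workload: 2K prompt tokens, 1K completion tokens.
mercury = request_cost("Mercury 2", 2000, 1000)
haiku = request_cost("Claude Haiku 4.5", 2000, 1000)
print(f"Mercury 2: ${mercury:.5f}  Haiku 4.5: ${haiku:.5f}  ratio: {mercury/haiku:.0%}")
# → ratio: 18%, consistent with the sub-20% cost claim for this mix
```

The exact ratio shifts with the prompt/completion mix, since Mercury's discount is larger on output tokens (15%) than on input tokens (25%).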

Timeline

2024-12
Inception Labs founded by Stanford's Stefano Ermon, leveraging diffusion expertise from Stable Diffusion/DALL-E work.
2025-01
Inception secures $50M seed funding from Menlo Ventures to commercialize diffusion for language models.
2026-02
Mercury 2 released on February 24 as first production diffusion reasoning LLM, hitting 1009 tokens/sec on Blackwell GPUs.


AI-curated news aggregator. All content rights belong to original publishers.
Original source: 量子位