
Diffusion Models Hit 1009 Tokens/Sec

⚛️ Read the original on 量子位

💡 Diffusion at 1009 tokens/sec outpaces autoregressive generation; with Nvidia and Microsoft investing, a paradigm shift may be underway.

⚡ 30-Second TL;DR

What Changed

Diffusion models generate 1009 tokens per second

Why It Matters

This breakthrough could end reliance on slow autoregressive generation, enabling real-time complex reasoning in LLMs. Investments signal industry shift toward diffusion for scalable inference.

What To Do Next

Implement diffusion-based generation following arXiv papers such as Diffusion-LM and benchmark it against autoregressive baselines.
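To compare the two paradigms fairly, measure decode throughput the same way for both. A minimal timing harness is sketched below; `dummy_generate` is a stand-in for illustration, and you would swap in your diffusion model and autoregressive baseline behind the same `generate` interface.

```python
import time

def tokens_per_second(generate, prompt, n_runs=3):
    """Time a text-generation callable and report average decode throughput.

    `generate` is any function mapping a prompt to a list of tokens; pass
    a diffusion decoder and an autoregressive baseline to compare them.
    """
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return sum(rates) / len(rates)

# Stand-in "model": emits 100 tokens with a tiny per-token delay.
def dummy_generate(prompt):
    out = []
    for i in range(100):
        time.sleep(0.0001)  # simulate per-token work
        out.append(f"tok{i}")
    return out

print(f"{tokens_per_second(dummy_generate, 'hello'):.0f} tokens/sec")
```

Run several warm iterations and average, since GPU clocks and caches make single-shot timings noisy.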

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • Inception Labs, founded by Stanford professor Stefano Ermon who contributed to Stable Diffusion and DALL-E, released Mercury 2 on February 24, 2026, backed by a $50M seed round from Menlo Ventures.[1][5]
  • Mercury 2 features a 128,000 token context window, native tool use, tunable inference, and schema-compliant JSON output, targeting latency-sensitive applications like voice AI, code editors, and agentic loops.[3][4][5]
  • The model achieves end-to-end latency of 1.7 seconds, well below competitors', while posting strong scores on benchmarks such as AIME 2025 (91.1), GPQA (73.6), and LiveCodeBench (67.3).[6][7]
📊 Competitor Analysis

| Model | Speed (tokens/sec) | Input Price ($/M) | Output Price ($/M) | Latency (sec) | Context Length |
|---|---|---|---|---|---|
| Mercury 2 | 1009 | 0.25 | 0.75 | 1.7 | 128K |
| Claude Haiku 4.5 | ~89 | 1.00 | 5.00 | 23.4 | N/A |
| GPT-5 Mini | ~71 | N/A | N/A | N/A | N/A |
| Gemini 3 Flash | N/A | 0.50 | 3.00 | 14.4 | N/A |

🛠️ Technical Deep Dive

  • Replaces autoregressive sequential decoding with diffusion-based parallel refinement: generation starts from a noisy sketch of the response and denoises it over a few steps into coherent text.[1][5][6]
  • Processes entire sequences simultaneously with bidirectional attention, generating multiple tokens at once while optimized to reduce recomputation costs per step.[1]
  • Runs on NVIDIA Blackwell GPUs; supports 128K context, native tools, tunable reasoning depth, and schema-aligned JSON mode.[3][4][5]
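The parallel-refinement idea above can be sketched with a toy decoder. This is an illustration of the scheme, not Mercury's actual code: every position starts as noise, and each step "denoises" a batch of positions anywhere in the sequence, so the whole response is produced in a few passes rather than one token at a time. The denoiser here is an oracle that knows the target; a real model predicts it.

```python
import math
import random

random.seed(0)

def diffusion_decode(target, vocab, steps=4):
    """Toy parallel-refinement decoder (illustrative only)."""
    n = len(target)
    seq = [random.choice(vocab) for _ in range(n)]   # fully noisy start
    order = list(range(n))
    random.shuffle(order)                            # commit positions in random order
    per_step = math.ceil(n / steps)
    for step in range(steps):
        # One refinement pass: a batch of positions is "denoised" in parallel.
        for i in order[step * per_step:(step + 1) * per_step]:
            seq[i] = target[i]
    return "".join(seq)

print(diffusion_decode(list("diffusion decodes in parallel"),
                       list("abcdefghijklmnopqrstuvwxyz ")))
# prints "diffusion decodes in parallel" after the final refinement step
```

Note the contrast with an autoregressive loop: here the number of passes is a fixed `steps`, independent of sequence length, which is where the throughput advantage comes from.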

🔮 Future Implications
AI analysis grounded in cited sources.

Diffusion LLMs will capture >30% of latency-critical inference market by 2027
Mercury 2's 5x speed and 2-4x cost advantages over autoregressive models enable viable multi-step agentic workflows in real-time apps like voice and code tools.[1][4][6]
NVIDIA Blackwell adoption surges 50% for text diffusion models in 2026
Mercury 2 demonstrates 1009 tokens/sec exclusively on Blackwell GPUs, incentivizing inference providers to upgrade for parallel text generation efficiency.[1][3][5]
Reasoning quality matches top models at <20% inference cost
Benchmarks show Mercury 2 competitive with Claude Haiku 4.5 and GPT-5 Mini on AIME/GPQA while priced at $0.25/$0.75 per million tokens.[5][7]
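The "<20% inference cost" claim can be checked with the list prices from the comparison table. The workload mix below (2K prompt tokens, 1K completion tokens per request) is an assumed example, not a figure from the source.

```python
# Published list prices ($ per million tokens) from the comparison table.
prices = {
    "Mercury 2":        {"in": 0.25, "out": 0.75},
    "Claude Haiku 4.5": {"in": 1.00, "out": 5.00},
}

def request_cost(model, in_toks, out_toks):
    """Dollar cost of one request at the model's per-million-token prices."""
    p = prices[model]
    return (in_toks * p["in"] + out_toks * p["out"]) / 1_000_000

# Assumed example workload: 2K prompt tokens, 1K completion tokens.
mercury = request_cost("Mercury 2", 2000, 1000)
haiku = request_cost("Claude Haiku 4.5", 2000, 1000)
print(f"Mercury 2: ${mercury:.5f}  Haiku 4.5: ${haiku:.5f}  ratio: {mercury/haiku:.0%}")
# → ratio: 18%, consistent with the sub-20% cost claim for this mix
```

The exact ratio shifts with the prompt/completion mix, since Mercury's discount is larger on output tokens (15%) than on input tokens (25%).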

Timeline

2024-12
Inception Labs founded by Stanford's Stefano Ermon, leveraging diffusion expertise from Stable Diffusion/DALL-E work.
2025-01
Inception secures $50M seed funding from Menlo Ventures to commercialize diffusion for language models.
2026-02
Mercury 2 released on February 24 as first production diffusion reasoning LLM, hitting 1009 tokens/sec on Blackwell GPUs.


AI-curated news aggregator. All content rights belong to original publishers.
Original source: 量子位