
NVIDIA Rubin+Groq Hits $1T GPU Projection


💡Rubin+Groq delivers 350x token throughput—key for scaling agentic AI inference.

⚡ 30-Second TL;DR

What Changed

Rubin GPU: TSMC 3nm, 336B transistors, 288GB HBM4, 50 PFLOPS NVFP4 inference (5x Blackwell).

Why It Matters

The Rubin+Groq combination redefines AI inference scaling: premium agentic models run at lower latency and cost, putting pressure on custom-ASIC competitors. Enterprises can now tier services by interaction speed, charging more for complex reasoning tasks.

What To Do Next

Test NVIDIA Dynamo for disaggregated inference to boost your LLM token generation speed.

Who should care: Enterprise & Security Teams

🧠 Deep Insight

Web-grounded analysis with 5 cited sources.

🔑 Enhanced Key Takeaways

  • NVIDIA Vera CPU features 88 custom Olympus Arm cores with spatial multi-threading supporting up to 176 threads, paired with up to 1.5TB LPDDR5X SOCAMM memory at 1.2 TB/s bandwidth[1][2][3].
  • Vera Rubin NVL72 rack integrates 72 Rubin GPUs and 36 Vera CPUs, delivering 3.6 exaFLOPS NVFP4 inference, 54TB LPDDR5X, 20.7TB HBM4, and 1.6 PB/s HBM4 bandwidth[1][2][3].
  • Rubin GPUs provide 35 PFLOPS NVFP4 training performance (3.5x Blackwell), enable 1/4 the GPUs for MoE model training, and reduce MoE inference cost per token by up to 10x[2][3].
  • Rubin CPX GPU variant uses monolithic die with 128GB GDDR7 memory, 30 PFLOPS NVFP4 compute, and 3x faster attention for million-token contexts in NVL144 CPX platform[4].
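The rack-level NVL72 figures above follow directly from the per-component specs. A quick arithmetic check (all inputs are figures cited in this digest; only the multiplication is added here):

```python
# Sanity-check Vera Rubin NVL72 rack aggregates from per-component specs.
GPUS_PER_RACK = 72
CPUS_PER_RACK = 36

GPU_NVFP4_PFLOPS = 50    # per Rubin GPU, NVFP4 inference
GPU_HBM4_GB = 288        # per Rubin GPU
CPU_LPDDR5X_TB = 1.5     # per Vera CPU

rack_exaflops = GPUS_PER_RACK * GPU_NVFP4_PFLOPS / 1000  # PFLOPS -> exaFLOPS
rack_hbm4_tb = GPUS_PER_RACK * GPU_HBM4_GB / 1000        # GB -> TB
rack_lpddr_tb = CPUS_PER_RACK * CPU_LPDDR5X_TB

print(rack_exaflops)  # 3.6 exaFLOPS, matching the cited NVL72 figure
print(rack_hbm4_tb)   # ~20.7 TB HBM4
print(rack_lpddr_tb)  # 54.0 TB LPDDR5X
```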

🛠️ Technical Deep Dive

  • Rubin GPU is dual-die on TSMC 3nm with reticle-sized dies, eight HBM4 stacks for 22 TB/s bandwidth (2.8x Blackwell HBM3e), supporting third-generation Transformer Engine with NVFP4/NVFP8[1][3].
  • Vera CPU connects to Rubin GPUs via NVLink C2C gen2 at 1.8 TB/s coherent bandwidth, forming unified memory pool with HBM4 and LPDDR5X for KV cache and model weights[2][3].
  • NVLink 6 provides 3.6 TB/s GPU-to-GPU and 260 TB/s rack bandwidth; the Vera CPU supports SMT with 176 threads and 2x data-movement/compression performance over the Grace CPU[1][3].
  • Rubin CPX optimizes inference with NVFP4 resources, 100TB fast memory, and 1.7 PB/s bandwidth in NVL144, offering 7.5x performance over GB300 NVL72[4].
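The per-GPU bandwidth numbers above also reproduce the rack aggregates cited earlier, which is a useful consistency check (inputs are the cited figures; the arithmetic is the only addition):

```python
# Cross-check bandwidth figures: per-GPU HBM4 and NVLink 6 numbers
# should reproduce the NVL72 rack aggregates cited in this digest.
GPUS = 72
HBM4_STACKS = 8
GPU_HBM4_TBPS = 22        # per Rubin GPU, across 8 HBM4 stacks
NVLINK6_GPU_TBPS = 3.6    # NVLink 6, GPU-to-GPU

per_stack_tbps = GPU_HBM4_TBPS / HBM4_STACKS      # 2.75 TB/s per HBM4 stack
rack_hbm4_pbps = GPUS * GPU_HBM4_TBPS / 1000      # ~1.6 PB/s (matches NVL72 spec)
rack_nvlink_tbps = GPUS * NVLINK6_GPU_TBPS        # ~259 TB/s (~260 TB/s cited)
```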

🔮 Future Implications

AI analysis grounded in cited sources

Production Rubin hardware ships to partners in H2 2026
NVIDIA plans deliveries for partner products and DGX systems in second half of 2026 to meet AI datacenter demand[1].
Rubin reduces MoE inference cost-per-token by 10x versus Blackwell
Enhanced low-precision formats and bandwidth enable 10x lower costs across models while increasing token throughput in same rack space[2].
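To make the 10x cost-per-token claim concrete, here is a minimal sketch of what it implies for a serving budget. Only the 10x factor comes from the cited source; the baseline price and traffic volume are hypothetical placeholders:

```python
# Illustrative cost model. The 10x reduction factor is cited above;
# the dollar figures and token volume are hypothetical assumptions.
baseline_cost_per_mtok = 2.00   # hypothetical $ per million tokens on Blackwell
rubin_factor = 10               # cited MoE cost-per-token reduction
rubin_cost_per_mtok = baseline_cost_per_mtok / rubin_factor

monthly_tokens_m = 500_000      # hypothetical: 500B tokens served per month
savings = monthly_tokens_m * (baseline_cost_per_mtok - rubin_cost_per_mtok)
print(f"${rubin_cost_per_mtok:.2f}/Mtok, saving ${savings:,.0f}/month")
# -> $0.20/Mtok, saving $900,000/month
```

The same arithmetic read the other way supports the tiered-pricing point: at fixed spend, a 10x cheaper token lets an operator serve 10x the volume, or reserve the freed capacity for a premium low-latency tier.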

Timeline

2025-09
NVIDIA unveils Rubin CPX GPU for massive-context inference in NVL144 platform
2026-01
NVIDIA launches Rubin AI platform and Vera Rubin NVL72 at CES with 5x inference gains
2026-03
NVIDIA announces Vera Rubin system with Groq LPU integration at GTC for 350x token boost
📰 Weekly AI Recap

Read this week's curated digest of top AI events →

👉 Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅