๐Ÿฆ™Stalecollected in 6h

RTX PRO 6000 MoE Benchmark: 50.5 tok/s Max

PostLinkedIn
๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’กReal-world 50.5 tok/s on 397B MoE + NVIDIA SM120 bugs exposed

โšก 30-Second TL;DR

What Changed

Marlin TP=4 achieves 50.5 tok/s sustainedโ€”best on SM120 hardware.

Why It Matters

Highlights NVIDIA validation gaps on workstation Blackwell, capping FP4 throughput. Guides inference optimization toward Marlin over broken CUTLASS for consumer GPUs.

What To Do Next

Benchmark Marlin TP=4 on your SM120 setup before trying CUTLASS or MTP.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

Web-grounded analysis with 6 cited sources.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขRTX PRO 6000 Blackwell features 96GB GDDR7 VRAM with 1.792 TB/s bandwidth and 512-bit memory interface, enabling single-GPU handling of 70B FP8 models or high-concurrency 30B AWQ MoE workloads[1][2][4].
  • โ€ขA single RTX PRO 6000 achieves ~8,400 tok/s on Qwen3-Coder-30B-AWQ at 400 concurrent requests, rivaling 4x RTX 4090 setups while using less power[2].
  • โ€ขBlackwell architecture provides 125 TFLOPS FP32, 752 5th-gen Tensor Cores, and native FP4/NVFP4 support optimized for quantized LLM inference[1][4].
  • โ€ขRTX PRO 6000 supports LoRA/QLoRA fine-tuning of 30B-40B models on a single GPU due to high VRAM and Tensor Core efficiency[2].
๐Ÿ“Š Competitor Analysisโ–ธ Show
FeatureRTX PRO 6000 BlackwellRTX 6000 AdaH100 PCIeRTX 4090 (4x)
VRAM96GB GDDR748GB GDDR680GB HBM2e96GB total GDDR6X
Bandwidth1.792 TB/s960 GB/s2.0 TB/s~3.9 TB/s total
FP32 TFLOPS12591.1N/A~132 total
30B AWQ tok/s~8,400 (1x)N/AN/A~8,900 (4x)
TDP (est.)400-600W~300W700W~1,600W total

๐Ÿ› ๏ธ Technical Deep Dive

  • โ€ขBlackwell GPU (SM120) includes 24,064 CUDA cores, 752 5th-gen Tensor Cores (up to 4000 AI TOPS), 188 4th-gen RT Cores, PCIe 5.0 x16, and MIG partitioning for multi-instance workloads[4][5].
  • โ€ข96GB GDDR7 ECC memory on 512-bit bus delivers 1.792 TB/s bandwidth, critical for KV cache in long-context (32k/64k) LLM serving with high batch sizes[1][2][4].
  • โ€ขNative NVFP4/FP4 support in Tensor Cores enhances quantized inference efficiency for models like Qwen3-30B AWQ MoE, where active params (~3.3B) fit with 72GB KV headroom at 400 concurrency[1][2].

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

RTX PRO 6000 will reduce multi-GPU needs for 70B+ quantized MoE inference by 75% vs Ada generation
96GB VRAM and NVFP4 support enable single-GPU operation for models exceeding 80GB, as shown in 30B AWQ benchmarks rivaling 4x prior-gen setups[1][2].
CUTLASS SM120 bugs will be fixed in next CUDA update, unlocking 20-30% Marlin TP gains
Filed bug #3096 indicates initialization issues forcing FP16 fallback, common in early Blackwell kernel ports per similar reports[article context].

โณ Timeline

2024-03
NVIDIA announces Blackwell architecture at GTC with B100/B200 datacenter GPUs
2025-09
RTX PRO 6000 Blackwell Workstation launches with 96GB GDDR7 and SM120 compute
2025-11
Early benchmarks show RTX PRO 6000 matching 4x RTX 4090 on 30B MoE models
2026-01
Qwen3.5-397B released, enabling large-scale MoE testing on consumer hardware
2026-03
Reddit benchmark reports 50.5 tok/s on 4x RTX PRO 6000 with Marlin TP=4 for Qwen3.5-397B NVFP4
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—