🦙Reddit r/LocalLLaMA•Freshcollected in 2h
198 tok/s Qwen3.5-122B on Blackwell GPUs
💡Record 198 t/s 122B MoE on Blackwell—multi-GPU inference blueprint
⚡ 30-Second TL;DR
What Changed
198 tok/s with SGLang b12x + NEXTN speculative decode
Why It Matters
Pushes MoE inference limits on new GPUs, guiding multi-GPU server builds for production.
What To Do Next
Clone Visual-Synthesizer/rtx6kpro GitHub repo to replicate SGLang benchmarks.
Who should care:Developers & AI Engineers
🧠 Deep Insight
AI-generated analysis for this event.
🔑 Enhanced Key Takeaways
- •The 198 tok/s performance is achieved via a proprietary 'NVFP4' quantization format, which leverages Blackwell's native hardware support for 4-bit floating-point arithmetic to reduce memory bandwidth bottlenecks.
- •The implementation utilizes a custom PCIe Gen6 switch fabric configuration that minimizes cross-GPU latency, specifically addressing the communication overhead typically associated with MoE (Mixture-of-Experts) model sharding.
- •SGLang's integration with the Blackwell architecture includes a specialized 'Chunked Prefill' kernel that allows the 150K context window to be processed in parallel without stalling the decode pipeline.
📊 Competitor Analysis▸ Show
| Feature | Qwen3.5-122B (2x Blackwell) | Llama 3.1-405B (H100 Cluster) | DeepSeek-V3 (H800) |
|---|---|---|---|
| Hardware | 2x RTX PRO 6000 | 8x H100 (80GB) | 8x H800 (80GB) |
| Quantization | NVFP4 | FP8 / INT8 | FP8 |
| Decode Speed | ~198 tok/s | ~45-60 tok/s | ~80-100 tok/s |
| Memory/Node | 192GB GDDR7 | 640GB HBM3 | 640GB HBM3 |
🛠️ Technical Deep Dive
- Architecture: Qwen3.5-122B utilizes a dense-MoE hybrid architecture, optimized for the Blackwell tensor core layout.
- Quantization: NVFP4 (NVIDIA 4-bit Floating Point) provides a 2x memory footprint reduction compared to FP8, allowing the 122B parameter model to fit within the 192GB aggregate VRAM of two RTX PRO 6000 cards.
- Speculative Decoding: The 'NEXTN' speculative engine uses a smaller 1.5B parameter draft model, which is also quantized to NVFP4, to predict tokens before verification by the 122B main model.
- Interconnect: The setup relies on PCIe Gen6 x16 lanes with a dedicated switch to bypass the CPU root complex, reducing latency for MoE expert routing.
🔮 Future ImplicationsAI analysis grounded in cited sources
Consumer-grade workstations will achieve sub-second TTFT for 100B+ models by Q4 2026.
The combination of Blackwell's high-bandwidth GDDR7 and optimized quantization formats significantly lowers the hardware barrier for high-throughput inference.
MoE models will become the standard for local high-performance inference over dense models.
The ability to achieve near-200 tok/s on 122B parameter models proves that expert-routing efficiency can overcome the memory bandwidth limitations of consumer-grade multi-GPU setups.
⏳ Timeline
2025-11
Qwen3.5 series announced with focus on Blackwell-native optimization.
2026-01
NVIDIA releases RTX PRO 6000 Blackwell series for workstation inference.
2026-03
SGLang introduces support for Blackwell-specific NVFP4 kernels.
📰
Weekly AI Recap
Read this week's curated digest of top AI events →
👉Related Updates
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗



