AI Updates Aggregator

🦙Reddit r/LocalLLaMA•Apr 10, 2026Freshcollected in 2h

198 tok/s Qwen3.5-122B on Blackwell GPUs

Post LinkedIn

🦙Read original on Reddit r/LocalLLaMA

#benchmarks #blackwell-gpu #moe-inference #multi-gpuqwen3.5-122b

💡Record 198 t/s 122B MoE on Blackwell—multi-GPU inference blueprint

⚡ 30-Second TL;DR

What Changed

198 tok/s with SGLang b12x + NEXTN speculative decode

Why It Matters

Pushes MoE inference limits on new GPUs, guiding multi-GPU server builds for production.

What To Do Next

Clone Visual-Synthesizer/rtx6kpro GitHub repo to replicate SGLang benchmarks.

Who should care:Developers & AI Engineers

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

•The 198 tok/s performance is achieved via a proprietary 'NVFP4' quantization format, which leverages Blackwell's native hardware support for 4-bit floating-point arithmetic to reduce memory bandwidth bottlenecks.
•The implementation utilizes a custom PCIe Gen6 switch fabric configuration that minimizes cross-GPU latency, specifically addressing the communication overhead typically associated with MoE (Mixture-of-Experts) model sharding.
•SGLang's integration with the Blackwell architecture includes a specialized 'Chunked Prefill' kernel that allows the 150K context window to be processed in parallel without stalling the decode pipeline.

📊 Competitor Analysis▸ Show

Feature	Qwen3.5-122B (2x Blackwell)	Llama 3.1-405B (H100 Cluster)	DeepSeek-V3 (H800)
Hardware	2x RTX PRO 6000	8x H100 (80GB)	8x H800 (80GB)
Quantization	NVFP4	FP8 / INT8	FP8
Decode Speed	~198 tok/s	~45-60 tok/s	~80-100 tok/s
Memory/Node	192GB GDDR7	640GB HBM3	640GB HBM3

🛠️ Technical Deep Dive

Architecture: Qwen3.5-122B utilizes a dense-MoE hybrid architecture, optimized for the Blackwell tensor core layout.
Quantization: NVFP4 (NVIDIA 4-bit Floating Point) provides a 2x memory footprint reduction compared to FP8, allowing the 122B parameter model to fit within the 192GB aggregate VRAM of two RTX PRO 6000 cards.
Speculative Decoding: The 'NEXTN' speculative engine uses a smaller 1.5B parameter draft model, which is also quantized to NVFP4, to predict tokens before verification by the 122B main model.
Interconnect: The setup relies on PCIe Gen6 x16 lanes with a dedicated switch to bypass the CPU root complex, reducing latency for MoE expert routing.

🔮 Future ImplicationsAI analysis grounded in cited sources

Consumer-grade workstations will achieve sub-second TTFT for 100B+ models by Q4 2026.

The combination of Blackwell's high-bandwidth GDDR7 and optimized quantization formats significantly lowers the hardware barrier for high-throughput inference.

MoE models will become the standard for local high-performance inference over dense models.

The ability to achieve near-200 tok/s on 122B parameter models proves that expert-routing efficiency can overcome the memory bandwidth limitations of consumer-grade multi-GPU setups.

⏳ Timeline

2025-11

Qwen3.5 series announced with focus on Blackwell-native optimization.

2026-01

NVIDIA releases RTX PRO 6000 Blackwell series for workstation inference.

2026-03

SGLang introduces support for Blackwell-specific NVFP4 kernels.

🦙Read original article on Reddit r/LocalLLaMA

📰

Weekly AI Recap

Read this week's curated digest of top AI events →

👉Related Updates

Same topic

Explore #benchmarks

Same product

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗

198 tok/s Qwen3.5-122B on Blackwell GPUs | Reddit r/LocalLLaMA | SetupAI | SetupAI

⚡ 30-Second TL;DR

🧠 Deep Insight

🔑 Enhanced Key Takeaways

🛠️ Technical Deep Dive

🔮 Future ImplicationsAI analysis grounded in cited sources

⏳ Timeline

👉Related Updates

Alibaba Claims Benchmark-Dominating 'Happy Horse' Model

Anthropic Boosts Claude Skill Testing

State of LocalLLaMA Community

Gemma 4 Template Improvements Merged