๐คReddit r/MachineLearningโขStalecollected in 12h
1M Tokens/s Qwen 3.5 on B200 GPUs
๐กBenchmark secrets for 1M tok/s Qwen on 96 B200s revealed
โก 30-Second TL;DR
What Changed
1.1M tok/s on 96 B200s with vLLM v0.18.0, FP8 dense model.
Why It Matters
Sets new bar for LLM inference scaling on latest NVIDIA GPUs, aiding high-throughput production deployments.
What To Do Next
Test vLLM v0.18.0 MTP-1 on B200s for your 27B-scale LLM serving.
Who should care:Developers & AI Engineers
๐ง Deep Insight
AI-generated analysis for this event.
๐ Enhanced Key Takeaways
- โขThe 1.1M tok/s throughput is achieved specifically by leveraging the Blackwell architecture's enhanced FP8 tensor core performance, which provides a significant compute-to-memory bandwidth advantage over Hopper-based deployments for dense models like Qwen 3.5.
- โขThe observed 4x performance gap between Data Parallelism (DP=8) and Tensor Parallelism (TP=8) highlights the diminishing returns of inter-GPU communication overhead on NVLink Switch systems when the model size fits comfortably within the HBM3e capacity of a single B200.
- โขThe 35% overhead introduced by the Inference Gateway is primarily attributed to the serialization/deserialization latency and the dynamic load balancing logic required to manage the high-concurrency request streams necessary to saturate 96 B200s.
๐ Competitor Analysisโธ Show
| Feature | Qwen 3.5 (B200/vLLM) | Llama 3.3 (H100/vLLM) | DeepSeek-V3 (H200/SGLang) |
|---|---|---|---|
| Throughput (Tok/s) | ~11.4k/GPU | ~8.2k/GPU | ~9.5k/GPU |
| Precision | FP8 | FP8/BF16 | FP8 |
| Scaling Efficiency | 97.1% | 92% | 94% |
| Primary Bottleneck | Gateway Latency | Inter-node NVLink | Memory Bandwidth |
๐ ๏ธ Technical Deep Dive
- Architecture: Qwen 3.5 27B utilizes a dense Transformer architecture optimized for FP8 quantization, allowing for higher batch sizes per B200 compared to FP16.
- vLLM v0.18.0 Enhancements: Includes specific kernel optimizations for Blackwell's Transformer Engine, enabling asynchronous FP8 GEMM operations.
- MTP (Multi-Token Prediction) Integration: MTP-1 allows the model to predict the next token while simultaneously verifying the previous token's logit distribution, reducing the effective latency per token.
- Communication Topology: The DP=8 configuration minimizes cross-node traffic by replicating the model weights across nodes, relying on local NVLink for intra-node communication, which avoids the latency penalties of All-Reduce operations required by TP=8.
๐ฎ Future ImplicationsAI analysis grounded in cited sources
Inference Gateways will become the primary bottleneck for multi-node GPU clusters.
As compute throughput scales linearly with new GPU architectures, the overhead of request routing and load balancing is failing to scale at the same rate.
Data Parallelism will replace Tensor Parallelism for sub-30B parameter models in production.
The massive HBM capacity of modern GPUs makes model replication more efficient than the communication-heavy partitioning required by Tensor Parallelism.
โณ Timeline
2024-09
Qwen 2.5 series release, establishing the foundation for the 3.5 architecture.
2025-03
NVIDIA begins mass shipments of B200 Blackwell GPUs to major cloud providers.
2026-01
vLLM v0.18.0 release, introducing native support for Blackwell-specific FP8 kernels.
2026-02
Qwen 3.5 27B model weights released with optimized FP8 quantization parameters.
๐ฐ
Weekly AI Recap
Read this week's curated digest of top AI events โ
๐Related Updates
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning โ