๐Ÿค–Stalecollected in 12h

1M Tokens/s Qwen 3.5 on B200 GPUs

PostLinkedIn
๐Ÿค–Read original on Reddit r/MachineLearning

๐Ÿ’กBenchmark secrets for 1M tok/s Qwen on 96 B200s revealed

โšก 30-Second TL;DR

What Changed

1.1M tok/s on 96 B200s with vLLM v0.18.0, FP8 dense model.

Why It Matters

Sets new bar for LLM inference scaling on latest NVIDIA GPUs, aiding high-throughput production deployments.

What To Do Next

Test vLLM v0.18.0 MTP-1 on B200s for your 27B-scale LLM serving.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

AI-generated analysis for this event.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขThe 1.1M tok/s throughput is achieved specifically by leveraging the Blackwell architecture's enhanced FP8 tensor core performance, which provides a significant compute-to-memory bandwidth advantage over Hopper-based deployments for dense models like Qwen 3.5.
  • โ€ขThe observed 4x performance gap between Data Parallelism (DP=8) and Tensor Parallelism (TP=8) highlights the diminishing returns of inter-GPU communication overhead on NVLink Switch systems when the model size fits comfortably within the HBM3e capacity of a single B200.
  • โ€ขThe 35% overhead introduced by the Inference Gateway is primarily attributed to the serialization/deserialization latency and the dynamic load balancing logic required to manage the high-concurrency request streams necessary to saturate 96 B200s.
๐Ÿ“Š Competitor Analysisโ–ธ Show
FeatureQwen 3.5 (B200/vLLM)Llama 3.3 (H100/vLLM)DeepSeek-V3 (H200/SGLang)
Throughput (Tok/s)~11.4k/GPU~8.2k/GPU~9.5k/GPU
PrecisionFP8FP8/BF16FP8
Scaling Efficiency97.1%92%94%
Primary BottleneckGateway LatencyInter-node NVLinkMemory Bandwidth

๐Ÿ› ๏ธ Technical Deep Dive

  • Architecture: Qwen 3.5 27B utilizes a dense Transformer architecture optimized for FP8 quantization, allowing for higher batch sizes per B200 compared to FP16.
  • vLLM v0.18.0 Enhancements: Includes specific kernel optimizations for Blackwell's Transformer Engine, enabling asynchronous FP8 GEMM operations.
  • MTP (Multi-Token Prediction) Integration: MTP-1 allows the model to predict the next token while simultaneously verifying the previous token's logit distribution, reducing the effective latency per token.
  • Communication Topology: The DP=8 configuration minimizes cross-node traffic by replicating the model weights across nodes, relying on local NVLink for intra-node communication, which avoids the latency penalties of All-Reduce operations required by TP=8.

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

Inference Gateways will become the primary bottleneck for multi-node GPU clusters.
As compute throughput scales linearly with new GPU architectures, the overhead of request routing and load balancing is failing to scale at the same rate.
Data Parallelism will replace Tensor Parallelism for sub-30B parameter models in production.
The massive HBM capacity of modern GPUs makes model replication more efficient than the communication-heavy partitioning required by Tensor Parallelism.

โณ Timeline

2024-09
Qwen 2.5 series release, establishing the foundation for the 3.5 architecture.
2025-03
NVIDIA begins mass shipments of B200 Blackwell GPUs to major cloud providers.
2026-01
vLLM v0.18.0 release, introducing native support for Blackwell-specific FP8 kernels.
2026-02
Qwen 3.5 27B model weights released with optimized FP8 quantization parameters.
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning โ†—