๐Ÿฆ™Stalecollected in 4h

397B Qwen MoE Achieves 20 t/s on RTX 5090

PostLinkedIn
๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’กSingle 5090 runs 397B MoE at 20 t/s TG w/ RAM offloadโ€”local frontier viable

โšก 30-Second TL;DR

What Changed

717 t/s prompt, 20 t/s TG on 397B Q4_K_M with ngl 999

Why It Matters

Proves single high-end consumer GPU + ample system RAM can handle 400B-class MoEs efficiently, democratizing access to frontier models locally.

What To Do Next

Benchmark Qwen3.5-397B-A17B Q4_K_M on your 5090 using llama-bench with CPU FFN offload.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

AI-generated analysis for this event.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขThe performance is achieved through a specialized implementation of llama.cpp that leverages 'MoE-aware' offloading, specifically targeting the sparse activation nature of the Qwen3.5-397B architecture to minimize PCIe bandwidth bottlenecks.
  • โ€ขThe 20 t/s throughput is highly dependent on the specific CPU's memory bandwidth; the use of AMD EPYC processors with high-channel DDR4/DDR5 memory is critical to prevent the CPU-side FFN computation from becoming the primary latency driver.
  • โ€ขThis benchmark demonstrates the viability of 'hybrid-compute' inference, where the GPU acts as a high-speed KV cache and attention mechanism accelerator, while the system RAM handles the bulk of the model's parameter storage.
๐Ÿ“Š Competitor Analysisโ–ธ Show
FeatureQwen3.5-397B (Hybrid)Grok-3 (Cloud)DeepSeek-V3 (Cloud)
HardwareSingle RTX 5090 + EPYCH100/B200 ClustersH100/B200 Clusters
Latency~20 t/s (Variable)< 50 t/s (Stable)< 50 t/s (Stable)
CostHardware CapEx onlyUsage-based (High)Usage-based (High)
PrivacyLocal/PrivateCloud-dependentCloud-dependent

๐Ÿ› ๏ธ Technical Deep Dive

  • Architecture: Qwen3.5-397B is a Mixture-of-Experts (MoE) model with 397B total parameters and 17B active parameters per token.
  • Offloading Strategy: The implementation uses llama.cpp's n_gpu_layers (ngl) parameter to keep the attention heads and KV cache on the RTX 5090's 32GB VRAM, while offloading the Feed-Forward Network (FFN) expert layers to system RAM.
  • Memory Bottleneck: The system relies on the PCIe 5.0 bus to transfer expert weights from system RAM to the GPU/CPU compute units, making high-bandwidth memory (HBM) or multi-channel DDR5 essential for maintaining generation speeds.
  • Quantization: The Q4_K_M quantization reduces the model footprint to approximately 220-230GB, allowing it to fit within the 256GB system RAM capacity.

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

Consumer-grade hardware will support full-parameter fine-tuning of 400B+ models within 18 months.
The success of hybrid offloading techniques suggests that memory-efficient training methods will soon follow the path of inference optimization.
The demand for high-bandwidth system memory (DDR5/DDR6) will outpace GPU VRAM growth for local LLM enthusiasts.
As models grow larger, the bottleneck shifts from GPU compute to the system-to-GPU data transfer rate, necessitating faster system memory architectures.

โณ Timeline

2025-09
Release of Qwen3.5 series with enhanced MoE architecture.
2026-01
NVIDIA launches RTX 5090 with 32GB VRAM, enabling new local inference benchmarks.
2026-02
Llama.cpp introduces optimized MoE-offloading support for high-parameter models.
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—