Qwen3.5 122B Excels on 3x3090 at 25 tok/s

🦙Read original on Reddit r/LocalLLaMA

#quantization #multi-gpu #inference-speed #qwen3.5-122b

💡A 122B model runs at 25 tok/s on 3x3090 with a 120k context; the post shares settings for loop-free output.

⚡ 30-Second TL;DR

What Changed

25 tok/s on 3x3090 (72GB VRAM) fully GPU-loaded

Why It Matters

Shows that high-end consumer GPUs can run a 122B model at usable speed, opening up top-tier performance without data center costs.

What To Do Next

Quantize Qwen3.5 122B to Q3_K and apply the sampling parameters shared in the original post on your multi-GPU rig.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 7 cited sources.

🔑 Enhanced Key Takeaways

•Qwen3.5-122B-A10B uses a hybrid Mixture-of-Experts (MoE) architecture with only 10 billion active parameters, enabling efficient inference on consumer-grade hardware despite its 122B total parameter count[2].
•The model integrates Gated Delta Networks (linear attention) with standard Gated Attention blocks, reducing memory footprint and enabling high-throughput decoding on standard hardware[2].
•Qwen3.5-122B-A10B supports 256K context length across 201 languages with both thinking and non-thinking modes, outperforming the previous generation Qwen3-235B-2507 in text capabilities and Qwen3-VL-235B in visual capabilities[1][3].
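
The ratios implied by these takeaways can be worked out directly. The sketch below is illustrative arithmetic only: it takes the 122B total and 10B active parameter counts and the 262,144-token window from the figures quoted in this article, and it assumes that "120k" means 120,000 tokens, which the post does not spell out.

```python
# Illustrative arithmetic only; inputs are the figures quoted in the article.
TOTAL_PARAMS = 122e9      # total parameters (from the model name)
ACTIVE_PARAMS = 10e9      # active parameters per token (the "A10B" suffix)
MAX_CONTEXT = 262_144     # advertised context window in tokens
USED_CONTEXT = 120_000    # assumption: "120k" taken as 120,000 tokens


def fraction(part: float, whole: float) -> float:
    """Return part / whole, rejecting a zero denominator."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return part / whole


active_fraction = fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
context_fraction = fraction(USED_CONTEXT, MAX_CONTEXT)

print(f"active parameters per token: {active_fraction:.1%}")
print(f"context window used:         {context_fraction:.1%}")
```

Under these assumptions, roughly 8% of the weights are active for any one token, and the reported run uses a little under half of the advertised context window.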

🛠️ Technical Deep Dive

Architecture: Hybrid design combining Gated Delta Networks (a linear attention mechanism) with sparse Mixture-of-Experts (MoE) layers[1][2]
Active Parameters: 10 billion active parameters (A10B) out of 122B total, enabling efficient inference[2]
Context Length: 262,144 tokens (256K) supporting long-horizon tasks[1]
Input/Output Types: Accepts text, image, and video inputs; outputs text[1]
Training Pipeline: Four-stage post-training involving long chain-of-thought (CoT) cold starts and reasoning-based reinforcement learning[2]
Performance Benchmark: Maintains logical consistency over long-horizon tasks while rivaling much larger dense models[2]
Quantization Support: The original post reports running a Q3_K quantization with a 120k context on 72GB of VRAM. The cited sources do not provide quantization details, so this point is inferred from the article context.
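
A rough VRAM budget shows why a 3-bit quantization is the natural fit for 72GB. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes about 3.4375 bits per weight (the size of a plain q3_K block in llama.cpp, 110 bytes per 256 weights), treats GB loosely as 10^9 bytes, and ignores that real Q3_K mixes keep some tensors at higher precision and therefore come out larger.

```python
# Back-of-the-envelope VRAM budget for quantized weights split across GPUs.
# Every constant here is an assumption for illustration, not a measurement.
TOTAL_PARAMS = 122e9        # total parameter count from the model name
BITS_PER_WEIGHT = 3.4375    # assumed: plain q3_K block, 110 bytes per 256 weights
NUM_GPUS = 3
VRAM_PER_GPU_GB = 24        # RTX 3090


def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


weights_gb = weight_footprint_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
total_vram_gb = NUM_GPUS * VRAM_PER_GPU_GB
headroom_gb = total_vram_gb - weights_gb

print(f"weights:  {weights_gb:.1f} GB")
print(f"VRAM:     {total_vram_gb} GB")
print(f"headroom: {headroom_gb:.1f} GB for KV cache, activations, and buffers")
```

With these inputs the weights come to about 52 GB, leaving roughly 20 GB across the three cards for the KV cache and runtime buffers. How far that headroom stretches toward a 120k context depends on the KV cache format, which the article does not specify.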

🔮 Future Implications

AI analysis grounded in cited sources

MoE efficiency gains will drive adoption of smaller models for enterprise deployment

Qwen3.5-35B-A3B outperforming the 235B predecessor demonstrates that architectural efficiency can replace raw parameter scaling, reducing infrastructure costs for organizations.

Consumer-grade multi-GPU setups become viable for frontier-level model inference

Achieving 25 tok/s on three RTX 3090s (72GB total VRAM) makes high-performance local inference accessible without enterprise-grade hardware investment.
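
To put the 25 tok/s figure in context, a simple memory-bandwidth ceiling can be estimated. This is a rough sketch under stated assumptions: decoding is treated as bound by reading the active weights once per token, only one GPU's bandwidth counts at a time (as with a layer-wise split), the RTX 3090's spec bandwidth of 936 GB/s is used, and weights are assumed to take 3.4375 bits each. KV cache reads, inter-GPU transfers, and kernel overhead are all ignored, so real throughput will sit well below this number.

```python
# Rough bandwidth ceiling for decode speed; all inputs are assumptions.
ACTIVE_PARAMS = 10e9         # active parameters per token ("A10B")
BITS_PER_WEIGHT = 3.4375     # assumed plain q3_K block size
GPU_BANDWIDTH_GB_S = 936.0   # RTX 3090 spec memory bandwidth
REPORTED_TOK_S = 25.0        # throughput reported in the post


def decode_ceiling_tok_s(active_params: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if each token only reads the active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, BITS_PER_WEIGHT, GPU_BANDWIDTH_GB_S)
utilization = REPORTED_TOK_S / ceiling

print(f"bandwidth ceiling: {ceiling:.0f} tok/s")
print(f"reported speed is {utilization:.0%} of that ceiling")
```

The gap between the ceiling and the reported speed is expected for a model split across cards with a long context, and it should not be read as a defect in the setup.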