๐Ÿฆ™Stalecollected in 21h

Qwen3.5 122B Excels on 3x3090 at 25 tok/s

Qwen3.5 122B Excels on 3x3090 at 25 tok/s
PostLinkedIn
๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’ก122B model runs 25 tok/s on 3x3090 w/ 120k ctxโ€”settings for loop-free perf.

โšก 30-Second TL;DR

What Changed

25 tok/s on 3x3090 (72GB VRAM) fully GPU-loaded

Why It Matters

Proves high-end consumer GPUs viable for 122B models, democratizing access to top performance without data center costs.

What To Do Next

Quantize Qwen3.5 122B to Q3_K and apply shared sampling params on your multi-GPU rig.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

Web-grounded analysis with 7 cited sources.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขQwen3.5-122B-A10B uses a hybrid Mixture-of-Experts (MoE) architecture with only 10 billion active parameters, enabling efficient inference on consumer-grade hardware despite its 122B total parameter count[2].
  • โ€ขThe model integrates Gated Delta Networks (linear attention) with standard Gated Attention blocks, reducing memory footprint and enabling high-throughput decoding on standard hardware[2].
  • โ€ขQwen3.5-122B-A10B supports 256K context length across 201 languages with both thinking and non-thinking modes, outperforming the previous generation Qwen3-235B-2507 in text capabilities and Qwen3-VL-235B in visual capabilities[1][3].

๐Ÿ› ๏ธ Technical Deep Dive

  • Architecture: Hybrid design combining Gated Delta Networks (linear attention mechanism) with sparse Mixture-of-Experts (MoE) model[1][2]
  • Active Parameters: 10 billion active parameters (A10B) out of 122B total, enabling efficient inference[2]
  • Context Length: 262,144 tokens (256K) supporting long-horizon tasks[1]
  • Input/Output Types: Accepts text, image, and video inputs; outputs text[1]
  • Training Pipeline: Four-stage post-training involving long chain-of-thought (CoT) cold starts and reasoning-based reinforcement learning[2]
  • Performance Benchmark: Maintains logical consistency over long-horizon tasks while rivaling much larger dense models[2]
  • Quantization Support: Compatible with Q3_K quantization for 120k context deployment on 72GB VRAM systems[Search results do not provide quantization details; inference from article context]

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

MoE efficiency gains will drive adoption of smaller models for enterprise deployment
Qwen3.5-35B-A3B outperforming the 235B predecessor demonstrates that architectural efficiency can replace raw parameter scaling, reducing infrastructure costs for organizations.
Consumer-grade multi-GPU setups become viable for frontier-level model inference
Achieving 25 tok/s on three RTX 3090s (72GB total VRAM) makes high-performance local inference accessible without enterprise-grade hardware investment.

โณ Timeline

2026-02-24
Alibaba Qwen team releases Qwen3.5 Medium Model Series including Qwen3.5-122B-A10B, Qwen3.5-35B-A3B, and Qwen3.5-27B
2026-02-25
Qwen3.5-122B-A10B added to Writingmate AI model registry
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—