๐ฆReddit r/LocalLLaMAโขStalecollected in 21h
Qwen3.5 122B Excels on 3x3090 at 25 tok/s

๐ก122B model runs 25 tok/s on 3x3090 w/ 120k ctxโsettings for loop-free perf.
โก 30-Second TL;DR
What Changed
25 tok/s on 3x3090 (72GB VRAM) fully GPU-loaded
Why It Matters
Proves high-end consumer GPUs viable for 122B models, democratizing access to top performance without data center costs.
What To Do Next
Quantize Qwen3.5 122B to Q3_K and apply shared sampling params on your multi-GPU rig.
Who should care:Developers & AI Engineers
๐ง Deep Insight
Web-grounded analysis with 7 cited sources.
๐ Enhanced Key Takeaways
- โขQwen3.5-122B-A10B uses a hybrid Mixture-of-Experts (MoE) architecture with only 10 billion active parameters, enabling efficient inference on consumer-grade hardware despite its 122B total parameter count[2].
- โขThe model integrates Gated Delta Networks (linear attention) with standard Gated Attention blocks, reducing memory footprint and enabling high-throughput decoding on standard hardware[2].
- โขQwen3.5-122B-A10B supports 256K context length across 201 languages with both thinking and non-thinking modes, outperforming the previous generation Qwen3-235B-2507 in text capabilities and Qwen3-VL-235B in visual capabilities[1][3].
๐ ๏ธ Technical Deep Dive
- Architecture: Hybrid design combining Gated Delta Networks (linear attention mechanism) with sparse Mixture-of-Experts (MoE) model[1][2]
- Active Parameters: 10 billion active parameters (A10B) out of 122B total, enabling efficient inference[2]
- Context Length: 262,144 tokens (256K) supporting long-horizon tasks[1]
- Input/Output Types: Accepts text, image, and video inputs; outputs text[1]
- Training Pipeline: Four-stage post-training involving long chain-of-thought (CoT) cold starts and reasoning-based reinforcement learning[2]
- Performance Benchmark: Maintains logical consistency over long-horizon tasks while rivaling much larger dense models[2]
- Quantization Support: Compatible with Q3_K quantization for 120k context deployment on 72GB VRAM systems[Search results do not provide quantization details; inference from article context]
๐ฎ Future ImplicationsAI analysis grounded in cited sources
MoE efficiency gains will drive adoption of smaller models for enterprise deployment
Qwen3.5-35B-A3B outperforming the 235B predecessor demonstrates that architectural efficiency can replace raw parameter scaling, reducing infrastructure costs for organizations.
Consumer-grade multi-GPU setups become viable for frontier-level model inference
Achieving 25 tok/s on three RTX 3090s (72GB total VRAM) makes high-performance local inference accessible without enterprise-grade hardware investment.
โณ Timeline
2026-02-24
Alibaba Qwen team releases Qwen3.5 Medium Model Series including Qwen3.5-122B-A10B, Qwen3.5-35B-A3B, and Qwen3.5-27B
2026-02-25
Qwen3.5-122B-A10B added to Writingmate AI model registry
๐ Sources (7)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- writingmate.ai โ Qwen3.5 122b A10b
- marktechpost.com โ Alibaba Qwen Team Releases Qwen 3 5 Medium Model Series a Production Powerhouse Proving That Smaller AI Models Are Smarter
- unsloth.ai โ Qwen3
- alibabacloud.com โ Models
- forums.developer.nvidia.com โ 361639
- openrouter.ai โ Qwen3.5 122b A10b
- GitHub โ Qwen3
๐ฐ
Weekly AI Recap
Read this week's curated digest of top AI events โ
๐Related Updates
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ

