🤖 Reddit r/MachineLearning • collected 2h ago
Qwen3.5 MoE: Breakthrough or Incremental?
💡 397B MoE with ultra-low active params: open-source game-changer?
⚡ 30-Second TL;DR
What Changed
397B total parameters with only 17B active per token in an MoE setup
Why It Matters
If it is a genuine breakthrough, it could make training and inference of massive open-source models far more efficient, democratizing high-performance AI.
What To Do Next
Download and benchmark Qwen3.5-397B-A17B on your MoE routing tasks.
Who should care: Researchers & Academics
🧠 Deep Insight
Web-grounded analysis with 8 cited sources.
📌 Enhanced Key Takeaways
- Qwen3.5 introduces a Shared Expert in its MoE architecture: a dedicated dense MLP that processes every token alongside the top-8 routed experts (out of 64) for enhanced stability[1].
- It employs a hybrid attention mechanism with Gated Delta Networks in 75% of layers for linear complexity, enabling native support for up to 262k-token contexts and reduced KV-cache memory[2][4][5].
- The model supports native multimodality as a visual agent via DeepStack, 3D convolutions, and mRoPE, optimized for AMD Instinct and NVIDIA GPUs[1][6].
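The shared-expert routing described above can be sketched in a few lines. This is a minimal, illustrative NumPy implementation assuming the common formulation (dense shared MLP on every token, plus a softmax router that selects the top-k of N experts and renormalizes their weights); dimensions, initialization, and the ReLU activation are assumptions for readability, not the released Qwen3.5 configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(d_model, d_ff):
    # Two-layer MLP weights (expert or shared); scaled random init.
    return (rng.standard_normal((d_model, d_ff)) / np.sqrt(d_model),
            rng.standard_normal((d_ff, d_model)) / np.sqrt(d_ff))

def mlp(params, x):
    w1, w2 = params
    return np.maximum(x @ w1, 0.0) @ w2   # ReLU stand-in for the real activation

def shared_expert_moe(x, shared, experts, router_w, top_k=8):
    # Shared expert: dense MLP applied to every token.
    out = mlp(shared, x)
    # Router: softmax over all experts, per token.
    logits = x @ router_w
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    for t in range(x.shape[0]):
        idx = np.argsort(probs[t])[-top_k:]       # top-k experts for token t
        w = probs[t, idx] / probs[t, idx].sum()   # renormalize over the top-k
        for wi, ei in zip(w, idx):                # only k of N experts run
            out[t] += wi * mlp(experts[ei], x[t])
    return out

d_model, d_ff, n_experts, top_k = 16, 32, 64, 8
shared = make_mlp(d_model, d_ff)
experts = [make_mlp(d_model, d_ff) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts))
x = rng.standard_normal((4, d_model))
y = shared_expert_moe(x, shared, experts, router_w, top_k)
print(y.shape)  # (4, 16)
```

The "17B active of 397B total" figure follows from this pattern: every token pays for the shared expert plus only 8 of the 64 routed experts, so most parameters sit idle on any given token.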
🛠️ Technical Deep Dive
- MoE setup: 397B total parameters, 17B active per token; includes a Shared Expert (universal dense MLP) plus routed experts (top-8 of 64 via a Top-K Router)[1][5].
- Attention: hybrid Gated DeltaNet (linear attention) + full attention, with 75% of layers linear, achieving linear scaling for long contexts up to 262k tokens[1][2][4][5].
- Multimodal: native VLM with DeepStack, 3D convolutions, and mRoPE positional embeddings; supports UI navigation and visual reasoning[1][6].
- Optimization: hipBLASLt for Shared Expert GEMM, AITER FusedMoE for routed experts (AMD); MIOpen/PyTorch for vision; runs on a single AMD Instinct GPU[1].
- Variants: smaller models such as Qwen3.5-35B-A3B (3B active, outperforms the prior 235B-A22B) and Qwen3.5-122B-A10B, with dual-mode thinking/non-thinking[2][3][4].
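The attention item above is the key to the constant-memory long-context claim: a delta-rule linear-attention layer keeps a fixed-size state matrix instead of a growing KV cache. Below is a hedged sketch of one common gated delta-rule recurrence (decay the state, then correct it toward the new value); the exact gating and normalization in Qwen3.5's Gated DeltaNet layers may differ, and all shapes and gate values here are illustrative assumptions.

```python
import numpy as np

def gated_delta_attention(q, k, v, alpha, beta):
    """Recurrent gated delta rule (illustrative sketch, not Qwen3.5's
    exact parameterization). State S is a fixed (d_v, d_k) matrix, so
    memory stays constant with sequence length rather than growing
    like a softmax-attention KV cache."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.zeros((T, d_v))
    for t in range(T):
        kt = k[t] / np.linalg.norm(k[t])       # unit-norm key
        decayed = alpha[t] * S                 # gate: forget old associations
        pred = decayed @ kt                    # what the state predicts for kt
        # Delta update: write only the error between v_t and the prediction.
        S = decayed + beta[t] * np.outer(v[t] - pred, kt)
        out[t] = S @ q[t]                      # read with the query
    return out

rng = np.random.default_rng(0)
T, d = 6, 8
o = gated_delta_attention(rng.standard_normal((T, d)),
                          rng.standard_normal((T, d)),
                          rng.standard_normal((T, d)),
                          alpha=np.full(T, 0.9), beta=np.full(T, 0.5))
print(o.shape)  # (6, 8)
```

Because each step touches only the fixed state `S`, time is linear in sequence length and memory is O(d_k·d_v) regardless of context size, which is what makes 262k-token contexts tractable in the 75% of layers that use this form.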
🔮 Future Implications
AI analysis grounded in cited sources.
Qwen3.5's hybrid MoE could cut inference costs by 3-6x for long-context agents.
Linear attention and sparse activation enable processing of contexts approaching 1M tokens with minimal compute growth, as benchmarks against dense models suggest[2].
⏳ Timeline
2026-02
Qwen3 team releases Qwen3-Coder-Next 80B (3B active) with early hybrid attention
2026-02
Qwen3-235B-A22B released as prior flagship MoE model
2026-02-16
Qwen3.5 first release: 397B-A17B MoE on GitHub and blog
2026-02
Qwen3.5 medium models (35B-A3B, 122B-A10B, 27B) announced post-397B
📚 Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- [1] amd.com → Day 0 Support for Qwen 3.5 on AMD Instinct GPUs
- [2] digitalapplied.com → Qwen 3.5 Medium Model Series: Benchmarks & Pricing Guide
- [3] siliconflow.com → The Best Qwen3 Models in 2025
- [4] kaitchup.substack.com → Qwen3.5 Medium Models: Dense vs MoE
- [5] magazine.sebastianraschka.com → A Dream of Spring for Open Weight
- [6] developer.nvidia.com → Develop Native Multimodal Agents with Qwen3.5 VLM Using NVIDIA GPU-Accelerated Endpoints
- [7] qwen.ai → Blog
- [8] GitHub → Qwen3
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning →

