More MoE experts: meaningful gains?

💡 Does activating more MoE experts than the A3B default pay off? An easy llama.cpp test revives an old debate.

⚡ 30-Second TL;DR

What Changed

Debate on Qwen3-30B-A3B vs A6B expert scaling

Why It Matters

If tests confirm gains, this could revive interest in MoE tuning for local LLMs, because the number of active experts can be adjusted at inference time without retraining the full model.

What To Do Next

Benchmark Qwen3-30B-A6B against the A3B baseline in llama.cpp to see whether expert scaling helps on your own tasks.
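
One way to try this without downloading a separate A6B build is to override the active-expert count on an existing A3B GGUF file at load time. The sketch below only assembles a `llama-cli` command line. It assumes your llama.cpp build supports `--override-kv` and that the metadata key for this architecture is `qwen3moe.expert_used_count`; check both against your build and your GGUF file. The model file name and prompt are placeholders.

```python
import shlex


def build_llama_cli_command(model_path, active_experts, prompt, n_predict=256):
    """Assemble a llama-cli argv that overrides the active-expert count.

    Assumption: the GGUF metadata key is 'qwen3moe.expert_used_count'.
    Inspect your model's metadata before relying on it.
    """
    if active_experts < 1:
        raise ValueError("active_experts must be at least 1")
    return [
        "llama-cli",
        "-m", model_path,
        "--override-kv", f"qwen3moe.expert_used_count=int:{active_experts}",
        "-p", prompt,
        "-n", str(n_predict),
    ]


# Print one command for the default (8 experts) and one for the A6B-style setting (16).
for k in (8, 16):
    argv = build_llama_cli_command("qwen3-30b-a3b.gguf", k, "Explain MoE routing.")
    print(shlex.join(argv))
```

Running both commands on the same prompts and comparing tokens per second and answer quality gives a like-for-like test on your own hardware.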

Who should care: Researchers & Academics

Web-grounded analysis with 8 cited sources.

• Qwen3-30B-A3B has 30.5 billion total parameters, of which 3.3 billion are activated. It uses 48 layers and 128 experts, only 8 of which are activated per token, and supports a 131K-token context window[8][1].
• Community-modified versions such as DavidAU's Qwen3-30B-A6B-16-Extreme raise the number of active experts to 16 (activating ~6B parameters), trading inference speed for potentially deeper reasoning on nuanced tasks, with GPU speeds comparable to 6B dense models[2].
• Qwen3-30B-A3B-Thinking-2507 is a specialized variant refined over three months to improve reasoning quality and depth. The Qwen3 Coder 30B A3B Instruct variant outputs 25.6 tokens/second on Alibaba's API, which ranks low for speed among similar open-weight models[6][3].
• Nvidia's Nemotron-3-Nano-30B-A3B matches Qwen3-30B-A3B in local coding benchmarks but is slower on code generation tasks[5].

📊 Competitor Analysis

Model	Total Params	Active Params	Key Benchmarks	Speed
Qwen3-30B-A3B	30B	3B	ArenaHard: 91.0, AIME’24/25: 80.4	25.6 t/s (Coder Instruct variant, Alibaba API) [3]
Nemotron-3-Nano-30B-A3B	30B	3B	Similar accuracy to GPT OSS 20B in coding evals	Comparable, slightly slower [5]
Qwen3-235B-A22B	235B	22B	Outperforms DeepSeek R1, GPT-4o in coding/math	Faster inference than giants [4]

• Architecture: 30.5B total parameters, 3.3B active; 48 layers; 128 MoE experts, with 8 activated per token by default[8].
• Context: up to 131K input tokens; the modified A6B variant supports 32K input plus 8K output (40K total)[1][2].
• Inference: the base A3B runs at reading speed locally; A6B-16-Extreme roughly halves tokens/second but activates ~6B parameters for complex tasks; GPU inference is 4x-8x faster than CPU[2][4].
• Variants: Qwen3 Coder 30B A3B Instruct scores 20/100 on the Intelligence Index (above average) and is verbose, producing 13M output tokens against a median of 5.6M[3].
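
The link between the number of active experts and the active-parameter count can be estimated from the figures above. This is a back-of-the-envelope sketch, not the model's official accounting: it assumes every expert is the same size and that each token always uses a fixed shared portion (attention, embeddings, router), then solves for both from the published 30.5B total and 3.3B active numbers. The function name and the linear model are illustrative assumptions.

```python
def estimate_active_params(total_b, active_b, n_experts, k_default, k_new):
    """Linear estimate of active parameters (in billions) for k_new active experts.

    Assumed model: active(k) = shared + k * per_expert, where per_expert is the
    size of one expert slot summed over all layers. Two known points fix it:
    active(k_default) = active_b and active(n_experts) = total_b.
    """
    per_expert = (total_b - active_b) / (n_experts - k_default)
    shared = active_b - k_default * per_expert
    return shared + k_new * per_expert


# Published Qwen3-30B-A3B figures: 30.5B total, 3.3B active, 128 experts, 8 active.
estimate = estimate_active_params(30.5, 3.3, 128, 8, 16)
print(f"Estimated active parameters with 16 experts: {estimate:.2f}B")
```

Under these assumptions, doubling the active experts gives roughly 5.1B active parameters, in the same range as the ~6B the A6B name suggests. The shared portion does not double, which is why the result is less than twice 3.3B.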

MoE expert scaling beyond A3B will remain niche due to speed trade-offs

Modified A6B configurations roughly halve inference speed without proportional benchmark gains, which limits adoption to specialized deep-reasoning use cases[2].

Qwen3-30B-A3B variants will dominate efficient local coding

Strong ArenaHard (91.0) and AIME scores, combined with a low active-parameter count, let these models outperform larger dense models while still running fast locally[1][4].

Post-release variants like Thinking-2507 will drive iterative MoE improvements

Three months of scaling enhanced reasoning depth, showing viability of targeted post-release optimizations[6].

2025-07

Qwen3-30B-A3B-Thinking-2507 released with three months of reasoning scaling

2025-12

DavidAU releases the Qwen3-30B-A6B-16-Extreme finetune, increasing active experts from 8 to 16

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
