๐Ÿฆ™Stalecollected in 5h

More MoE experts: meaningful gains?

PostLinkedIn
๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’กDoes scaling MoE experts beyond A3B pay off? Easy Llama.cpp test revives old debate.

โšก 30-Second TL;DR

What Changed

Debate on Qwen3-30B-A3B vs A6B expert scaling

Why It Matters

Could revive interest in MoE tuning for local LLMs if tests confirm gains, optimizing inference without full model retraining.

What To Do Next

Run benchmarks on Qwen3-30B-A6B in Llama.cpp to test expert scaling on your tasks.

Who should care:Researchers & Academics

๐Ÿง  Deep Insight

Web-grounded analysis with 8 cited sources.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขQwen3-30B-A3B features 30.5 billion total parameters with 3.3 billion activated, utilizing 48 layers and 128 experts where only 8 are activated per task, supporting a 131K token context window[8][1].
  • โ€ขCommunity-modified versions like DavidAU's Qwen3-30B-A6B-16-Extreme increase active experts to 16 (activating ~6B parameters), trading inference speed for potentially deeper reasoning on nuanced tasks, with GPU speeds comparable to 6B dense models[2].
  • โ€ขQwen3-30B-A3B-Thinking-2507 is a specialized variant refined over three months to enhance reasoning quality and depth, while Qwen3 Coder 30B A3B Instruct variant outputs at 25.6 tokens/second on Alibaba's API, ranking low in speed among similar open-weight models[6][3].
  • โ€ขNemotron-3-Nano-30B-A3B from Nvidia matches Qwen3-30B-A3B in local coding benchmarks but lags in speed for code generation tasks[5].
๐Ÿ“Š Competitor Analysisโ–ธ Show
ModelTotal ParamsActive ParamsKey BenchmarksSpeed (t/s)
Qwen3-30B-A3B30B3BArenaHard: 91.0, AIMEโ€™24/25: 80.425.6 (API) [3]
Nemotron-3-Nano-30B-A3B30B3BSimilar accuracy to GPT OSS 20B in coding evalsComparable, slightly slower [5]
Qwen3-235B-A22B235B22BOutperforms DeepSeek R1, GPT-4o in coding/mathFaster inference than giants [4]

๐Ÿ› ๏ธ Technical Deep Dive

  • โ€ขArchitecture: 30.5B total parameters, 3.3B active; 48 layers; 128 MoE experts with 8 activated by default per forward pass[8].
  • โ€ขContext: Up to 131K tokens input; modified A6B variant supports 32K + 8K output (40K total)[1][2].
  • โ€ขInference: Base A3B runs at reading speed locally; A6B-16-Extreme halves token/s speed but activates ~6B params for complex tasks; GPU inference 4x-8x faster than CPU[2][4].
  • โ€ขVariants: Qwen3 Coder 30B A3B Instruct scores 20/100 on Intelligence Index (above avg), verbose output (13M tokens vs median 5.6M)[3].

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

MoE expert scaling beyond A3B will remain niche due to speed trade-offs
Modified A6B configs halve inference speed without proportional benchmark gains, limiting adoption to specialized deep-reasoning use cases[2].
Qwen3-30B-A3B variants will dominate efficient local coding
Strong ArenaHard (91.0) and AIME scores combined with low active params outperform larger dense models at high local speeds[1][4].
Community finetunes like Thinking-2507 will drive iterative MoE improvements
Three months of scaling enhanced reasoning depth, showing viability of targeted post-release optimizations[6].

โณ Timeline

2025-07
Qwen3-30B-A3B-Thinking-2507 released with three months of reasoning scaling
2025-12
DavidAU releases Qwen3-30B-A6B-16-Extreme finetune increasing experts to 16
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—