Is the 120B MoE model family dying?

🔑 Enhanced Key Takeaways

•OpenAI released gpt-oss-120b (approximately 117B total parameters, ~5.1B active) in August 2025, demonstrating continued development in the 120B-class Mixture-of-Experts (MoE) segment, contrary to the article's observation of a lack of recent updates.
•MoE architectures achieve high total parameter counts (e.g., 46.7B for Mixtral 8x7B) while maintaining significantly lower active parameter counts (~12.9B for Mixtral) for efficient inference, often outperforming dense models of similar active parameter size.
•Beyond the 200B+ range, ultra-large MoE models like DeepSeek-V3 (671B total parameters, 37B active) and Zhipu GLM-5.1 (744B total parameters, 40B active) have emerged, pushing the frontier of model capacity while maintaining inference efficiency through sparsity.
•The shift towards smaller models (25B-35B) is significantly driven by the demand for cost-effective, scalable, and often multimodal solutions for specific enterprise applications and local deployments, where smaller models offer better control and lower latency.
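The total-versus-active gap in the takeaways above is easier to compare as a ratio. The following is a minimal sketch that uses only the parameter counts quoted in this article; the function names `active_fraction` and `summarize` are illustrative and do not come from any library.

```python
# Parameter counts in billions, copied from the figures quoted in this article.
MODELS = {
    "Mixtral 8x7B": (46.7, 12.9),
    "gpt-oss-120b": (117.0, 5.1),
    "DeepSeek-V3": (671.0, 37.0),
    "GLM-5.1": (744.0, 40.0),
}


def active_fraction(total_b, active_b):
    """Share of the model's parameters used for a single token."""
    return active_b / total_b


def summarize(models):
    """Return (name, total, active, percent active) rows, sparsest model first."""
    rows = [
        (name, total, active, 100.0 * active_fraction(total, active))
        for name, (total, active) in models.items()
    ]
    return sorted(rows, key=lambda row: row[3])


if __name__ == "__main__":
    for name, total, active, pct in summarize(MODELS):
        print(f"{name:<14} total={total:>6.1f}B active={active:>5.1f}B ({pct:.1f}% active)")
```

On these figures gpt-oss-120b is the sparsest of the four (roughly 4.4% of parameters active per token), while Mixtral 8x7B activates about 27.6%.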

🛠️ Technical Deep Dive

Core Mechanism: Mixture-of-Experts (MoE) models utilize a 'gating network' or 'router' to selectively activate a subset of 'experts' (specialized neural subnetworks, typically Feed-Forward Networks) for each input token.
Sparsity and Efficiency: This conditional computation allows for a massive increase in total parameters without a proportional increase in computations per token, effectively decoupling model capacity from per-token compute cost.
Inference Economics: MoE models are generally cheaper to serve than dense models of the same total parameter count, mainly because only a fraction of the parameters is active for each token. Their tendency to be shallower and wider may also reduce network communication overhead.
Parameter Scaling Examples: Mixtral 8x7B, for instance, has approximately 46.7 billion total parameters, but with 8 experts and 2 active per token, it utilizes around 12.9 billion active parameters during inference. OpenAI's gpt-oss-120b has ~117 billion total parameters and ~5.1 billion active parameters per forward pass, employing 128 experts with 4 active per token. DeepSeek-V3 features 671 billion total parameters with 37 billion activated per token.
Training Challenges: MoE models can be more challenging to train due to issues like 'expert collapse,' where only a few experts become dominant. Solutions include auxiliary losses and capacity caps.
Memory Requirements: Despite sparse activation, MoE models often require high VRAM because all experts must be loaded into memory, even if only a subset is active for a given token.
Quantization for Deployment: To facilitate deployment on more constrained hardware, models like gpt-oss-120b leverage MXFP4 quantization for the linear projection weights within their MoE layers, enabling operation on a single 80GB GPU.
Advanced Architectures: The DeepSeekMoE architecture introduces strategies like fine-grained expert segmentation and shared experts to enhance specialization and mitigate redundancy among experts.
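The routing step described under Core Mechanism can be sketched in a few lines of plain Python. This is an illustrative toy, not the implementation of any model named here: it assumes the router applies a softmax over only the top-k logits (some models instead apply the softmax over all experts before selecting), and `route_top_k`, `moe_layer`, and the toy experts are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, router_logits, experts, k=2):
    """Combine the outputs of only the selected experts for one token."""
    output = [0.0] * len(token)
    for index, weight in route_top_k(router_logits, k):
        expert_out = experts[index](token)  # unselected experts are never called
        output = [acc + weight * value for acc, value in zip(output, expert_out)]
    return output


if __name__ == "__main__":
    # Hypothetical toy experts: each one just scales the token by a constant.
    experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
    token = [1.0, -1.0]
    # In a real model these logits come from a learned linear layer.
    logits = [0.1, 2.0, -1.0, 2.0]
    print(route_top_k(logits, k=2))
    print(moe_layer(token, logits, experts, k=2))
```

Only two of the four toy experts run for this token, which is the sense in which compute per token is decoupled from total capacity. All four experts still have to exist in memory, matching the Memory Requirements point above.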

🔮 Future Implications (AI analysis grounded in cited sources)

The development of MoE architectures will continue to focus on optimizing the balance between total and active parameters.

The efficiency gains of MoE models are intrinsically linked to this balance, allowing for increased model capacity without prohibitive inference costs, which is crucial for continued scaling.
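The balance has two sides: weight memory scales with total parameters, while per-token compute scales with active parameters. The sketch below makes that concrete under loud assumptions: a single flat bit width applied to every parameter (real schemes such as MXFP4 add scaling-factor overhead and may quantize only some layers), no activation or KV-cache memory, and the common rule of thumb of about 2 FLOPs per active parameter per token. Both function names are hypothetical.

```python
def weight_memory_gb(total_params_b, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    total_params_b is in billions; every parameter is assumed to use the
    same bit width, which is a simplification.
    """
    return total_params_b * bits_per_param / 8


def flops_per_token_g(active_params_b):
    """Rule-of-thumb forward-pass cost in GFLOPs: ~2 FLOPs per active parameter."""
    return 2.0 * active_params_b


if __name__ == "__main__":
    # Figures for gpt-oss-120b as quoted in this article (billions of parameters).
    total_b, active_b = 117.0, 5.1
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weight_memory_gb(total_b, bits):.1f} GB")
    print(f"compute: ~{flops_per_token_g(active_b):.1f} GFLOPs per token")
```

Under these assumptions the quoted gpt-oss-120b figures give roughly 234 GB of weights at 16 bits and about 58.5 GB at 4 bits, while per-token compute stays near 10 GFLOPs regardless of bit width. This is a rough estimate, not a measurement.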

Hybrid model architectures combining dense and sparse (MoE) layers will become more prevalent.

Hybrid approaches, such as Snowflake Arctic (which combines a 10B dense transformer with a residual 128x3.66B MoE component), aim to leverage the strengths of both, offering good training efficiency and enhanced capabilities at larger parameter counts.

Specialized, smaller MoE models will gain traction for edge and domain-specific applications.

Rising demand for cost-effective, low-latency, and customizable AI solutions for specific tasks is driving the adoption of smaller models, and MoE can be applied at these smaller scales to keep models modular.

⏳ Timeline

1991

Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton introduce 'Adaptive Mixtures of Local Experts,' laying the theoretical foundation for MoE.

2017

Google introduces the Sparsely-Gated Mixture-of-Experts layer, enabling MoE application in large-scale deep learning.

2023-12

Mistral AI releases Mixtral 8x7B, a prominent open-source MoE model with ~46.7B total parameters and ~12.9B active parameters.

2024-05

DeepSeek V2 is introduced, featuring a 236B-parameter MoE architecture with 21B activated parameters per token.

2024-12

DeepSeek-V3, a 671B-parameter MoE model (37B active), is released, setting new benchmarks for open-source models.

2025-08

OpenAI releases `gpt-oss-120b`, an open-weight MoE LLM with ~117B total parameters and ~5.1B active parameters.

2026-04

Zhipu AI releases GLM-5.1, a 744B-parameter MoE model with 40B active parameters, under an MIT license.
