
How Does the 397B Qwen3.5 Hit 1T-Class Performance?

🤖Read original on Reddit r/MachineLearning

💡 A 397B model at 1T-class performance? Architecture and data tricks are the key to efficient LLM scaling.

⚡ 30-Second TL;DR

What Changed

A 397B-parameter MoE model matches the performance of trillion-parameter models.

Why It Matters

Highlights potential breakthroughs in efficient large model inference, relevant for high-throughput LLM deployments.

What To Do Next

Review the Qwen3.5-397B-A17B benchmarks on Hugging Face to compare inference speeds.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 6 cited sources.

🔑 Enhanced Key Takeaways

  • Qwen3.5-397B-A17B uses a Hybrid Mixture-of-Experts (MoE) architecture with 397 billion total parameters but only 17 billion active per token, enabling 1T-level performance through efficiency gains[1][2].
  • The model achieves 19x faster decoding on long-context tasks (256k tokens) and 8.6x faster for standard workflows compared to Qwen3-Max, while matching its reasoning and coding capabilities[1].
  • FP8 precision reduces memory usage by 50% and boosts speeds by over 10% at trillion-token scale, combined with high-quality visual-text data filtering to rival larger 1T-parameter models[1].
  • Features native multimodality with early fusion vision-language training, supporting chat, RAG, vision-language understanding, video understanding, and agentic workflows[2][5].
  • Positioned as competitive with top models like Gemini 3 Pro and Claude Opus, with strong benchmark performance but not claiming SOTA in coding[3][4].
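The 50% memory figure claimed for FP8 follows from simple byte arithmetic. A back-of-envelope sketch, assuming a BF16 (2 bytes/parameter) baseline and counting only raw weight storage, ignoring KV cache, activations, and runtime overhead:

```python
# Back-of-envelope weight-memory estimate for a 397B-parameter model.
# Illustrative arithmetic only, not a measured deployment footprint.

TOTAL_PARAMS = 397e9  # Qwen3.5-397B-A17B total parameter count

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes here)."""
    return num_params * bytes_per_param / 1e9

bf16 = weight_memory_gb(TOTAL_PARAMS, 2.0)  # 16-bit baseline
fp8 = weight_memory_gb(TOTAL_PARAMS, 1.0)   # 8-bit quantized weights

print(f"BF16: {bf16:.0f} GB, FP8: {fp8:.0f} GB, saving: {1 - fp8 / bf16:.0%}")
```

Halving bytes per parameter halves weight memory, which is where the headline 50% reduction comes from; the additional ~10% speed gain would come from reduced memory bandwidth pressure during decoding.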
📊 Competitor Analysis

| Feature/Benchmark | Qwen3.5-397B-A17B | Qwen3-Max | Qwen3-Next-80B-A3B |
|---|---|---|---|
| Total Parameters | 397B (17B active) | >1T | 80B (3B active) |
| Speed (vs Qwen3-Max) | 19x faster (long-context) | Baseline | N/A |
| Benchmarks | Matches reasoning/coding; outperforms Qwen3-VL | Strong baseline | Outperformed in 10 benchmarks (e.g., GPQA, LiveCodeBench)[1][3] |
| Context Length | 262k native (up to 1M) | N/A | N/A |
| Pricing | Cost-efficient (50% less memory) | Higher | N/A |
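The parameter counts above explain the cost gap: per-token forward-pass compute scales with *active* parameters, not total. A rough illustrative comparison, using the standard ~2 FLOPs per active parameter per token approximation (the 1T dense figure is a hypothetical comparison point, not a specific model):

```python
# Rough per-token forward-pass compute: ~2 FLOPs per active parameter.
# Illustrative estimate; real throughput also depends on memory bandwidth,
# batching, and attention cost.

def flops_per_token(active_params: float) -> float:
    return 2.0 * active_params

moe_active = 17e9    # Qwen3.5-397B-A17B: 17B parameters active per token
dense_1t = 1000e9    # hypothetical 1T dense model: all parameters active

ratio = flops_per_token(dense_1t) / flops_per_token(moe_active)
print(f"~{ratio:.0f}x less per-token compute than a 1T dense model")
```

This is why a sparse 397B model can undercut trillion-parameter dense models on serving cost while competing on quality.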

🛠️ Technical Deep Dive

  • Architecture: Hybrid MoE with 512 total experts (10 routed + 1 shared per token); 60 layers; hidden dimension 4,096; layout: 15 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))[2].
  • Attention: Gated DeltaNet (64 linear heads for V, 16 for QK, head dim 128); Gated Attention (32 heads for Q, 2 for KV, head dim 256; RoPE dim 64)[2].
  • MoE Details: Expert intermediate dimension 1,024; vocabulary 248,320; input context 262,144 tokens (extensible to 1,010,000 via YaRN)[2].
  • Multimodal: Early fusion vision-language training; supports text/video inputs; operates in thinking mode with reasoning details[2][6].
  • Optimizations: FP8 pipeline for 50% memory reduction; NVIDIA GPU-optimized for faster inference[1][2].
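The routed-expert scheme described above (10 of 512 experts selected per token, plus 1 always-active shared expert) can be sketched as generic top-k MoE gating. This is an illustrative NumPy sketch, not Qwen's actual implementation: the hidden dimension is shrunk to a toy size, and each expert is stubbed as a single linear layer rather than a gated MLP with intermediate dimension 1,024.

```python
import numpy as np

rng = np.random.default_rng(0)

# Expert count and top-k match the model card; HIDDEN is a toy stand-in
# for the real hidden dimension of 4,096.
HIDDEN, NUM_EXPERTS, TOP_K = 64, 512, 10

router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02
expert_w = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN)) * 0.02
shared_w = rng.standard_normal((HIDDEN, HIDDEN)) * 0.02  # always-on expert

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token to its top-k experts plus the shared expert."""
    logits = x @ router_w                    # router score per expert
    top = np.argsort(logits)[-TOP_K:]        # indices of the 10 best experts
    weights = softmax(logits[top])           # renormalize over chosen experts
    routed = sum(w * (x @ expert_w[i]) for w, i in zip(weights, top))
    return routed + x @ shared_w             # shared expert always contributes

token = rng.standard_normal(HIDDEN)
out = moe_forward(token)
print(out.shape)
```

Only the 10 routed experts plus the shared one are evaluated per token, which is how 397B total parameters reduce to ~17B active per forward pass.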

🔮 Future Implications

AI analysis grounded in cited sources.

Qwen3.5-397B-A17B demonstrates that MoE efficiency can deliver 1T-scale performance from a sub-400B model, lowering costs and enabling broader deployment of multimodal agents. It also accelerates Chinese open-model competition, pressuring labs like DeepSeek toward a v4 refresh while advancing native spatial intelligence and agentic workflows[1][4].

Timeline

2026-02
Qwen3.5-397B-A17B released as efficient MoE multimodal model matching Qwen3-Max performance with 19x speed gains[1][2][4]
📰 Weekly AI Recap

Read this week's curated digest of top AI events →


AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning