MiniMax Drops Hybrid Attention for Full Attention

Why MiniMax ditched hybrid attention, plus new open-source models beating it on long context
30-Second TL;DR
What Changed
Hybrid attention matched full attention on saturated benchmarks like MMLU but showed clear deficits on complex, multi-hop reasoning at scale.
Why It Matters
The result highlights the need for evaluations that go beyond standard benchmarks, and it is likely to influence future LLM architecture choices.
What To Do Next
Test DeepSeek V3.2 on multi-hop reasoning benchmarks to compare with hybrid setups.
Deep Insight
Web-grounded analysis with 6 cited sources.
Enhanced Key Takeaways
- MiniMax-M2 switched to full attention after MiniMax found that hybrid attention models showed clear deficits on complex, multi-hop reasoning tasks at scale, even though they matched full attention on saturated benchmarks like MMLU and BBH[4]. The finding prompted MiniMax to develop internal proxy metrics that evaluate reasoning quality beyond public leaderboards[4].
- Infrastructure maturity strongly affects which attention mechanisms get adopted: linear and sparse attention implementations lack the mature infrastructure of full attention and need substantial groundwork before they deliver their promised efficiency gains in production systems[4].
- MiniMax-M2.5 diverged from the industry trend toward hybrid linear attention by adopting a classical architecture with only Grouped Query Attention (GQA), and it still achieves competitive performance as a smaller, more cost-effective alternative[5].
- Moonshot AI's Kimi Linear model claims to surpass full attention quality, but the claim rests on limited scale (~1.4T training tokens) and lacks well-tuned full-attention baselines, so the debate remains unresolved at production scale[1].
- MiniMax's experience exposes a fundamental evaluation problem: benchmarks act as 'leaky abstractions' that mask performance degradation in specific reasoning domains. This makes custom proxy metrics necessary and raises the question of whether those metrics stay correlated with real capability as models scale further[4].
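The source does not describe MiniMax's internal proxy metrics, so the following is only an illustrative sketch of the general idea: score a model separately for each reasoning depth instead of reporting one aggregate number. Everything here is hypothetical, including the synthetic "points to" task, the function names `make_chain_probe` and `accuracy_by_hops`, and the toy `one_hop_only` stand-in that plays the role of a model.

```python
import random
import re
from collections import defaultdict


def make_chain_probe(hops, rng, distractors=5):
    """Build one synthetic question that needs `hops` lookups to answer."""
    ids = rng.sample(range(1000), hops + 1 + 2 * distractors)
    names = [f"E{i}" for i in ids]
    chain, rest = names[: hops + 1], names[hops + 1:]
    facts = [f"{a} points to {b}." for a, b in zip(chain, chain[1:])]
    # Distractor facts link unrelated entities so the chain is not the only content.
    facts += [f"{a} points to {b}." for a, b in zip(rest[::2], rest[1::2])]
    rng.shuffle(facts)
    question = (
        f"Start at {chain[0]} and follow 'points to' {hops} times. "
        "Where do you end?"
    )
    return {
        "context": " ".join(facts),
        "question": question,
        "answer": chain[-1],
        "hops": hops,
    }


def accuracy_by_hops(probes, answer_fn):
    """Return exact-match accuracy keyed by hop count."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p in probes:
        totals[p["hops"]] += 1
        prediction = answer_fn(p["context"], p["question"]).strip()
        hits[p["hops"]] += int(prediction == p["answer"])
    return {h: hits[h] / totals[h] for h in sorted(totals)}


def one_hop_only(context, question):
    """Toy stand-in for a model (assumption): follows the first link and stops."""
    links = dict(re.findall(r"(E\d+) points to (E\d+)\.", context))
    start = re.search(r"Start at (E\d+)", question).group(1)
    return links[start]


if __name__ == "__main__":
    rng = random.Random(0)
    probes = [make_chain_probe(h, rng) for h in (1, 2, 3) for _ in range(20)]
    per_hop = accuracy_by_hops(probes, one_hop_only)
    overall = sum(per_hop.values()) / len(per_hop)
    print(per_hop, round(overall, 2))
```

In practice `answer_fn` would wrap a call to the model under test. The per-hop breakdown is the point of the design: a model that is perfect at one hop and fails at every deeper hop still posts a non-trivial average, and that gap is the kind of deficit an aggregate benchmark score can hide.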
Competitor Analysis
| Model | Architecture | Parameters | Context Window | Key Strength | Attention Type |
|---|---|---|---|---|---|
| MiniMax-M2 | MoE | ~230B total (~10B active) | 200k+ | Full attention for multi-hop reasoning | Full Attention |
| MiniMax-M1 | Hybrid MoE | 45.9B active / 456B total | 1M | Lightning attention efficiency (75% FLOPs savings vs. DeepSeek R1 at 100K tokens) | Hybrid (Lightning + Full) |
| Kimi Linear | Linear Attention | Not specified (trained on ~1.4T tokens) | Not specified | Hardware-efficient linear mechanism | Linear (Kimi Delta Attention, KDA) |
| Qwen3-Next-80B | Hybrid MoE | 80B total (3B active) | 262k+ | Math reasoning (87.8% on AIME25) | Hybrid |
| DeepSeek-R1 | Standard | Not specified | Not specified | RL-trained reasoning, reported as comparable to o1 | Not specified |
Technical Deep Dive
- MiniMax-M2.5 uses Grouped Query Attention (GQA) exclusively, without sliding window attention or other efficiency modifications, a deliberate choice of architectural simplicity over hybrid mechanisms[5].
- Qwen3-Next employs a 3:1 ratio of Gated DeltaNet blocks to Gated Attention blocks, mixing linear-attention-like efficiency with full attention's content-based retrieval precision[5]. MiniMax-M1's hybrid design instead interleaves Lightning Attention with full attention.
- Lightning Attention in MiniMax-M1 is reported to achieve 75% FLOPs savings compared to DeepSeek R1 at 100K-token lengths through optimized kernel implementations[2].
- MiniMax developed internal proxy metrics to evaluate the multi-hop reasoning deficits that public benchmarks failed to capture, though questions remain about how well those metrics correlate at larger scales[4].
- Kimi Linear's Kimi Delta Attention (KDA) mechanism refines the gated delta rule and ships with open-sourced KDA kernels and vLLM integration, but comparisons remain limited to ~1.4T training tokens[1].
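To make the trade-off behind these hybrid layouts concrete, here is a back-of-envelope sketch of a fixed-ratio layer schedule and the attention compute it implies. The cost formulas are simplifying assumptions, not figures from any of the cited models: roughly 4·n²·d FLOPs per full-attention layer for the score and value products, and roughly 4·n·d·d_head per linear-attention layer for the recurrent state update and readout, with projections, MLP/MoE blocks, and causal masking ignored. The names `layer_schedule`, `attention_flops`, and `savings_vs_full` are hypothetical, and the sketch does not try to reproduce the whole-model numbers quoted above.

```python
def layer_schedule(num_layers, linear_per_full):
    """Place one full-attention block after every `linear_per_full` linear blocks."""
    period = linear_per_full + 1
    return [
        "full" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


def attention_flops(schedule, seq_len, d_model, head_dim):
    """Rough token-mixing FLOPs for one forward pass over `seq_len` tokens."""
    # Assumption: QK^T plus attention-weighted V, summed over all heads.
    full_cost = 4 * seq_len ** 2 * d_model
    # Assumption: per-token state update plus readout of a head_dim x head_dim state.
    linear_cost = 4 * seq_len * d_model * head_dim
    return sum(full_cost if kind == "full" else linear_cost for kind in schedule)


def savings_vs_full(num_layers, linear_per_full, seq_len, d_model, head_dim):
    """Fraction of attention FLOPs saved relative to an all-full-attention stack."""
    hybrid = attention_flops(
        layer_schedule(num_layers, linear_per_full), seq_len, d_model, head_dim
    )
    dense = attention_flops(["full"] * num_layers, seq_len, d_model, head_dim)
    return 1 - hybrid / dense


if __name__ == "__main__":
    print(layer_schedule(8, 3))
    for n in (1_000, 10_000, 100_000):
        print(n, round(savings_vs_full(8, 3, n, d_model=4096, head_dim=128), 3))
```

Under these assumptions the ratio of full to linear cost per layer is n / d_head, so the savings are negligible at short contexts and approach the fraction of linear layers as the context grows. That is the efficiency argument for hybrids, and it says nothing about the reasoning-quality side of the trade-off that led MiniMax back to full attention.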
Sources (6)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- [1] kaitchup.substack.com: Minimax M2 and Kimi Linear Why Full
- [2] siliconflow.com: The Best Minimaxai Models in 2025
- [3] promnest.com: Mastering Minimax M25 the New Frontier of Agentic Prompting and Linear Efficiency
- [4] minimax.io: Why Did M2 End Up As a Full Attention Model
- [5] magazine.sebastianraschka.com: A Dream of Spring for Open Weight
- [6] clarifai.com: Top 10 Open Source Reasoning Models in 2026
Original source: Reddit r/LocalLLaMA