MiniMax M2.7 GGUF NaN Fixes and Benchmarks

💡Fixes NaNs in MiniMax GGUF quants + benchmarks for stable local runs

⚡ 30-Second TL;DR

What Changed

NaNs in 21-38% GGUF quants from blk.61.ffn_down_exps overflows

Why It Matters

Enhances reliability for local inference of MiniMax-M2.7, critical for eval and deployment in llama.cpp.

What To Do Next

Download fixed quants from unsloth/MiniMax-M2.7-GGUF for NaN-free evals.

Who should care:Developers & AI Engineers

AI-generated analysis for this event.

•The NaN issue stems from the MiniMax-M2.7 model's use of a specific Mixture-of-Experts (MoE) architecture that utilizes high-precision activations, which overflow when quantized using standard llama.cpp K-quants.
•The CUDA 13.2 incompatibility is linked to a regression in the cuBLAS library's handling of sub-8-bit integer matrix multiplication kernels, specifically affecting models with non-standard expert routing.
•Community efforts have identified that disabling 'expert-parallel' optimizations in llama.cpp during the quantization process mitigates the overflow risk, even without full re-quantization.

•Model Architecture: MiniMax-M2.7 utilizes a sparse MoE structure with 2.7 billion active parameters, featuring a unique 'expert-down-projection' layer that is highly sensitive to quantization noise.
•Overflow Mechanism: The 'blk.61.ffn_down_exps' layer exhibits extreme activation values during inference, exceeding the dynamic range of Q4_K_S and Q5_K_M quantization schemes.
•Quantization Mitigation: The fix involves forcing the specific problematic layers to remain in FP16 or using a higher-precision I-quant (e.g., IQ4_XS) to preserve the activation distribution.

Standardized quantization pipelines will require MoE-aware calibration.

The prevalence of NaN issues in MoE models suggests that generic quantization methods are insufficient for complex expert-routing architectures.

Llama.cpp will implement automated overflow detection in quantization tools.

The high failure rate (21-38%) in this specific model necessitates a pre-quantization validation step to prevent broken model releases.

2026-03

MiniMax-M2.7 model release and initial community adoption.

2026-04

Discovery of NaN errors in GGUF quantizations by the Unsloth team.

2026-04

Release of patched MiniMax-M2.7 GGUF quants and identification of CUDA 13.2 regressions.

Weekly AI Recap

Read this week's curated digest of top AI events →

Same topic

Explore #quantization

Same product