Gemma 4 broken on Unsloth and llama.cpp

💡Gemma 4 fails local runs—critical bug for offline LLM users

⚡ 30-Second TL;DR

What Changed

Fails to list typos correctly from articles locally

Why It Matters

Highlights compatibility issues in local inference setups, potentially delaying adoption of Gemma 4 for offline use until fixed.

What To Do Next

Test Gemma 4 on llama.cpp with a typo detection prompt from BBC articles to verify issues.

Who should care:Developers & AI Engineers

AI-generated analysis for this event.

•The issue appears linked to specific GGUF conversion artifacts in the latest llama.cpp release, where tensor mapping for MoE (Mixture of Experts) layers in Gemma 4 models is causing weight misalignment during inference.
•Community investigation suggests that the 'nonsensical output' is a result of the KV cache being incorrectly initialized for the 26B MoE architecture, leading to catastrophic attention score degradation.
•Unsloth maintainers have identified that the current quantization kernels for Gemma 4 are incompatible with the specific activation scaling factors used in the model's final normalization layer, necessitating a patch to the quantization pipeline.

📊 Competitor Analysis▸ Show

Feature	Gemma 4 (26B/31B)	Llama 3.3 (70B)	Mistral Large 3
Architecture	MoE / Dense	Dense	Dense
Local Support	High (Community)	Native (llama.cpp)	Native (llama.cpp)
Quantization	Unsloth/GGUF	Full GGUF/EXL2	Full GGUF/EXL2
Licensing	Google Gemma	Meta Llama 3	Apache 2.0

•Gemma 4 utilizes a modified RoPE (Rotary Positional Embedding) implementation that requires specific theta values (base frequency) which were not correctly mapped in the latest llama.cpp GGUF conversion scripts.
•The 26B MoE variant employs a top-k routing mechanism where the expert selection indices are being corrupted during the Q8_0 quantization process, causing the model to route to inactive or zero-initialized experts.
•The model architecture includes a unique 'Logit Soft-Capping' layer that, when quantized, suffers from precision loss, leading to the observed nonsensical output generation.