🦙Reddit r/LocalLLaMA•Stalecollected in 74m
Gemma 4 broken on Unsloth and llama.cpp

💡Gemma 4 fails local runs—critical bug for offline LLM users
⚡ 30-Second TL;DR
What Changed
Fails to list typos correctly from articles locally
Why It Matters
Highlights compatibility issues in local inference setups, potentially delaying adoption of Gemma 4 for offline use until fixed.
What To Do Next
Test Gemma 4 on llama.cpp with a typo detection prompt from BBC articles to verify issues.
Who should care:Developers & AI Engineers
🧠 Deep Insight
AI-generated analysis for this event.
🔑 Enhanced Key Takeaways
- •The issue appears linked to specific GGUF conversion artifacts in the latest llama.cpp release, where tensor mapping for MoE (Mixture of Experts) layers in Gemma 4 models is causing weight misalignment during inference.
- •Community investigation suggests that the 'nonsensical output' is a result of the KV cache being incorrectly initialized for the 26B MoE architecture, leading to catastrophic attention score degradation.
- •Unsloth maintainers have identified that the current quantization kernels for Gemma 4 are incompatible with the specific activation scaling factors used in the model's final normalization layer, necessitating a patch to the quantization pipeline.
📊 Competitor Analysis▸ Show
| Feature | Gemma 4 (26B/31B) | Llama 3.3 (70B) | Mistral Large 3 |
|---|---|---|---|
| Architecture | MoE / Dense | Dense | Dense |
| Local Support | High (Community) | Native (llama.cpp) | Native (llama.cpp) |
| Quantization | Unsloth/GGUF | Full GGUF/EXL2 | Full GGUF/EXL2 |
| Licensing | Google Gemma | Meta Llama 3 | Apache 2.0 |
🛠️ Technical Deep Dive
- •Gemma 4 utilizes a modified RoPE (Rotary Positional Embedding) implementation that requires specific theta values (base frequency) which were not correctly mapped in the latest llama.cpp GGUF conversion scripts.
- •The 26B MoE variant employs a top-k routing mechanism where the expert selection indices are being corrupted during the Q8_0 quantization process, causing the model to route to inactive or zero-initialized experts.
- •The model architecture includes a unique 'Logit Soft-Capping' layer that, when quantized, suffers from precision loss, leading to the observed nonsensical output generation.
🔮 Future ImplicationsAI analysis grounded in cited sources
Llama.cpp will release a hotfix for MoE tensor mapping within 72 hours.
The high volume of community bug reports on GitHub and Reddit regarding Gemma 4 has triggered an active investigation by core maintainers.
Unsloth will update its quantization export pipeline to include explicit support for Gemma 4's logit soft-capping.
The current failure to handle the model's specific normalization layers necessitates a change in the export logic to prevent output degradation.
⏳ Timeline
2026-03
Google releases Gemma 4 series, including 26B MoE and 31B dense models.
2026-03
Unsloth adds initial support for fine-tuning Gemma 4 models.
2026-04
Users report widespread inference failures for quantized Gemma 4 models on local hardware.
📰 Event Coverage
📰
Weekly AI Recap
Read this week's curated digest of top AI events →
👉Related Updates
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗

