🦙Stalecollected in 74m

Gemma 4 broken on Unsloth and llama.cpp

Gemma 4 broken on Unsloth and llama.cpp
PostLinkedIn
🦙Read original on Reddit r/LocalLLaMA

💡Gemma 4 fails local runs—critical bug for offline LLM users

⚡ 30-Second TL;DR

What Changed

Fails to list typos correctly from articles locally

Why It Matters

Highlights compatibility issues in local inference setups, potentially delaying adoption of Gemma 4 for offline use until fixed.

What To Do Next

Test Gemma 4 on llama.cpp with a typo detection prompt from BBC articles to verify issues.

Who should care:Developers & AI Engineers

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • The issue appears linked to specific GGUF conversion artifacts in the latest llama.cpp release, where tensor mapping for MoE (Mixture of Experts) layers in Gemma 4 models is causing weight misalignment during inference.
  • Community investigation suggests that the 'nonsensical output' is a result of the KV cache being incorrectly initialized for the 26B MoE architecture, leading to catastrophic attention score degradation.
  • Unsloth maintainers have identified that the current quantization kernels for Gemma 4 are incompatible with the specific activation scaling factors used in the model's final normalization layer, necessitating a patch to the quantization pipeline.
📊 Competitor Analysis▸ Show
FeatureGemma 4 (26B/31B)Llama 3.3 (70B)Mistral Large 3
ArchitectureMoE / DenseDenseDense
Local SupportHigh (Community)Native (llama.cpp)Native (llama.cpp)
QuantizationUnsloth/GGUFFull GGUF/EXL2Full GGUF/EXL2
LicensingGoogle GemmaMeta Llama 3Apache 2.0

🛠️ Technical Deep Dive

  • Gemma 4 utilizes a modified RoPE (Rotary Positional Embedding) implementation that requires specific theta values (base frequency) which were not correctly mapped in the latest llama.cpp GGUF conversion scripts.
  • The 26B MoE variant employs a top-k routing mechanism where the expert selection indices are being corrupted during the Q8_0 quantization process, causing the model to route to inactive or zero-initialized experts.
  • The model architecture includes a unique 'Logit Soft-Capping' layer that, when quantized, suffers from precision loss, leading to the observed nonsensical output generation.

🔮 Future ImplicationsAI analysis grounded in cited sources

Llama.cpp will release a hotfix for MoE tensor mapping within 72 hours.
The high volume of community bug reports on GitHub and Reddit regarding Gemma 4 has triggered an active investigation by core maintainers.
Unsloth will update its quantization export pipeline to include explicit support for Gemma 4's logit soft-capping.
The current failure to handle the model's specific normalization layers necessitates a change in the export logic to prevent output degradation.

Timeline

2026-03
Google releases Gemma 4 series, including 26B MoE and 31B dense models.
2026-03
Unsloth adds initial support for fine-tuning Gemma 4 models.
2026-04
Users report widespread inference failures for quantized Gemma 4 models on local hardware.

📰 Event Coverage

📰

Weekly AI Recap

Read this week's curated digest of top AI events →

👉Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA