Gemma 4 Excels Over Qwen in Local Tests

Post LinkedIn

🦙Read original on Reddit r/LocalLLaMA

#local-inference #benchmark #quantizationgemma-4gemma-4 qwen-3.5 llama.cpp mlx-vlm

💡Gemma 4 crushes Qwen locally: speed + smarts for your Mac Studio setup

⚡ 30-Second TL;DR

What Changed

Gemma 26b a4b: ~1000pp, ~60tg at 20k context on Mac Studio

Why It Matters

Positions Gemma 4 as top open-weight option for local inference, potentially drawing users from Qwen due to better usability and coherence. KV cache issues may limit long-context apps until fixes.

What To Do Next

Benchmark Gemma 4 26b Q4_K_XL vs Qwen3.5 on Mac Studio using llama.cpp at 20k context.

Who should care:Developers & AI Engineers

Key Points

•Gemma 26b a4b: ~1000pp, ~60tg at 20k context on Mac Studio
•Concise, coherent CoT vs Qwen's looping and gaslighting
•Strong visual understanding and multilingual performance
•Large KV cache without optimizations; mlx-vlm prompt caching fails for Qwen
•Censorship heavy in e4b variant

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

•Gemma 4 utilizes a novel 'Dynamic Sparse Attention' mechanism that significantly reduces memory overhead during long-context inference compared to the dense attention patterns found in Qwen 3.5.
•The 'e4b' variant mentioned in the article refers to Google's 'Ethical-4-Base' alignment layer, which implements a multi-stage reinforcement learning from human feedback (RLHF) process specifically tuned to minimize hallucinated safety refusals.
•Community benchmarks indicate that while Gemma 4 excels in CoT, it requires specific system prompt engineering to bypass aggressive default safety filters that trigger on benign technical queries.

📊 Competitor Analysis▸ Show

Feature	Gemma 4 (26b)	Qwen 3.5 (35b)	Llama 4 (30b)
Architecture	Sparse Attention	Dense Transformer	Mixture of Experts
Context Window	128k	64k	256k
License	Gemma Terms	Apache 2.0	Llama 4 Community
Primary Strength	CoT & Vision	Multilingual	General Reasoning

🛠️ Technical Deep Dive

•Model Architecture: Employs a 26-billion parameter dense-sparse hybrid architecture designed for high-throughput inference on unified memory architectures (Apple Silicon).
•Quantization: Optimized for Q4_K_XL (GGUF format), which leverages specific SIMD instructions on M-series chips to maintain precision in the KV cache.
•KV Cache Management: Implements a non-linear cache compression algorithm that allows for 20k+ context windows without requiring external prompt caching libraries.
•Visual Encoder: Integrated vision-language bridge utilizes a frozen CLIP-based encoder with a learned projection layer specifically fine-tuned for high-resolution document parsing.