Gemma 4 Runs 40+ t/s on iPhone Locally

💡 40 t/s local Gemma 4 on phones kills cloud dependency for simple tasks
⚡ 30-Second TL;DR
What Changed
E2B (2.3B) and E4B (4.5B) models fit on phones with a 128K context window
Why It Matters
Accelerates on-device AI adoption, pressuring cloud providers to specialize in complex tasks. On-device models handle daily queries, reshaping AI economics.
What To Do Next
Download Google AI Edge Gallery and benchmark Gemma 4 E4B on your smartphone.
Who should care: Developers & AI Engineers
🧠 Deep Insight
AI-generated analysis for this event.
🔑 Enhanced Key Takeaways
- Google's implementation leverages the Apple Neural Engine (ANE) via the MLX framework, specifically utilizing 4-bit quantization to maintain high token throughput while minimizing memory footprint on mobile devices.
- The 128K context window is achieved through a combination of Grouped Query Attention (GQA) and a novel sliding-window attention mechanism that reduces KV cache memory overhead during inference.
- The AI Edge Gallery app utilizes a unified model format (GGUF-compatible) that allows for cross-platform portability, enabling the same model weights to run on both iOS and Android devices with minimal hardware-specific tuning.
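To make the KV-cache savings concrete, here is a back-of-envelope sizing sketch. All model dimensions below (layer count, head counts, window size, global/local layer ratio) are hypothetical illustration values, not published Gemma 4 specs; the point is only the relative scale of the savings from GQA and sliding-window attention.

```python
# Back-of-envelope KV-cache sizing. Every model dimension below is a
# HYPOTHETICAL illustration value, not a published Gemma 4 spec.

def kv_cache_bytes(layers, kv_heads, head_dim, cached_tokens, bytes_per_elem=2):
    """Size of the key+value cache: 2 tensors (K and V) per layer, fp16 default."""
    return 2 * layers * kv_heads * head_dim * cached_tokens * bytes_per_elem

LAYERS, HEAD_DIM = 30, 128
CONTEXT = 128 * 1024        # 128K-token context, fully cached
FULL_HEADS = 16             # plain multi-head attention: one KV head per query head
GQA_HEADS = 4               # grouped-query attention: query heads share 4 KV heads
WINDOW = 4 * 1024           # sliding-window layers cache only the last 4K tokens

mha_full = kv_cache_bytes(LAYERS, FULL_HEADS, HEAD_DIM, CONTEXT)
gqa_full = kv_cache_bytes(LAYERS, GQA_HEADS, HEAD_DIM, CONTEXT)
# Suppose 25 of the 30 layers use the sliding window and 5 attend globally.
gqa_mixed = (kv_cache_bytes(25, GQA_HEADS, HEAD_DIM, WINDOW)
             + kv_cache_bytes(5, GQA_HEADS, HEAD_DIM, CONTEXT))

for name, b in [("MHA, global", mha_full),
                ("GQA, global", gqa_full),
                ("GQA + sliding window", gqa_mixed)]:
    print(f"{name:22s} {b / 2**30:.2f} GiB")
```

Under these toy numbers, GQA alone cuts the 128K cache by 4x, and interleaving sliding-window layers shrinks it by another ~5x, which is what makes a 128K context plausible inside a phone's memory budget.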
📊 Competitor Analysis
| Feature | Google Gemma 4 (E4B) | Apple OpenELM (3B) | Meta Llama 3.2 (3B) |
|---|---|---|---|
| Architecture | Dense/MoE Hybrid | Sparse/Dense | Dense |
| Context Window | 128K | 4K | 128K |
| Mobile Optimization | Native MLX/ANE | Native MLX | Via CoreML/ExecuTorch |
| License | Gemma Terms | Sample Code License | Llama 3.2 Community License |
🛠️ Technical Deep Dive
- Quantization Strategy: Employs 4-bit weight-only quantization (NF4) to fit the E4B model within the unified memory shared by the CPU, GPU, and Neural Engine on the iPhone 16 Pro/Max series.
- Memory Management: Uses a custom KV cache eviction policy that dynamically compresses historical tokens when the 128K limit is approached, prioritizing recent context for agentic tasks.
- Inference Engine: Built on MLX, which performs graph fusion to minimize kernel launches, significantly reducing latency for small-batch, high-frequency token generation.
- Multimodal Pipeline: The vision encoder uses a lightweight projection layer that maps image embeddings directly into the LLM's latent space, bypassing the need for a separate heavy vision transformer during inference.
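The 4-bit weight-only idea above can be illustrated with a simplified per-group symmetric quantizer. This is a stand-in for NF4, whose 16 levels are non-uniform (spaced for normally distributed weights), and it is not Google's implementation; the memory accounting is the same either way, since each weight collapses to a 4-bit code plus a shared per-group scale.

```python
# Simplified 4-bit weight-only quantization: per-group absmax scaling to
# integer levels in [-8, 7]. A stand-in for NF4, not Google's implementation.

def quantize_4bit(weights, group_size=32):
    """Quantize a flat list of floats to (int4 codes, per-group fp scales)."""
    codes, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid div-by-zero groups
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_4bit(codes, scales, group_size=32):
    """Reconstruct approximate fp weights from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

w = [0.12, -0.5, 0.03, 0.44, -0.27, 0.31, -0.08, 0.5]
codes, scales = quantize_4bit(w, group_size=8)
w_hat = dequantize_4bit(codes, scales, group_size=8)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(codes)   # each code fits in 4 bits; two codes pack into one byte on disk
print(f"max reconstruction error: {max_err:.4f}")
```

Packing two 4-bit codes per byte yields roughly a 4x memory reduction versus fp16 weights (minus the small per-group scale overhead), which is the margin that lets a ~4.5B-parameter model like E4B fit in a phone's shared memory.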
🔮 Future Implications
AI analysis grounded in cited sources.
Cloud-based LLM providers will face significant revenue erosion in the 'personal assistant' segment.
Local execution removes the per-token cost and latency barriers that currently make cloud-based agents impractical for high-frequency, privacy-sensitive mobile interactions.
Mobile hardware specifications will shift from 'RAM capacity' to 'NPU TOPS' as the primary differentiator for AI performance.
As models like Gemma 4 become standard, the bottleneck for user experience will move from memory availability to the raw throughput of the Neural Processing Unit.
⏳ Timeline
2024-02
Google releases the original Gemma 1 family of open models.
2024-05
Google introduces Gemma 2 with improved architecture and performance.
2025-03
Google releases Gemma 3, focusing on native multimodal capabilities.
2026-02
Google launches Gemma 4, emphasizing extreme efficiency and mobile-first deployment.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅 ↗