🐯Freshcollected in 49m

Gemma 4 Runs 40+ t/s on iPhone Locally

Gemma 4 Runs 40+ t/s on iPhone Locally
PostLinkedIn
🐯Read original on 虎嗅

💡40 t/s local Gemma 4 on phones kills cloud dependency for simple tasks

⚡ 30-Second TL;DR

What Changed

E2B (2.3B) and E4B (4.5B) models fit phones with 128K context

Why It Matters

Accelerates on-device AI adoption, pressuring cloud providers to specialize in complex tasks. End-side models handle daily queries, reshaping AI economics.

What To Do Next

Download Google AI Edge Gallery and benchmark Gemma 4 E4B on your smartphone.

Who should care:Developers & AI Engineers

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • Google's implementation leverages the Apple Neural Engine (ANE) via the MLX framework, specifically utilizing 4-bit quantization to maintain high token throughput while minimizing memory footprint on mobile devices.
  • The 128K context window is achieved through a combination of Grouped Query Attention (GQA) and a novel sliding-window attention mechanism that reduces KV cache memory overhead during inference.
  • The AI Edge Gallery app utilizes a unified model format (GGUF-compatible) that allows for cross-platform portability, enabling the same model weights to run on both iOS and Android devices with minimal hardware-specific tuning.
📊 Competitor Analysis▸ Show
FeatureGoogle Gemma 4 (E4B)Apple OpenELM (3B)Meta Llama 3.2 (3B)
ArchitectureDense/MoE HybridSparse/DenseDense
Context Window128K4K128K
Mobile OptimizationNative MLX/ANENative MLXVia CoreML/ExecuTorch
LicenseGemma TermsSample Code LicenseLlama 3.2 Community License

🛠️ Technical Deep Dive

  • Quantization Strategy: Employs 4-bit weight-only quantization (NF4) to fit the E4B model within the restricted VRAM/RAM shared environment of the iPhone 16 Pro/Max series.
  • Memory Management: Uses a custom KV cache eviction policy that dynamically compresses historical tokens when the 128K limit is approached, prioritizing recent context for agentic tasks.
  • Inference Engine: Built on MLX, which performs graph fusion to minimize kernel launches, significantly reducing latency for small-batch, high-frequency token generation.
  • Multimodal Pipeline: The vision encoder uses a lightweight projection layer that maps image embeddings directly into the LLM's latent space, bypassing the need for a separate heavy vision transformer during inference.

🔮 Future ImplicationsAI analysis grounded in cited sources

Cloud-based LLM providers will face significant revenue erosion in the 'personal assistant' segment.
Local execution removes the per-token cost and latency barriers that currently make cloud-based agents impractical for high-frequency, privacy-sensitive mobile interactions.
Mobile hardware specifications will shift from 'RAM capacity' to 'NPU TOPS' as the primary differentiator for AI performance.
As models like Gemma 4 become standard, the bottleneck for user experience will move from memory availability to the raw throughput of the Neural Processing Unit.

Timeline

2024-02
Google releases the original Gemma 1 family of open models.
2024-05
Google introduces Gemma 2 with improved architecture and performance.
2025-03
Google releases Gemma 3, focusing on native multimodal capabilities.
2026-02
Google launches Gemma 4, emphasizing extreme efficiency and mobile-first deployment.
📰

Weekly AI Recap

Read this week's curated digest of top AI events →

👉Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅