AI Updates Aggregator

🐯虎嗅•Apr 6, 2026Stalecollected in 49m

Gemma 4 Runs 40+ t/s on iPhone Locally

Post LinkedIn

🐯Read original on 虎嗅

#mobile-inference #on-device-ai #llm-deploymentgemma-4gemma-4 google-ai-edge-gallery mlx

💡40 t/s local Gemma 4 on phones kills cloud dependency for simple tasks

⚡ 30-Second TL;DR

What Changed

E2B (2.3B) and E4B (4.5B) models fit phones with 128K context

Why It Matters

Accelerates on-device AI adoption, pressuring cloud providers to specialize in complex tasks. End-side models handle daily queries, reshaping AI economics.

What To Do Next

Download Google AI Edge Gallery and benchmark Gemma 4 E4B on your smartphone.

Who should care:Developers & AI Engineers

Key Points

•E2B (2.3B) and E4B (4.5B) models fit phones with 128K context
•40+ tokens/sec on iPhone with Apple MLX optimization
•Google AI Edge Gallery app for one-tap model download/run
•26B MoE version struggles with agent tasks like tool calling
•Multimodal support for image/audio on mobile

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

•Google's implementation leverages the Apple Neural Engine (ANE) via the MLX framework, specifically utilizing 4-bit quantization to maintain high token throughput while minimizing memory footprint on mobile devices.
•The 128K context window is achieved through a combination of Grouped Query Attention (GQA) and a novel sliding-window attention mechanism that reduces KV cache memory overhead during inference.
•The AI Edge Gallery app utilizes a unified model format (GGUF-compatible) that allows for cross-platform portability, enabling the same model weights to run on both iOS and Android devices with minimal hardware-specific tuning.

📊 Competitor Analysis▸ Show

Feature	Google Gemma 4 (E4B)	Apple OpenELM (3B)	Meta Llama 3.2 (3B)
Architecture	Dense/MoE Hybrid	Sparse/Dense	Dense
Context Window	128K	4K	128K
Mobile Optimization	Native MLX/ANE	Native MLX	Via CoreML/ExecuTorch
License	Gemma Terms	Sample Code License	Llama 3.2 Community License

🛠️ Technical Deep Dive

Quantization Strategy: Employs 4-bit weight-only quantization (NF4) to fit the E4B model within the restricted VRAM/RAM shared environment of the iPhone 16 Pro/Max series.
Memory Management: Uses a custom KV cache eviction policy that dynamically compresses historical tokens when the 128K limit is approached, prioritizing recent context for agentic tasks.
Inference Engine: Built on MLX, which performs graph fusion to minimize kernel launches, significantly reducing latency for small-batch, high-frequency token generation.
Multimodal Pipeline: The vision encoder uses a lightweight projection layer that maps image embeddings directly into the LLM's latent space, bypassing the need for a separate heavy vision transformer during inference.

🔮 Future ImplicationsAI analysis grounded in cited sources

Cloud-based LLM providers will face significant revenue erosion in the 'personal assistant' segment.

Local execution removes the per-token cost and latency barriers that currently make cloud-based agents impractical for high-frequency, privacy-sensitive mobile interactions.

Mobile hardware specifications will shift from 'RAM capacity' to 'NPU TOPS' as the primary differentiator for AI performance.

As models like Gemma 4 become standard, the bottleneck for user experience will move from memory availability to the raw throughput of the Neural Processing Unit.

⏳ Timeline

2024-02

Google releases the original Gemma 1 family of open models.

2024-05

Google introduces Gemma 2 with improved architecture and performance.

2025-03

Google releases Gemma 3, focusing on native multimodal capabilities.

2026-02

Google launches Gemma 4, emphasizing extreme efficiency and mobile-first deployment.

🐯Read original article on 虎嗅

📰

Weekly AI Recap

Read this week's curated digest of top AI events →

👉Related Updates

Same topic

Explore #mobile-inference

Same product

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅 ↗

⚡ 30-Second TL;DR

Key Points

🧠 Deep Insight

🔑 Enhanced Key Takeaways

🛠️ Technical Deep Dive

🔮 Future ImplicationsAI analysis grounded in cited sources

⏳ Timeline

👉Related Updates

Deconstructing the Elon Musk Principles for neurodivergent growth

AI and IoT Tech in World Cup

Tacit Knowledge in Strategic Planning and Management

A Quantitative Method for Strategic Consensus Analysis