Qwen3.5 9B runs at 6+ t/s on Android phones

💡 A 9B LLM reaches 6+ t/s on a high-end Android phone

⚡ 30-Second TL;DR

What Changed

The 9B model runs at 6+ t/s with q4_0 quantization on a Galaxy S25 Ultra with 12GB RAM.

Why It Matters

Shows that 9B-parameter LLMs are feasible on high-end Android phones, enabling edge AI apps without cloud dependency.

What To Do Next

Quantize Qwen3.5 9B to q4_0 and benchmark it on your own Android device with llama.cpp.
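
As a rough illustration of that step, the Python sketch below only assembles and prints the two command lines; it does not run them. It assumes llama.cpp's `llama-quantize` and `llama-bench` tools are already built and on the PATH. The file names and the thread count are hypothetical placeholders, not values from the original post.

```python
import shlex

# Hypothetical file names, used only for illustration.
F16_MODEL = "model-9b-f16.gguf"
Q4_MODEL = "model-9b-q4_0.gguf"


def quantize_cmd(src_gguf, dst_gguf, qtype="Q4_0", binary="llama-quantize"):
    """Build the argument list: <binary> <input.gguf> <output.gguf> <type>."""
    return [binary, src_gguf, dst_gguf, qtype]


def bench_cmd(model_gguf, prompt_tokens=512, gen_tokens=128, threads=4,
              binary="llama-bench"):
    """Build a benchmark call: -p prompt tokens, -n generated tokens, -t threads."""
    return [
        binary,
        "-m", model_gguf,
        "-p", str(prompt_tokens),
        "-n", str(gen_tokens),
        "-t", str(threads),  # assumed value; tune to the phone's performance cores
    ]


# Print the commands so they can be reviewed before running them on the device.
for cmd in (quantize_cmd(F16_MODEL, Q4_MODEL), bench_cmd(Q4_MODEL)):
    print(shlex.join(cmd))
```

The generation (`-n`) result is the figure comparable to the 6+ t/s reported in the post; prompt processing speed is measured separately.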

Who should care: Developers & AI Engineers

Web-grounded analysis with 7 cited sources.

•Qwen3.5-9B employs a hybrid architecture combining Gated DeltaNet with Sparse MoE, using a 3:1 ratio of linear to softmax attention for reduced memory and compute costs[1][3][7].
•The model supports a 262K token context window and native multimodality, processing text and visual data in a unified latent space[2][6][7].
•Qwen3.5-9B outperforms prior Qwen3-30B on MMLU and math benchmarks like GSM8K/MATH due to Scaled RL training[1][3].
•At Q4 quantization, it needs ~5GB of RAM; reported speeds are 20-30 t/s for CPU-only inference and 115-167 t/s in optimized desktop tests[3][5].
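
To make the memory argument behind the 3:1 layout concrete, here is a minimal sketch of how much key/value cache the softmax-attention layers would need at the full 262K context. The layer count, number of KV heads, head dimension, and the position of the softmax layer within each group are hypothetical values chosen for illustration, not the model's published configuration. The sketch also ignores the fixed-size state that the linear-attention layers keep.

```python
def layer_schedule(n_layers, linear_per_softmax=3):
    """Label each layer, placing one softmax layer after every 3 linear ones."""
    period = linear_per_softmax + 1
    return ["softmax" if (i + 1) % period == 0 else "linear"
            for i in range(n_layers)]


def kv_cache_bytes(n_softmax_layers, context_tokens, n_kv_heads, head_dim,
                   bytes_per_value=2):
    """KV cache size: keys and values (factor 2) for every cached token."""
    return (2 * n_softmax_layers * context_tokens
            * n_kv_heads * head_dim * bytes_per_value)


# Hypothetical configuration (NOT taken from the model card).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 4, 128
CONTEXT = 262_144  # the 262K-token window cited above

schedule = layer_schedule(N_LAYERS)
hybrid = kv_cache_bytes(schedule.count("softmax"), CONTEXT, N_KV_HEADS, HEAD_DIM)
all_softmax = kv_cache_bytes(N_LAYERS, CONTEXT, N_KV_HEADS, HEAD_DIM)

print(f"hybrid 3:1 layout : {hybrid / 2**30:.1f} GiB")
print(f"all-softmax layout: {all_softmax / 2**30:.1f} GiB")
```

Whatever the real dimensions are, only one layer in four grows its cache with context length under this layout, so the growing part of the cache is a quarter of what an all-softmax model of the same shape would need.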

•Hybrid architecture: Gated DeltaNet + Sparse Mixture-of-Experts (MoE), with 3:1 linear attention to softmax attention ratio, reducing computational cost for long contexts[3][7].
•Parameter count: 9.65B, natively multimodal vision-language model[6][7].
•Training: Scaled Reinforcement Learning (RL) optimizes logical reasoning, closing the gap with 30B+ models[1][2][3].
•Context window: 262K tokens[6].
•Quantized (Q4 GGUF): ~5GB RAM footprint, supports CUDA/NVIDIA GPU, Metal/Apple Silicon, or CPU inference[3].
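
The ~5GB figure can be sanity-checked with simple arithmetic. In ggml's Q4_0 format, each block of 32 weights is stored as one 16-bit scale plus 32 four-bit values, which is 18 bytes per block or 4.5 bits per weight. The sketch below applies that to the 9.65B parameter count. It assumes every tensor is stored as Q4_0, which real GGUF files do not strictly do, and it excludes the KV cache and runtime buffers, so treat the result as an approximation.

```python
Q4_0_BLOCK_WEIGHTS = 32       # weights per quantization block
Q4_0_BLOCK_BYTES = 2 + 16     # fp16 scale + 32 packed 4-bit values


def q4_0_bits_per_weight():
    """Effective storage cost per weight, including the per-block scale."""
    return Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS


def q4_0_weight_bytes(n_params):
    """Approximate bytes needed if all parameters were stored as Q4_0."""
    return n_params * Q4_0_BLOCK_BYTES / Q4_0_BLOCK_WEIGHTS


N_PARAMS = 9.65e9  # parameter count cited above
size = q4_0_weight_bytes(N_PARAMS)

print(f"{q4_0_bits_per_weight()} bits per weight")
print(f"{size / 1e9:.2f} GB ({size / 2**30:.2f} GiB) of weights")
```

This comes out at roughly 5.4 GB of weights, consistent with the ~5GB footprint cited for the Q4 GGUF and comfortably inside a 12GB phone.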

On-device AI apps will deploy 9B-scale models on smartphones by mid-2026

Qwen3.5-9B's 6+ t/s on the S25 Ultra and its ~5GB RAM footprint at Q4 enable real-time on-device inference without cloud dependency[1][3].
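
A common rule of thumb relates these two numbers: if token generation is limited by memory bandwidth and every weight byte is read once per token, then tokens per second is roughly the effective bandwidth divided by the model size. The sketch below applies that rule to the reported figures. Both conditions are assumptions; with a sparse MoE only the active experts are read per token, so the real traffic could be lower.

```python
def implied_bandwidth_gb_s(tokens_per_s, model_bytes):
    """Effective memory bandwidth implied by a measured decode speed."""
    return tokens_per_s * model_bytes / 1e9


def decode_tokens_per_s(bandwidth_gb_s, model_bytes):
    """Rough decode-speed ceiling for a given effective bandwidth."""
    return bandwidth_gb_s * 1e9 / model_bytes


MODEL_BYTES = 5e9   # ~5GB Q4 footprint cited above
MEASURED_TPS = 6    # 6+ t/s reported on the S25 Ultra

print(f"implied bandwidth: {implied_bandwidth_gb_s(MEASURED_TPS, MODEL_BYTES):.0f} GB/s")
```

Under these assumptions, 6 t/s on a ~5GB model corresponds to about 30 GB/s of effective memory traffic. The same formula gives a quick ceiling estimate for other devices once their sustained bandwidth is known.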

Hybrid attention architectures will become standard in sub-10B models

Gated DeltaNet + MoE in Qwen3.5-9B outperforms larger dense models on benchmarks while cutting compute costs[3][7].

Native multimodality will dominate edge AI by 2027

The Qwen3.5 series, from 4B upward, integrates vision and text in a shared latent space, which improves agent tasks such as UI navigation compared with adapter-based systems[1][2].

2025-04

Qwen3 released with dense and MoE models up to 235B parameters

2026-03

Qwen3.5 Small series launched including 0.8B-9B models optimized for edge devices

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗