๐Ÿฆ™Stalecollected in 53m

Qwen3 9B runs 6+ t/s on Android phones

Qwen3 9B runs 6+ t/s on Android phones
PostLinkedIn
๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’ก9B LLM hits 6t/s on phonesโ€”unlock mobile AI now

โšก 30-Second TL;DR

What Changed

Runs at q4_0 on S25 Ultra with 12GB RAM

Why It Matters

Shows large LLMs like 9B models are feasible on high-end Android phones, enabling edge AI apps without cloud dependency.

What To Do Next

Quantize Qwen3 9B to q4_0 and benchmark on your Android device with llama.cpp.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

Web-grounded analysis with 7 cited sources.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขQwen3.5-9B employs a hybrid architecture combining Gated DeltaNet with Sparse MoE, using a 3:1 ratio of linear to softmax attention for reduced memory and compute costs[1][3][7].
  • โ€ขThe model supports a 262K token context window and native multimodality, processing text and visual data in a unified latent space[2][6][7].
  • โ€ขQwen3.5-9B outperforms prior Qwen3-30B on MMLU and math benchmarks like GSM8K/MATH due to Scaled RL training[1][3].
  • โ€ขAt Q4 quantization, it runs on ~5GB RAM with CPU-only inference at 20-30 t/s, and higher speeds like 115-167 t/s in optimized desktop tests[3][5].

๐Ÿ› ๏ธ Technical Deep Dive

  • โ€ขHybrid architecture: Gated DeltaNet + Sparse Mixture-of-Experts (MoE), with 3:1 linear attention to softmax attention ratio, reducing computational cost for long contexts[3][7].
  • โ€ขParameter count: 9.65B, natively multimodal vision-language model[6][7].
  • โ€ขTraining: Scaled Reinforcement Learning (RL) optimizes logical reasoning, closing gap with 30B+ models[1][2][3].
  • โ€ขContext window: 262K tokens[6].
  • โ€ขQuantized (Q4 GGUF): ~5GB RAM footprint, supports CUDA/NVIDIA GPU, Metal/Apple Silicon, or CPU inference[3].

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

On-device AI apps will deploy 9B-scale models on smartphones by mid-2026
Qwen3.5-9B's 6+ t/s on S25 Ultra and 5GB RAM compatibility enable real-time inference without cloud dependency[1][3].
Hybrid attention architectures will become standard in sub-10B models
Gated DeltaNet + MoE in Qwen3.5-9B outperforms larger dense models on benchmarks while cutting compute costs[3][7].
Native multimodality will dominate edge AI by 2027
Qwen3.5 series from 4B integrates vision-text in shared latent space, boosting agent tasks like UI navigation over adapter systems[1][2].

โณ Timeline

2025-04
Qwen3 released with dense and MoE models up to 235B parameters
2026-03
Qwen3.5 Small series launched including 0.8B-9B models optimized for edge devices
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—