
Qwen3.5-35B-A3B RTX 5080 Benchmarks Update


💡 Proves KV q8_0 is a free speed boost for Qwen MoE on the RTX 5080; test now for 74 tok/s.

⚡ 30-Second TL;DR

What Changed

KV cache quantization to q8_0 confirmed as a 'free lunch': under 0.4% perplexity (PPL) change

Why It Matters

Faster local inference for MoE models on consumer GPUs: higher throughput with no measurable quality trade-off for AI builders running large models.

What To Do Next

Run llama.cpp with -ctk q8_0 -ctv q8_0 on your Qwen3.5-35B-A3B to quantize the KV cache and boost throughput 12-38%.
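As a minimal sketch of the same settings through the llama-cpp-python bindings (the GGUF filename and prompt are hypothetical; type_k/type_v take ggml type ids and correspond to the CLI's -ctk/-ctv):

```python
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA wheel)

GGML_TYPE_Q8_0 = 8  # ggml type id for q8_0

llm = Llama(
    model_path="Qwen3.5-35B-A3B-Q4_K_M.gguf",  # hypothetical local GGUF filename
    n_gpu_layers=-1,        # offload every layer to the GPU (RTX 5080)
    n_ctx=32768,            # working context; the model supports up to 262,144
    flash_attn=True,        # llama.cpp needs flash attention to quantize the V cache
    type_k=GGML_TYPE_Q8_0,  # K cache in q8_0, same as CLI -ctk q8_0
    type_v=GGML_TYPE_Q8_0,  # V cache in q8_0, same as CLI -ctv q8_0
)

out = llm("Summarize KV cache quantization in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```

The saving is mechanical: ggml's q8_0 packs 32 values into 34 bytes (32 int8s plus a 2-byte scale) versus 64 bytes for f16, so the KV cache shrinks by roughly half, which is where the extra headroom and throughput come from.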

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 7 cited sources.

🔑 Enhanced Key Takeaways

  • Qwen3.5-35B-A3B released on February 24, 2026, as part of Alibaba's Qwen3.5 series emphasizing 'more intelligence, less compute', with an MoE architecture outperforming larger predecessors[1][2][6].
  • Model supports a 262,144-token context length and native multimodal inputs (text, image, video), with benchmarks like GPQA 84.5% and Tau-Bench 89.2%[1][2][3].
  • Features Gated Delta Networks and sparse MoE (256 experts, 8 routed + 1 shared active) for efficient inference, comparable to the Qwen3.5-27B dense model[1][2].
  • Available via APIs with pricing at $0.25/1M input tokens and $2.00/1M output tokens; supports reasoning mode with step-by-step thinking[2][5] (cost math sketched below).
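To make that pricing concrete, a back-of-envelope cost helper (rates from the takeaway above; the example token counts are invented for illustration):

```python
# Cited Qwen3.5-35B-A3B API pricing[2][5]: $0.25 per 1M input tokens,
# $2.00 per 1M output tokens.
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 2.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the cited per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1e6

# Example: a 10,000-token prompt with a 2,000-token reply.
print(f"${request_cost(10_000, 2_000):.4f}")  # -> $0.0065
```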

๐Ÿ› ๏ธ Technical Deep Dive

  • 35B total parameters, 3B activated; hybrid architecture with linear attention, sparse Mixture-of-Experts (256 total experts, 8 routed + 1 shared active), RoPE positional embeddings, SwiGLU activations, RMSNorm[1][2][5] (activation ratio worked through after this list).
  • Unified vision-language foundation via early fusion training on multimodal tokens for reasoning, coding, agents, and visual understanding[1][7].
  • Scalable RL trained across million-agent environments; supports 201 languages/dialects and tool use[1].
  • 64 transformer layers[5].
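A quick arithmetic sketch of why inference is cheap relative to the parameter count, using only the figures cited above:

```python
# Per-token activation for Qwen3.5-35B-A3B, from the cited specs[1][2][5].
TOTAL_PARAMS = 35e9     # "35B total"
ACTIVE_PARAMS = 3e9     # "3B activated" (the A3B in the name)
TOTAL_EXPERTS = 256
ACTIVE_EXPERTS = 8 + 1  # 8 routed + 1 shared

print(f"experts active per token:    {ACTIVE_EXPERTS / TOTAL_EXPERTS:.1%}")  # 3.5%
print(f"parameters active per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")    # 8.6%
# Each token pays roughly the compute of a ~3B dense model while drawing on
# 35B parameters of capacity, which is what makes 74+ tok/s feasible on a
# single consumer GPU.
```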

🔮 Future Implications
AI analysis grounded in cited sources

  • MoE efficiency enables consumer-GPU local inference at frontier performance: RTX 5080 benchmarks show 74.7 tok/s with quantization, aligning with the community push for INT4 variants and 'local frontier' trends[4].
  • Qwen3.5 series accelerates the shift from parameter scaling to architectural optimization: 35B-A3B outperforms models over 6x larger via hybrid MoE and RL, as claimed by Alibaba[1][4].

โณ Timeline

2026-02-24
Qwen3.5 series release including Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, and Qwen3.5-27B by Alibaba Qwen team[6]
2026-02-25
Model added to platforms like Writingmate; early practitioner tests highlight strong performance[3][4]
2026-02-28
Community benchmarks on RTX 5080 with llama.cpp confirm quantization throughput gains for Qwen3.5-35B-A3B[article]

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗