
Qwen3.5-35B-A3B RTX 5080 Benchmarks Update


💡 Proves KV q8_0 is a free speed boost for Qwen MoE on the RTX 5080; test now for 74 tok/s.

⚡ 30-Second TL;DR

What Changed

KV cache quantization to q8_0 confirmed as a 'free lunch': under 0.4% perplexity (PPL) change

Why It Matters

Faster local inference for MoE models on consumer GPUs: higher throughput with no measurable quality trade-off for AI builders running large models.

What To Do Next

Run llama.cpp with -ctk q8_0 -ctv q8_0 on your Qwen3.5-35B-A3B to quantize the KV cache and boost throughput 12-38%.
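As a minimal sketch of the same settings through the llama-cpp-python bindings (the GGUF filename and prompt are hypothetical; type_k/type_v take ggml type ids and correspond to the CLI's -ctk/-ctv):

```python
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA wheel)

GGML_TYPE_Q8_0 = 8  # ggml type id for q8_0

llm = Llama(
    model_path="Qwen3.5-35B-A3B-Q4_K_M.gguf",  # hypothetical local GGUF filename
    n_gpu_layers=-1,        # offload every layer to the GPU (RTX 5080)
    n_ctx=32768,            # working context; the model supports up to 262,144
    flash_attn=True,        # llama.cpp needs flash attention to quantize the V cache
    type_k=GGML_TYPE_Q8_0,  # K cache in q8_0, same as CLI -ctk q8_0
    type_v=GGML_TYPE_Q8_0,  # V cache in q8_0, same as CLI -ctv q8_0
)

out = llm("Summarize KV cache quantization in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```

The saving is mechanical: ggml's q8_0 packs 32 values into 34 bytes (32 int8s plus a 2-byte scale) versus 64 bytes for f16, so the KV cache shrinks by roughly half, which is where the extra headroom and throughput come from.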

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 7 cited sources.

🔑 Enhanced Key Takeaways

  • Qwen3.5-35B-A3B released on February 24, 2026, as part of Alibaba's Qwen3.5 series emphasizing 'more intelligence, less compute', with an MoE architecture outperforming larger predecessors[1][2][6].
  • Model supports a 262,144-token context length and native multimodal inputs (text, image, video), with benchmarks like GPQA 84.5% and Tau-Bench 89.2%[1][2][3].
  • Features Gated Delta Networks and sparse MoE (256 experts, 8 routed + 1 shared active) for efficient inference, comparable to the Qwen3.5-27B dense model[1][2].
  • Available via APIs with pricing at $0.25/1M input tokens and $2.00/1M output tokens; supports reasoning mode with step-by-step thinking[2][5] (cost math sketched below).
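To make that pricing concrete, a back-of-envelope cost helper (rates from the takeaway above; the example token counts are invented for illustration):

```python
# Cited Qwen3.5-35B-A3B API pricing[2][5]: $0.25 per 1M input tokens,
# $2.00 per 1M output tokens.
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 2.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the cited per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1e6

# Example: a 10,000-token prompt with a 2,000-token reply.
print(f"${request_cost(10_000, 2_000):.4f}")  # -> $0.0065
```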

๐Ÿ› ๏ธ Technical Deep Dive

  • 35B total parameters, 3B activated; hybrid architecture with linear attention, sparse Mixture-of-Experts (256 total experts, 8 routed + 1 shared active), RoPE positional embeddings, SwiGLU activations, RMSNorm[1][2][5] (activation ratio worked through after this list).
  • Unified vision-language foundation via early fusion training on multimodal tokens for reasoning, coding, agents, and visual understanding[1][7].
  • Scalable RL trained across million-agent environments; supports 201 languages/dialects and tool use[1].
  • 64 transformer layers[5].
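A quick arithmetic sketch of why inference is cheap relative to the parameter count, using only the figures cited above:

```python
# Per-token activation for Qwen3.5-35B-A3B, from the cited specs[1][2][5].
TOTAL_PARAMS = 35e9     # "35B total"
ACTIVE_PARAMS = 3e9     # "3B activated" (the A3B in the name)
TOTAL_EXPERTS = 256
ACTIVE_EXPERTS = 8 + 1  # 8 routed + 1 shared

print(f"experts active per token:    {ACTIVE_EXPERTS / TOTAL_EXPERTS:.1%}")  # 3.5%
print(f"parameters active per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")    # 8.6%
# Each token pays roughly the compute of a ~3B dense model while drawing on
# 35B parameters of capacity, which is what makes 74+ tok/s feasible on a
# single consumer GPU.
```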

🔮 Future Implications
AI analysis grounded in cited sources

  • MoE efficiency enables consumer-GPU local inference at frontier performance: RTX 5080 benchmarks show 74.7 tok/s with quantization, aligning with the community push for INT4 variants and 'local frontier' trends[4].
  • Qwen3.5 series accelerates the shift from parameter scaling to architectural optimization: 35B-A3B outperforms models over 6x larger via hybrid MoE and RL, as claimed by Alibaba[1][4].

โณ Timeline

2026-02-24
Qwen3.5 series release including Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, and Qwen3.5-27B by Alibaba Qwen team[6]
2026-02-25
Model added to platforms like Writingmate; early practitioner tests highlight strong performance[3][4]
2026-02-28
Community benchmarks on RTX 5080 with llama.cpp confirm quantization throughput gains for Qwen3.5-35B-A3B[article]

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗