
Qwen3.5-35B GGUF Quants Benchmarked

🦙 Read original on Reddit r/LocalLLaMA

💡 Detailed KLD and speed benchmarks for Qwen3.5-35B quants: pick the best one for your GPU (up to 143 t/s)

⚡ 30-Second TL;DR

What Changed

KLD measured on mixed FLORES 200 + calibration_data_v5_rc.txt datasets

Why It Matters

Helps GPU-limited users pick the optimal Qwen3.5-35B quant for their quality-versus-speed trade-off, and highlights the quantization trade-offs inherent in local inference.

What To Do Next

Download a top-scoring KLD quant such as Unsloth's UD-Q4_K_XL from Hugging Face and benchmark it on your RTX setup.
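A minimal command sketch for that workflow, assuming llama.cpp is installed; the repo and file names below are illustrative placeholders, so check Unsloth's actual Hugging Face page for the exact quant you want:

```shell
# Hypothetical repo/file names -- verify against the real Unsloth listing.
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  --include "*UD-Q4_K_XL*" --local-dir ./models

# Measure prompt-processing (-p) and generation (-n) speed with llama-bench,
# offloading all layers to the GPU (-ngl 99).
llama-bench -m ./models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 -p 512 -n 128
```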

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 5 cited sources.

🔑 Enhanced Key Takeaways

  • Unsloth conducted over 150 KL Divergence benchmarks totaling 9TB of GGUF files, identifying optimal quantization for Qwen3.5 MoE tensors like ffn_up_exps and ffn_gate_exps at 3-bit while avoiding heavy quantization on attn_* layers[1].
  • A March 5th, 2026 update to Unsloth's quantization method significantly reduced Maximum KLD for Qwen3.5-35B quants, with UD-Q4_K_XL dropping 51% from 5.894 to 2.877[1].
  • Qwen3.5-35B-A3B is a mixture-of-experts model with 35B total parameters, activating top-9 out of 256 experts per token via routing, enabling efficient compute and strong benchmarks like 84.2 on GPQA Diamond[2].
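The top-9-of-256 routing mentioned above can be sketched as follows. This is a toy illustration only; Qwen's actual router, weight normalization, and any shared-expert handling may differ:

```python
import numpy as np

def route_tokens(hidden, router_w, top_k=9):
    """Toy top-k MoE router: score every expert per token, keep the
    top-k, and softmax-renormalize their weights (one common scheme)."""
    logits = hidden @ router_w                            # (tokens, num_experts)
    topk_idx = np.argsort(logits, axis=-1)[:, -top_k:]    # top-k expert indices
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # weights sum to 1
    return topk_idx, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 64))      # 4 tokens, hidden size 64 (toy sizes)
router_w = rng.standard_normal((64, 256))  # router projection to 256 experts
idx, w = route_tokens(hidden, router_w)
print(idx.shape, w.shape)  # (4, 9) (4, 9) -- 9 of 256 experts per token
```

Because only 9 expert FFNs run per token, compute scales with the ~3B active parameters rather than the full 35B, which is why the "A3B" model is fast despite its total size.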

🛠️ Technical Deep Dive

  • Qwen3.5-35B-A3B uses a gated delta network for efficient long-context reasoning; the GGUF Q8_1 quant (37GB) fits full GPU offload with 32K context in under 37GB VRAM, with ~0.1 perplexity loss vs BF16[2].
  • Unsloth Dynamic quants avoid MXFP4 on sensitive tensors like attn_qkv, attn_gate, and ssm_beta/alpha, preferring Q4_K (4.5 bits/weight) over MXFP4 (4.25 bits/weight) for better KLD despite the similar bit widths[1][3].
  • KLD and perplexity are non-monotonic across bit widths; e.g., Q3_K can outperform Q4_K in some cases due to tensor-specific sensitivities[1][3].
  • Optimized inference achieves 125 t/s on 16GB NVIDIA GPUs with the --parallel 1 flag (a 10x speedup) and supports contexts up to 155,904 tokens; 200K context runs at 62.98 t/s on an RTX 5080 with a Q4 quant[3][4].
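The KLD metric used throughout these benchmarks compares the quantized model's per-token output distribution against a full-precision reference. A minimal sketch of the idea (it mirrors the concept behind llama.cpp's KL-divergence measurement, not its exact implementation):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(ref_logits, quant_logits):
    """Mean per-token KL(P_ref || P_quant) over a batch of positions:
    0 means the quant reproduces the reference exactly; higher is worse."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(1)
ref = rng.standard_normal((8, 32000))                 # 8 positions, 32k vocab
noisy = ref + 0.05 * rng.standard_normal(ref.shape)   # simulated quant error
print(round(mean_kld(ref, ref), 6))  # 0.0 -- identical distributions
print(mean_kld(ref, noisy) > 0)      # True -- any perturbation raises KLD
```

Unlike perplexity, which only scores the probability of the correct next token, KLD penalizes any shift in the full distribution, which is why it is the preferred quality signal for comparing quants.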

🔮 Future Implications

AI analysis grounded in cited sources.

Unsloth Dynamic quants will become the standard for Qwen3.5 MoE local deployment on consumer GPUs.
Their SOTA KLD scores across bits, 9TB of benchmarks, and speeds up to 125 t/s on 16GB VRAM demonstrate superior quality and accessibility over prior methods[1][3][4].
MoE quantization advances will enable 35B-scale models on 16GB hardware without quality loss.
Tensor-specific optimizations like 3-bit FFN and preserved high-precision attention, validated by reduced max KLD post-March 2026 update, minimize degradation while fitting tight VRAM[1][2].
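As a back-of-the-envelope check on whether a given quant fits your card, file size scales roughly with bits per weight. The estimator below is illustrative only; real GGUF files add metadata and mix per-tensor bit widths, so actual sizes will differ somewhat:

```python
def gguf_size_gb(params_b, bits_per_weight):
    """Rough GGUF file size in GB: parameters (billions) * bits/weight / 8.
    Ignores metadata and the mixed per-tensor precision real quants use."""
    return params_b * bits_per_weight / 8

# Typical effective bits/weight (Q4_K and MXFP4 figures from the article;
# Q8_0's 8.5 is the llama.cpp block format with one scale per 32 weights).
for name, bpw in [("Q4_K", 4.5), ("MXFP4", 4.25), ("Q8_0", 8.5)]:
    print(f"{name}: ~{gguf_size_gb(35, bpw):.1f} GB for a 35B model")
```

The Q8_0 estimate (~37.2 GB) lines up with the 37GB Q8 figure cited above, and the ~19.7 GB Q4_K estimate explains why 4-bit quants are the sweet spot for 24GB-class GPUs.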

โณ Timeline

2026-03
Qwen3.5-35B-A3B model release by Alibaba, introducing 256-expert MoE architecture
2026-03-05
Unsloth releases updated Dynamic GGUF quants for Qwen3.5-35B with reduced Maximum KLD
2026-03-16
Reddit r/LocalLLaMA benchmarks Unsloth UD-Q4_K_XL as top Qwen3.5-35B GGUF quant

📎 Sources (5)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.

  1. unsloth.ai — Gguf Benchmarks
  2. sonusahani.com — Qwen 35b
  3. news.ycombinator.com — Item
  4. GitHub — Qwen 3.5 16g Vram Local
  5. modelscope.cn — Qwen3.5 35b A3b Gguf

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗