
Qwen3.5-35B GGUF Quants Benchmarked

🦙 Read original on Reddit r/LocalLLaMA

💡 Detailed KLD and speed benchmarks for Qwen3.5-35B quants: pick the best one for your GPU (up to 143 t/s)

⚡ 30-Second TL;DR

What Changed

KLD measured on mixed FLORES 200 + calibration_data_v5_rc.txt datasets

Why It Matters

Helps GPU-limited users pick the optimal Qwen3.5-35B quant for their quality-versus-speed trade-off, and highlights the quantization trade-offs inherent in local inference.

What To Do Next

Download a top-scoring KLD quant such as Unsloth's UD-Q4_K_XL from Hugging Face and benchmark it on your RTX setup.
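A minimal command sketch for that workflow, assuming llama.cpp is installed; the repo and file names below are illustrative placeholders, so check Unsloth's actual Hugging Face page for the exact quant you want:

```shell
# Hypothetical repo/file names -- verify against the real Unsloth listing.
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  --include "*UD-Q4_K_XL*" --local-dir ./models

# Measure prompt-processing (-p) and generation (-n) speed with llama-bench,
# offloading all layers to the GPU (-ngl 99).
llama-bench -m ./models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 -p 512 -n 128
```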

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 5 cited sources.

🔑 Enhanced Key Takeaways

  • Unsloth conducted over 150 KL Divergence benchmarks totaling 9TB of GGUF files, identifying optimal quantization for Qwen3.5 MoE tensors like ffn_up_exps and ffn_gate_exps at 3-bit while avoiding heavy quantization on attn_* layers[1].
  • A March 5th, 2026 update to Unsloth's quantization method significantly reduced Maximum KLD for Qwen3.5-35B quants, with UD-Q4_K_XL dropping 51% from 5.894 to 2.877[1].
  • Qwen3.5-35B-A3B is a mixture-of-experts model with 35B total parameters, activating top-9 out of 256 experts per token via routing, enabling efficient compute and strong benchmarks like 84.2 on GPQA Diamond[2].
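The top-9-of-256 routing mentioned above can be sketched as follows. This is a toy illustration only; Qwen's actual router, weight normalization, and any shared-expert handling may differ:

```python
import numpy as np

def route_tokens(hidden, router_w, top_k=9):
    """Toy top-k MoE router: score every expert per token, keep the
    top-k, and softmax-renormalize their weights (one common scheme)."""
    logits = hidden @ router_w                            # (tokens, num_experts)
    topk_idx = np.argsort(logits, axis=-1)[:, -top_k:]    # top-k expert indices
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # weights sum to 1
    return topk_idx, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 64))      # 4 tokens, hidden size 64 (toy sizes)
router_w = rng.standard_normal((64, 256))  # router projection to 256 experts
idx, w = route_tokens(hidden, router_w)
print(idx.shape, w.shape)  # (4, 9) (4, 9) -- 9 of 256 experts per token
```

Because only 9 expert FFNs run per token, compute scales with the ~3B active parameters rather than the full 35B, which is why the "A3B" model is fast despite its total size.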

🛠️ Technical Deep Dive

  • Qwen3.5-35B-A3B uses a gated delta network for efficient long-context reasoning; the GGUF Q8_1 quant (37GB) fits full GPU offload with 32K context in under 37GB VRAM, with ~0.1 perplexity loss vs BF16[2].
  • Unsloth Dynamic quants avoid MXFP4 on sensitive tensors like attn_qkv, attn_gate, and ssm_beta/alpha, preferring Q4_K (4.5 bits/weight) over MXFP4 (4.25 bits/weight) for better KLD despite the similar bit widths[1][3].
  • KLD and perplexity are non-monotonic across bit widths; e.g., Q3_K can outperform Q4_K in some cases due to tensor-specific sensitivities[1][3].
  • Optimized inference achieves 125 t/s on 16GB NVIDIA GPUs with the --parallel 1 flag (a 10x speedup) and supports contexts up to 155,904 tokens; 200K context runs at 62.98 t/s on an RTX 5080 with a Q4 quant[3][4].
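The KLD metric used throughout these benchmarks compares the quantized model's per-token output distribution against a full-precision reference. A minimal sketch of the idea (it mirrors the concept behind llama.cpp's KL-divergence measurement, not its exact implementation):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(ref_logits, quant_logits):
    """Mean per-token KL(P_ref || P_quant) over a batch of positions:
    0 means the quant reproduces the reference exactly; higher is worse."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(1)
ref = rng.standard_normal((8, 32000))                 # 8 positions, 32k vocab
noisy = ref + 0.05 * rng.standard_normal(ref.shape)   # simulated quant error
print(round(mean_kld(ref, ref), 6))  # 0.0 -- identical distributions
print(mean_kld(ref, noisy) > 0)      # True -- any perturbation raises KLD
```

Unlike perplexity, which only scores the probability of the correct next token, KLD penalizes any shift in the full distribution, which is why it is the preferred quality signal for comparing quants.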

🔮 Future Implications

AI analysis grounded in cited sources.

Unsloth Dynamic quants will become the standard for Qwen3.5 MoE local deployment on consumer GPUs.
Their SOTA KLD scores across bits, 9TB of benchmarks, and speeds up to 125 t/s on 16GB VRAM demonstrate superior quality and accessibility over prior methods[1][3][4].
MoE quantization advances will enable 35B-scale models on 16GB hardware without quality loss.
Tensor-specific optimizations like 3-bit FFN and preserved high-precision attention, validated by reduced max KLD post-March 2026 update, minimize degradation while fitting tight VRAM[1][2].
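As a back-of-the-envelope check on whether a given quant fits your card, file size scales roughly with bits per weight. The estimator below is illustrative only; real GGUF files add metadata and mix per-tensor bit widths, so actual sizes will differ somewhat:

```python
def gguf_size_gb(params_b, bits_per_weight):
    """Rough GGUF file size in GB: parameters (billions) * bits/weight / 8.
    Ignores metadata and the mixed per-tensor precision real quants use."""
    return params_b * bits_per_weight / 8

# Typical effective bits/weight (Q4_K and MXFP4 figures from the article;
# Q8_0's 8.5 is the llama.cpp block format with one scale per 32 weights).
for name, bpw in [("Q4_K", 4.5), ("MXFP4", 4.25), ("Q8_0", 8.5)]:
    print(f"{name}: ~{gguf_size_gb(35, bpw):.1f} GB for a 35B model")
```

The Q8_0 estimate (~37.2 GB) lines up with the 37GB Q8 figure cited above, and the ~19.7 GB Q4_K estimate explains why 4-bit quants are the sweet spot for 24GB-class GPUs.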

โณ Timeline

2026-03
Qwen3.5-35B-A3B model release by Alibaba, introducing 256-expert MoE architecture
2026-03-05
Unsloth releases updated Dynamic GGUF quants for Qwen3.5-35B with reduced Maximum KLD
2026-03-16
Reddit r/LocalLLaMA benchmarks Unsloth UD-Q4_K_XL as top Qwen3.5-35B GGUF quant

📎 Sources (5)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.

  1. unsloth.ai — Gguf Benchmarks
  2. sonusahani.com — Qwen 35b
  3. news.ycombinator.com — Item
  4. GitHub — Qwen 3.5 16g Vram Local
  5. modelscope.cn — Qwen3.5 35b A3b Gguf

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗