Qwen3.5-35B GGUF Quants Benchmarked
💡 Detailed KLD/speed benchmarks for Qwen3.5-35B quants: pick the best for your GPU (up to 143 t/s)
⚡ 30-Second TL;DR
What Changed
KLD measured on a mix of the FLORES-200 and calibration_data_v5_rc.txt datasets
Why It Matters
Helps GPU-limited users pick optimal Qwen3.5-35B quants for quality vs speed. Highlights quantization trade-offs in local inference.
What To Do Next
Download a top-scoring KLD quant such as unsloth_UD-Q4_K_XL from Hugging Face and benchmark it on your RTX setup.
🧠 Deep Insight
Web-grounded analysis with 5 cited sources.
📌 Enhanced Key Takeaways
- Unsloth conducted over 150 KL Divergence benchmarks totaling 9TB of GGUF files, identifying optimal quantization for Qwen3.5 MoE tensors like ffn_up_exps and ffn_gate_exps at 3-bit while avoiding heavy quantization on attn_* layers[1].
- A March 5th, 2026 update to Unsloth's quantization method significantly reduced maximum KLD for Qwen3.5-35B quants, with UD-Q4_K_XL dropping 51% from 5.894 to 2.877[1].
- Qwen3.5-35B-A3B is a mixture-of-experts model with 35B total parameters, activating the top 9 of 256 experts per token via routing, enabling efficient compute and strong benchmark results such as 84.2 on GPQA Diamond[2].
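The top-k expert routing described above can be sketched as follows. This is a minimal illustration, not Qwen's actual router implementation: the softmax-gated linear router, tensor shapes, and function names are all assumptions for demonstration.

```python
import numpy as np

def route_tokens(hidden, router_w, top_k=9):
    """Toy MoE router: each token selects its top_k experts
    (Qwen3.5-35B-A3B reportedly activates 9 of 256 per token)."""
    logits = hidden @ router_w                            # (tokens, num_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of the top_k experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Renormalize gate weights over the selected experts only
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)
    return top_idx, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 64))                     # 4 tokens, hidden dim 64
router_w = rng.standard_normal((64, 256))                 # router over 256 experts
idx, gates = route_tokens(hidden, router_w)
print(idx.shape, gates.shape)                             # (4, 9) (4, 9)
```

Because only 9 of 256 expert FFNs run per token, compute per token stays near that of a much smaller dense model, which is why these MoE quants are attractive on consumer GPUs.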
🛠️ Technical Deep Dive
- Qwen3.5-35B-A3B uses a gated delta network for efficient long-context reasoning; the GGUF Q8_1 quant (37GB) fits full GPU offload with 32K context in under 37GB VRAM, at ~0.1 perplexity loss vs BF16[2].
- Unsloth Dynamic quants avoid MXFP4 on sensitive tensors like attn_qkv, attn_gate, and ssm_beta/alpha, preferring Q4_K (4.5 bits/weight) over MXFP4 (4.25 bits/weight) for better KLD despite the similar bitwidths[1][3].
- KLD and perplexity are non-monotonic across bitwidths; e.g., Q3_K can outperform Q4_K in some cases due to tensor-specific sensitivities[1][3].
- Optimized inference achieves 125 t/s on 16GB NVIDIA GPUs with the --parallel 1 flag (a 10x speedup) and supports contexts up to 155,904 tokens; 200k context runs at 62.98 t/s on an RTX 5080 with a Q4 quant[3][4].
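The KLD figures quoted throughout compare the quantized model's next-token distribution against the full-precision reference, token by token. A minimal sketch of that metric, assuming you already have logits from both models (the simulated "quantization noise" below is purely illustrative):

```python
import numpy as np

def token_kld(logits_ref, logits_quant):
    """Per-token KL divergence D_KL(P_ref || P_quant).
    logits_*: (tokens, vocab) arrays from the BF16 reference and the quant.
    Returns (mean KLD, max KLD) over the sequence."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    logp = log_softmax(logits_ref)
    logq = log_softmax(logits_quant)
    kld = (np.exp(logp) * (logp - logq)).sum(axis=-1)     # one KLD value per token
    return kld.mean(), kld.max()

rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 1000))                      # 8 tokens, vocab 1000
quant = ref + 0.05 * rng.standard_normal((8, 1000))       # simulated quantization noise
mean_kld, max_kld = token_kld(ref, quant)
print(mean_kld >= 0.0, max_kld >= mean_kld)               # True True
```

In practice, benchmarks like these are typically produced with llama.cpp's perplexity tool, which can log reference logits and report KL divergence for a quant against them; mean KLD tracks average fidelity while max KLD (the statistic in the update above) flags worst-case token-level drift.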
📚 Sources (5)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA →
