Qwen3.5-9B GGUF Quant Rankings by KLD

๐กData-driven GGUF quant guide: pick best Qwen3.5-9B file by KLD, not size alone.
โก 30-Second TL;DR
What Changed
Lowest KLD: Q8_0 (0.000814), unsloth UD-Q8_K_XL (0.000895)
Why It Matters
Guides quant selection for optimal fidelity vs size tradeoffs, helping deploy Qwen3.5-9B efficiently on consumer hardware. Exposes quantizer quality variances.
What To Do Next
Download bartowski Q4_K_S GGUF for Qwen3.5-9B to balance size and low KLD (0.0108).
๐ง Deep Insight
Web-grounded analysis with 8 cited sources.
๐ Enhanced Key Takeaways
- โขUnsloth's March 5th 2026 update enhanced quantization for Qwen3.5 MoEs, reducing Maximum KLD significantly beyond 99.9% metrics by improving outlier handling[2].
- โขQwen3.5-9B abliterated (uncensored) GGUF versions underperform even Q4_K_L quants at Q6_K levels, showing poor preservation of capabilities post-abliteration[1].
- โขImatrix calibration substantially improves low-bit quantization performance across all Unsloth quants, particularly reducing KLD for sensitive tensors like ssm_out at 2 bits[2].
- โขAttn_* tensors and ssm_out are highly sensitive to heavy quantization in Qwen3.5's hybrid architecture, recommending higher precision to minimize KLD spikes[2].
๐ Competitor Analysisโธ Show
| Quantizer | Key Features | Benchmark Strength (KLD/PPL) | VRAM Efficiency | Release/Update |
|---|---|---|---|---|
| bartowski | llama.cpp imatrix quants, IQ4_XS optimized | Lowest KLD in VRAM-limited (IQ4_XS: 0.0127), Q4_K_S standout | 4.93-5.18 GiB for top | Ongoing[6] |
| unsloth | Dynamic UD quants, SOTA on 150+ KLD benchmarks, imatrix | Wins efficiency (UD-Q3_K_XL), post-Mar2026 max KLD reduction | Competitive low-bit | Mar 5 2026 update[2][4] |
| Standard llama.cpp | Baseline Q4_K_M etc | Beaten by bartowski on Q4_K_M (0.0087 vs unsloth 0.0222) | Standard | Used in evals[1] |
๐ ๏ธ Technical Deep Dive
- โขArchitecture: Dense Transformer Decoder with 32 layers, hidden dimension 4096, 32 attention heads (16 for QK), gated Delta Networks + sparse MoE for hybrid efficiency[3][4].
- โขContext: 128K tokens, vocabulary ~150K, supports FP16/INT8/INT4 precisions, consumer GPU compatible (RTX 3060/4060 quantized)[3].
- โขQuant sensitivity: ffn_up_exps/ffn_gate_exps tolerate 3-bit; attn_* and ssm_out/*beta/alpha highly sensitiveโavoid heavy quant or MXFP4 (worse than Q4_K at 4.5 bits)[2].
- โขEmbed/output: Some quants (Q3_K_XL, Q4_K_L) use Q8_0 for embeddings/outputs instead of defaults[6].
๐ฎ Future ImplicationsAI analysis grounded in cited sources
โณ Timeline
๐ Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- kaitchup.substack.com โ Summary of Qwen35 Gguf Evaluations
- unsloth.ai โ Gguf Benchmarks
- zimage.run โ Qwen3.5 9b Complete Guide
- Hugging Face โ Qwen3.5 9b Gguf
- Hugging Face โ Qwen3.5 2b Gguf
- Hugging Face โ Qwen Qwen3.5 9b Gguf
- kaitchup.substack.com โ More Qwen35 Gguf Evals and Speculative
- qwenlm.github.io โ Qwen3
Weekly AI Recap
Read this week's curated digest of top AI events โ
๐Related Updates
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ