Qwen3.5-9B GGUF Quant Rankings by KLD


🦙Read original on Reddit r/LocalLLaMA

#quantization #kld #ppl #benchmark #qwen3.5-9b

💡 Data-driven GGUF quant guide: pick the best Qwen3.5-9B file by KLD, not by size alone.

⚡ 30-Second TL;DR

What Changed

Lowest KLD: Q8_0 (0.000814), unsloth UD-Q8_K_XL (0.000895)
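For readers new to the metric: KLD here is the Kullback-Leibler divergence between the next-token distribution of the full-precision model and that of the quantized model, averaged over tokens, so lower means the quant tracks the original more closely. The sketch below is a minimal illustration in plain Python, not the benchmark's actual tooling; the function names and the toy logits are assumptions made for this example.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference, q the quantized model."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(ref_logits, quant_logits):
    """Average per-token KLD over a sequence of logit vectors."""
    per_token = [
        kl_divergence(softmax(r), softmax(s))
        for r, s in zip(ref_logits, quant_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy logits for two token positions (illustrative values, not real model output).
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quant = [[1.9, 1.1, 0.1], [0.4, 2.6, 0.2]]
print(f"mean KLD: {mean_kld(ref, quant):.6f}")
```

A quant that reproduces the reference logits exactly scores 0; the reported values such as 0.000814 for Q8_0 are averages of this kind over an evaluation text.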

Why It Matters

Guides quant selection for the best fidelity-versus-size tradeoff, helping deploy Qwen3.5-9B efficiently on consumer hardware. It also exposes quality differences between quantizers.

What To Do Next

Download bartowski Q4_K_S GGUF for Qwen3.5-9B to balance size and low KLD (0.0108).
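The selection rule behind this advice can be sketched as "lowest KLD among the files that fit your memory budget". In the example below the KLD values are the ones quoted in this article, but every file size is a placeholder chosen for illustration, not a measured value; the `Quant` class and `best_under_budget` function are likewise hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Quant:
    name: str
    size_gib: float  # PLACEHOLDER sizes below, not measured file sizes
    kld: float       # mean KLD as quoted in the article

def best_under_budget(quants: List[Quant], budget_gib: float) -> Optional[Quant]:
    """Return the lowest-KLD quant whose file fits in the budget, or None."""
    fitting = [q for q in quants if q.size_gib <= budget_gib]
    return min(fitting, key=lambda q: q.kld) if fitting else None

candidates = [
    Quant("bartowski IQ4_XS", size_gib=4.8, kld=0.0127),  # size is a placeholder
    Quant("bartowski Q4_K_S", size_gib=5.0, kld=0.0108),  # size is a placeholder
    Quant("Q8_0", size_gib=9.0, kld=0.000814),            # size is a placeholder
]

choice = best_under_budget(candidates, budget_gib=6.0)
print(choice.name if choice else "nothing fits")
```

Swap in the real file sizes from the model repository before relying on the result; with a larger budget the same function simply returns the Q8_0 entry.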

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

•Unsloth's March 5, 2026 update enhanced quantization for Qwen3.5 MoEs, significantly reducing maximum KLD beyond the 99.9% metrics by improving outlier handling[2].
•Abliterated (uncensored) Qwen3.5-9B GGUF versions at Q6_K underperform even standard Q4_K_L quants, showing that capabilities are poorly preserved after abliteration[1].
•Imatrix calibration substantially improves low-bit quantization performance across all Unsloth quants, particularly by reducing KLD for sensitive tensors such as ssm_out at 2 bits[2].
•Attn_* tensors and ssm_out are highly sensitive to heavy quantization in Qwen3.5's hybrid architecture, so higher precision is recommended for them to minimize KLD spikes[2].

📊 Competitor Analysis

Quantizer	Key Features	Benchmark Strength (KLD/PPL)	VRAM Efficiency	Release/Update
bartowski	llama.cpp imatrix quants, IQ4_XS optimized	Lowest KLD in VRAM-limited (IQ4_XS: 0.0127), Q4_K_S standout	4.93-5.18 GiB for top	Ongoing[6]
unsloth	Dynamic UD quants, SOTA on 150+ KLD benchmarks, imatrix	Wins efficiency (UD-Q3_K_XL), post-Mar2026 max KLD reduction	Competitive low-bit	Mar 5 2026 update[2][4]
Standard llama.cpp	Baseline Q4_K_M etc	Beaten by bartowski on Q4_K_M (0.0087 vs unsloth 0.0222)	Standard	Used in evals[1]

🛠️ Technical Deep Dive

•Architecture: Dense Transformer Decoder with 32 layers, hidden dimension 4096, 32 attention heads (16 for QK), gated Delta Networks + sparse MoE for hybrid efficiency[3][4].
•Context: 128K tokens, vocabulary ~150K, supports FP16/INT8/INT4 precisions, consumer GPU compatible (RTX 3060/4060 quantized)[3].
•Quant sensitivity: ffn_up_exps/ffn_gate_exps tolerate 3-bit; attn_* and ssm_out/*beta/alpha are highly sensitive, so avoid heavy quantization or MXFP4 for them (worse than Q4_K at 4.5 bits)[2].
•Embed/output: Some quants (Q3_K_XL, Q4_K_L) use Q8_0 for embeddings/outputs instead of defaults[6].
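
The per-tensor sensitivity notes above amount to a rule table: match a tensor name against patterns and assign a precision tier. The sketch below illustrates that idea only; the tier labels, the rule list, and the example tensor names are assumptions for this example and do not reproduce any quantizer's actual recipe.

```python
from fnmatch import fnmatchcase

# (pattern, tier) pairs checked in order; the first match wins.
# Patterns follow the tensor families named in the list above; tiers are hypothetical labels.
SENSITIVITY_RULES = [
    ("*attn_*", "high"),        # attention tensors: sensitive, keep more bits
    ("*ssm_out*", "high"),      # ssm_out: sensitive
    ("*beta*", "high"),         # *beta tensors: sensitive
    ("*alpha*", "high"),        # *alpha tensors: sensitive
    ("*ffn_up_exps*", "low"),   # reported to tolerate 3-bit
    ("*ffn_gate_exps*", "low"), # reported to tolerate 3-bit
]

def precision_tier(tensor_name, rules=SENSITIVITY_RULES, default="default"):
    """Return the precision tier for a tensor name, or the default tier."""
    for pattern, tier in rules:
        if fnmatchcase(tensor_name, pattern):
            return tier
    return default

# Example tensor names are illustrative, written in a llama.cpp-like style.
for name in ["blk.0.attn_q.weight", "blk.3.ffn_up_exps.weight", "token_embd.weight"]:
    print(name, "->", precision_tier(name))
```

Mapping each tier to a concrete quant type (and deciding what the default should be) is left open here, since the right choice depends on the target file size.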

🔮 Future Implications

AI analysis grounded in cited sources

Unsloth UD quants will dominate low-VRAM deployments by Q2 2026

March 2026 updates achieve SOTA across 150+ benchmarks with imatrix and max KLD reductions, outperforming rivals in efficiency metrics[2].

Abliterated Qwen3.5-9B quants remain unsuitable for production below Q8_0

Evaluations show even Q6_K abliterated versions significantly underperform standard Q4_K_L baselines[1].

Hybrid tensor quantization recipes will standardize for Qwen3.5 MoEs

Sensitivity analysis identifies optimal per-tensor bits (e.g., higher for attn/ssm), enabling 20-30% better KLD at low bits via imatrix[2].

⏳ Timeline

2026-01

Qwen3.5-9B released by Alibaba Qwen team as open-source dense model with 128K context

2026-02

Unsloth releases initial Qwen3.5-9B GGUF benchmarks and dynamic UD quants

2026-03-05

Unsloth updates quantization for MoEs, reducing max KLD with imatrix enhancements

2026-03-12

Reddit r/LocalLLaMA publishes KLD/PPL rankings of 46 Qwen3.5-9B GGUF quants

📎 Sources (8)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
