๐Ÿฆ™Stalecollected in 13h

Qwen3.5-9B GGUF Quant Rankings by KLD

Qwen3.5-9B GGUF Quant Rankings by KLD
PostLinkedIn
๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’กData-driven GGUF quant guide: pick best Qwen3.5-9B file by KLD, not size alone.

โšก 30-Second TL;DR

What Changed

Lowest KLD: Q8_0 (0.000814), unsloth UD-Q8_K_XL (0.000895)

Why It Matters

Guides quant selection for optimal fidelity vs size tradeoffs, helping deploy Qwen3.5-9B efficiently on consumer hardware. Exposes quantizer quality variances.

What To Do Next

Download bartowski Q4_K_S GGUF for Qwen3.5-9B to balance size and low KLD (0.0108).

Who should care:Researchers & Academics

๐Ÿง  Deep Insight

Web-grounded analysis with 8 cited sources.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขUnsloth's March 5th 2026 update enhanced quantization for Qwen3.5 MoEs, reducing Maximum KLD significantly beyond 99.9% metrics by improving outlier handling[2].
  • โ€ขQwen3.5-9B abliterated (uncensored) GGUF versions underperform even Q4_K_L quants at Q6_K levels, showing poor preservation of capabilities post-abliteration[1].
  • โ€ขImatrix calibration substantially improves low-bit quantization performance across all Unsloth quants, particularly reducing KLD for sensitive tensors like ssm_out at 2 bits[2].
  • โ€ขAttn_* tensors and ssm_out are highly sensitive to heavy quantization in Qwen3.5's hybrid architecture, recommending higher precision to minimize KLD spikes[2].
๐Ÿ“Š Competitor Analysisโ–ธ Show
QuantizerKey FeaturesBenchmark Strength (KLD/PPL)VRAM EfficiencyRelease/Update
bartowskillama.cpp imatrix quants, IQ4_XS optimizedLowest KLD in VRAM-limited (IQ4_XS: 0.0127), Q4_K_S standout4.93-5.18 GiB for topOngoing[6]
unslothDynamic UD quants, SOTA on 150+ KLD benchmarks, imatrixWins efficiency (UD-Q3_K_XL), post-Mar2026 max KLD reductionCompetitive low-bitMar 5 2026 update[2][4]
Standard llama.cppBaseline Q4_K_M etcBeaten by bartowski on Q4_K_M (0.0087 vs unsloth 0.0222)StandardUsed in evals[1]

๐Ÿ› ๏ธ Technical Deep Dive

  • โ€ขArchitecture: Dense Transformer Decoder with 32 layers, hidden dimension 4096, 32 attention heads (16 for QK), gated Delta Networks + sparse MoE for hybrid efficiency[3][4].
  • โ€ขContext: 128K tokens, vocabulary ~150K, supports FP16/INT8/INT4 precisions, consumer GPU compatible (RTX 3060/4060 quantized)[3].
  • โ€ขQuant sensitivity: ffn_up_exps/ffn_gate_exps tolerate 3-bit; attn_* and ssm_out/*beta/alpha highly sensitiveโ€”avoid heavy quant or MXFP4 (worse than Q4_K at 4.5 bits)[2].
  • โ€ขEmbed/output: Some quants (Q3_K_XL, Q4_K_L) use Q8_0 for embeddings/outputs instead of defaults[6].

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

Unsloth UD quants will dominate low-VRAM deployments by Q2 2026
March 2026 updates achieve SOTA across 150+ benchmarks with imatrix and max KLD reductions, outperforming rivals in efficiency metrics[2].
Abliterated Qwen3.5-9B quants remain unsuitable for production below Q8_0
Evaluations show even Q6_K abliterated versions significantly underperform standard Q4_K_L baselines[1].
Hybrid tensor quantization recipes will standardize for Qwen3.5 MoEs
Sensitivity analysis identifies optimal per-tensor bits (e.g., higher for attn/ssm), enabling 20-30% better KLD at low bits via imatrix[2].

โณ Timeline

2026-01
Qwen3.5-9B released by Alibaba Qwen team as open-source dense model with 128K context
2026-02
Unsloth releases initial Qwen3.5-9B GGUF benchmarks and dynamic UD quants
2026-03-05
Unsloth updates quantization for MoEs, reducing max KLD with imatrix enhancements
2026-03-12
Reddit r/LocalLLaMA publishes KLD/PPL rankings of 46 Qwen3.5-9B GGUF quants
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—