Qwen 3.5 Needs bf16 KV Cache

🦙 Read original on Reddit r/LocalLLaMA

💡 Fix llama.cpp accuracy drop for Qwen 3.5: use bf16 KV cache, not f16

⚡ 30-Second TL;DR

What Changed

Default f16 KV cache in llama.cpp degrades Qwen 3.5 perplexity
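
The source does not say why f16 hurts here. As background only, the sketch below contrasts the two 16-bit formats: f16 (IEEE half) has 5 exponent bits and 10 mantissa bits, while bf16 has 8 exponent bits and 7 mantissa bits, so bf16 gives up precision in exchange for a float32-like range. The helper names `to_f16` and `to_bf16` are hypothetical, and the bf16 rounding is emulated in pure Python; none of this is taken from llama.cpp.

```python
import struct


def to_f16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the f16 range (max finite is 65504).
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of a float32 (round to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Range: a value above 65504 overflows f16 but stays finite in bf16.
print(to_f16(70000.0))   # inf
print(to_bf16(70000.0))  # 70144.0

# Precision: near 1.0, f16 is the more precise of the two.
print(to_f16(1.001))     # 1.0009765625
print(to_bf16(1.001))    # 1.0
```

Whether range, precision, or something else explains the perplexity gap for Qwen 3.5 is not established by the source; the sketch only shows that the two formats are not interchangeable.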

Why It Matters

Prevents subtle accuracy loss in local runs, ensuring benchmarks align with official implementations.

What To Do Next

Add -ctk bf16 -ctv bf16 flags when running Qwen 3.5 in llama.cpp.
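
As a minimal sketch of the step above, the helper below assembles a llama.cpp command line with both cache-type flags set. The function name `build_llama_cmd`, the model filename, and the prompt are placeholders chosen for illustration; only the `-ctk` and `-ctv` flags come from the source, and `llama-cli`, `-m`, and `-p` are assumed to match your local llama.cpp build.

```python
import shlex


def build_llama_cmd(model_path: str, prompt: str,
                    binary: str = "llama-cli", kv_type: str = "bf16") -> list[str]:
    """Return an argv list that sets the K and V cache types explicitly."""
    return [
        binary,
        "-m", model_path,   # path to a GGUF model (placeholder)
        "-p", prompt,       # prompt text (placeholder)
        "-ctk", kv_type,    # K cache type
        "-ctv", kv_type,    # V cache type
    ]


cmd = build_llama_cmd("qwen3.5.gguf", "Hello")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd)` once the binary and model path point at real files.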

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • The Qwen3.5 series includes large MoE variants such as 397B-A17B, which activates about 4.3% of its parameters per token (17B of 397B) and is natively multimodal; its KV cache costs ~31KB per token in BF16 at up to 262K context[4][5].
  • Qwen3.5-122B-A10B can be quantized to NVFP4 with llm-compressor (75.6GB, down from 234GB in BF16), which fits on a 128GB system and leaves about 52GB of headroom for KV cache[2].
  • The official Qwen repository lists BF16 models as the default for inference and documents KV cache quantization as a way to run longer contexts without OOM errors[8].
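
The NVFP4 numbers quoted above can be checked with simple arithmetic. This is only a sanity check on the cited figures; the variable names are made up for illustration, and the headroom ignores activations, runtime buffers, and OS overhead.

```python
# Figures quoted for Qwen3.5-122B-A10B, in GB.
bf16_size_gb = 234.0
nvfp4_size_gb = 75.6
system_memory_gb = 128.0

compression_ratio = bf16_size_gb / nvfp4_size_gb
kv_headroom_gb = system_memory_gb - nvfp4_size_gb

print(f"compression: {compression_ratio:.1f}x")
print(f"headroom:    {kv_headroom_gb:.1f} GB")
```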

🛠️ Technical Deep Dive

  • Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model that activates ~4.3% of its parameters per token (17B of 397B). Its small number of KV heads keeps KV cache overhead low: ~31KB/token in BF16, which works out to roughly 8GB at 262K context[4].
  • NVFP4 quantization of Qwen3.5-122B-A10B stores weights as 4-bit floating point with FP8 per-group scales (group size 16), giving 3.1x compression via vllm-project/llm-compressor and compressed-tensors[2].
  • vLLM's KV cache for Qwen3-VL uses PagedAttention, which manages memory in non-contiguous blocks and so limits fragmentation as the cache grows, for example to 4096 tokens[6].
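
A rough sizing sketch for the figures in this list, under stated assumptions: "KB" is read as 1024 bytes, "262K" as 262,144 tokens, and the BF16 baseline as 16 bits per weight. The function names are hypothetical. The ideal NVFP4 ratio it prints ignores any tensors left in higher precision, which is one possible reason the quoted 3.1x is lower.

```python
def kv_cache_gib(kib_per_token: float, context_tokens: int) -> float:
    """Total KV cache size in GiB for a single sequence."""
    return kib_per_token * 1024 * context_tokens / 2**30


def nvfp4_bits_per_weight(weight_bits: int = 4, scale_bits: int = 8,
                          group_size: int = 16) -> float:
    """Effective storage per weight: the weight itself plus its share of the group scale."""
    return weight_bits + scale_bits / group_size


# ~31KB per token at a 262,144-token context.
print(kv_cache_gib(31, 262_144))        # 7.75

# 4-bit weights plus one 8-bit scale shared by 16 weights.
bits = nvfp4_bits_per_weight()
print(bits)                             # 4.5
print(f"ideal ratio vs 16-bit: {16 / bits:.2f}x")
```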

🔮 Future Implications
AI analysis grounded in cited sources

llama.cpp may make bf16 the default KV cache type for Qwen3.5 by mid-2026
The need to pass -ctk bf16 -ctv bf16 by hand highlights a gap versus vLLM's defaults, which is likely to push a fix upstream as Qwen3.5 adoption grows[1][7].
KV cache quantization will become standard for Qwen3.5 MoE models on consumer GPUs
Quantized variants like NVFP4 free 50+GB for KV cache on limited hardware while staying within 1-3% of official BF16 benchmark scores[2][8].

⏳ Timeline

2026-02
Qwen3.5 series released by Alibaba Cloud, introducing MoE models like 397B-A17B and native multimodality
2026-02-16
Official Qwen3.5 announcement with BF16 inference benchmarks and KV cache quantization support

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA