Qwen 3.5 Needs bf16 KV Cache

🦙Read original on Reddit r/LocalLLaMA

#kv-cache #quantization #perplexity #llamacpp #qwen-3.5

💡Fix llama.cpp accuracy drop for Qwen 3.5: use bf16 KV cache, not f16

⚡ 30-Second TL;DR

What Changed

Default f16 KV cache in llama.cpp degrades Qwen 3.5 perplexity

Why It Matters

Prevents subtle accuracy loss in local runs, ensuring benchmarks align with official implementations.

What To Do Next

Add -ctk bf16 -ctv bf16 flags when running Qwen 3.5 in llama.cpp.
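
As a minimal sketch of that step, the snippet below assembles a llama.cpp command line with both cache flags set to bf16. The binary name (`llama-perplexity`), the model file, and the evaluation text file are assumptions chosen for illustration; only the `-ctk bf16 -ctv bf16` flags come from the original post.

```python
import shlex


def build_llama_command(model_path, prompt_file, binary="llama-perplexity"):
    """Return an argv list that runs llama.cpp with a bf16 KV cache.

    The binary and file names are illustrative assumptions; swap in your own.
    """
    return [
        binary,
        "-m", model_path,   # GGUF model file (hypothetical path)
        "-f", prompt_file,  # text file to evaluate (hypothetical path)
        "-ctk", "bf16",     # K cache type: bf16 instead of the f16 default
        "-ctv", "bf16",     # V cache type: bf16 instead of the f16 default
    ]


cmd = build_llama_command("qwen3.5.gguf", "wiki.test.raw")
print(shlex.join(cmd))
```

Running the same command with and without the two cache flags is one way to compare perplexity between the f16 and bf16 settings on your own hardware.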

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

•The Qwen3.5 series includes large MoE variants such as 397B-A17B, which activates roughly 4.3% of its parameters per token (17B of 397B) and is natively multimodal. Its KV cache footprint is ~31 KB per token in BF16 at context lengths up to 262K[4][5].
•Qwen3.5-122B-A10B can be quantized to NVFP4 with llm-compressor, shrinking it from 234 GB in BF16 to 75.6 GB. That allows deployment on 128 GB systems with about 52 GB of headroom left for the KV cache[2].
•The official Qwen repository uses BF16 models as the default for inference and describes KV cache quantization as a way to support larger context lengths without out-of-memory (OOM) errors[8].

🛠️ Technical Deep Dive

•Qwen3.5-397B-A17B uses a Mixture-of-Experts (MoE) architecture in which about 4.3% of parameters are active per token (17B of 397B). A reduced number of KV heads keeps KV cache overhead low: ~31 KB/token in BF16, which comes to roughly 8 GB at the full 262K context[4].
•NVFP4 quantization for Qwen3.5-122B-A10B stores weights as 4-bit floating point with FP8 per-group scales (group size 16), achieving about 3.1x compression via vllm-project/llm-compressor and compressed-tensors[2].
•The KV cache in vLLM for Qwen3-VL uses PagedAttention to manage memory in non-contiguous blocks, minimizing fragmentation as the cache grows, including at context lengths such as 4096 tokens[6].
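
The memory figures above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and reads "262K" as 262,144 tokens; both are assumptions rather than details stated in the sources. Under them, 31 KB per token across the full context totals about 8.1 GB.

```python
KB = 1_000           # assumption: decimal kilobytes
GB = 1_000_000_000   # assumption: decimal gigabytes


def kv_cache_gb(bytes_per_token, context_tokens):
    """Total KV cache size in GB for a given per-token cost and context length."""
    return bytes_per_token * context_tokens / GB


# ~31 KB/token in BF16, 262K context read as 262,144 tokens (assumption)
kv_total_gb = kv_cache_gb(31 * KB, 262_144)

# NVFP4 figures quoted for Qwen3.5-122B-A10B: 234 GB in BF16, 75.6 GB quantized
compression_ratio = 234 / 75.6
headroom_gb = 128 - 75.6  # memory left on a 128 GB system after loading weights

print(f"KV cache at full context: {kv_total_gb:.1f} GB")
print(f"NVFP4 compression ratio: {compression_ratio:.1f}x")
print(f"Headroom on a 128 GB system: {headroom_gb:.1f} GB")
```

The compression ratio and headroom reproduce the quoted ~3.1x and ~52 GB values.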

🔮 Future Implications

AI analysis grounded in cited sources

llama.cpp may adopt bf16 as the default KV cache type for Qwen3.5 by mid-2026

Having to pass -ctk bf16 -ctv bf16 by hand highlights a gap versus vLLM's defaults, which could drive an upstream change of default as Qwen3.5 adoption grows[1][7].

KV cache quantization will become standard for Qwen3.5 MoE models on consumer GPUs

Weight-quantized variants such as NVFP4 free 50+ GB for the KV cache on limited hardware while staying within 1-3% of official BF16 benchmark scores[2][8].

⏳ Timeline

2026-02

Qwen3.5 series released by Alibaba Cloud, introducing MoE models like 397B-A17B and native multimodality

2026-02-16

Official Qwen3.5 announcement with BF16 inference benchmarks and KV cache quantization support

📎 Sources (8)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
