Reddit r/LocalLLaMA
Qwen 3.5 Needs bf16 KV Cache
Fix llama.cpp accuracy drop for Qwen 3.5: use bf16 KV cache, not f16
30-Second TL;DR
What Changed
Default f16 KV cache in llama.cpp degrades Qwen 3.5 perplexity
Why It Matters
Prevents subtle accuracy loss in local runs, ensuring benchmarks align with official implementations.
What To Do Next
Add -ctk bf16 -ctv bf16 flags when running Qwen 3.5 in llama.cpp.
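As a minimal sketch of the step above, the snippet below assembles a llama.cpp command line with both cache-type flags set to bf16. Only the `-ctk bf16 -ctv bf16` flags come from the post; the binary name `llama-cli`, the model filename, and the helper `build_llama_command` are illustrative assumptions.

```python
import shlex


def build_llama_command(model_path, prompt, binary="llama-cli", kv_type="bf16"):
    """Return an argv list that runs llama.cpp with K and V caches in kv_type.

    The binary name and model path are placeholders (assumptions); the
    -ctk/-ctv flags are the ones recommended in the post.
    """
    return [
        binary,
        "-m", model_path,
        "-ctk", kv_type,  # K cache type
        "-ctv", kv_type,  # V cache type
        "-p", prompt,
    ]


# Hypothetical model file, used only for illustration.
cmd = build_llama_command("qwen3.5.gguf", "Hello")
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` as-is. Building a second command with `kv_type="f16"` gives a like-for-like baseline for comparing outputs.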
Who should care: Developers & AI Engineers
Deep Insight
Web-grounded analysis with 8 cited sources.
Enhanced Key Takeaways
- The Qwen3.5 series includes large MoE variants such as 397B-A17B, which activates about 4.3% of its parameters per token and is natively multimodal; its KV cache costs ~31KB per token in BF16 at contexts up to 262K[4][5].
- Qwen3.5-122B-A10B can be quantized to NVFP4 with llm-compressor, shrinking it from 234GB in BF16 to 75.6GB, which fits on a 128GB system and leaves about 52GB for the KV cache[2].
- The official Qwen repository lists BF16 models as the default for inference and explicitly supports KV cache quantization to reach longer context lengths without out-of-memory (OOM) errors[8].
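The memory figures in these takeaways can be sanity-checked with simple arithmetic. The sketch below assumes "31KB" means 31,000 bytes and "262K" means 262,144 tokens; both readings are assumptions, as are the helper names. Taking the quoted per-token figure at face value, a full context multiplies out to about 8GB, and a cache stored at half the bytes per element would need about half that.

```python
def kv_cache_gb(bytes_per_token, context_tokens):
    """Total KV cache size in decimal gigabytes for a full context."""
    return bytes_per_token * context_tokens / 1e9


def headroom_gb(system_gb, weights_gb):
    """Memory left for KV cache and activations after loading weights."""
    return system_gb - weights_gb


# Assumptions: "31KB" read as 31,000 bytes, "262K" read as 262,144 tokens.
full_context = kv_cache_gb(31_000, 262_144)
# Sizes quoted for Qwen3.5-122B-A10B in NVFP4 on a 128GB system.
spare = headroom_gb(128.0, 75.6)

print(f"KV cache at full context: {full_context:.1f} GB")
print(f"Headroom after NVFP4 weights: {spare:.1f} GB")
```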
Technical Deep Dive
- Qwen3.5-397B-A17B is a Mixture-of-Experts (MoE) model that activates ~4.3% of its parameters per token (17B of 397B). Its reduced KV head count keeps KV cache overhead low: ~31KB/token in BF16, which multiplies out to roughly 8GB at a 262K context[4].
- NVFP4 quantization of Qwen3.5-122B-A10B stores weights as 4-bit floating point with one FP8 scale per group of 16, giving 3.1x compression via vllm-project/llm-compressor and compressed-tensors[2].
- vLLM's KV cache for Qwen3-VL uses PagedAttention to manage memory in non-contiguous blocks, which limits fragmentation as the cache grows (e.g., at a 4096-token context)[6].
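The NVFP4 layout described above implies a storage cost per weight that can be worked out directly. This is a rough sketch, not llm-compressor's own accounting: the helper name is hypothetical, and it counts only the 4-bit weights and the per-group FP8 scales. The gap between the ideal ratio and the reported one presumably comes from tensors kept at higher precision, which is an assumption rather than something the sources state.

```python
def effective_bits_per_weight(weight_bits, scale_bits, group_size):
    """Average storage per weight when each group shares one scale."""
    return weight_bits + scale_bits / group_size


# 4-bit weights, one 8-bit scale per group of 16 (figures from the summary).
bits = effective_bits_per_weight(weight_bits=4, scale_bits=8, group_size=16)
ideal_ratio = 16 / bits       # compared with 16-bit BF16 weights
reported_ratio = 234 / 75.6   # checkpoint sizes quoted in the summary

print(f"{bits} bits/weight, ideal {ideal_ratio:.2f}x, reported {reported_ratio:.2f}x")
```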
Future Implications
AI analysis grounded in cited sources
llama.cpp will add native bf16 KV cache support for Qwen3.5 by mid-2026
Timeline
2026-02
Qwen3.5 series released by Alibaba Cloud, introducing MoE models like 397B-A17B and native multimodality
2026-02-16
Official Qwen3.5 announcement with BF16 inference benchmarks and KV cache quantization support
Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: Reddit r/LocalLLaMA