Qwen3.5-27B FP8 vs BF16 Benchmarks

💡 FP8 quantization matches BF16 on coding benchmarks, with large VRAM savings for 27B models!
⚡ 30-Second TL;DR
What Changed
Ten runs of the Aider benchmark across BF16/FP8 weight and KV-cache combinations
Why It Matters
FP8 quantization cuts VRAM use for local coding agents without a measurable quality drop, which matters for resource-constrained setups running full 27B models.
What To Do Next
Benchmark Qwen3.5-27B FP8 on vLLM with Aider for your coding agent setup.
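As a starting point, a minimal vLLM invocation might look like the sketch below. The model ID `Qwen/Qwen3.5-27B-FP8` is an assumption here (verify the actual Hugging Face repo name); `kv_cache_dtype="fp8"` additionally quantizes the KV cache, matching the weight/KV-cache combinations the benchmark varies. This is a configuration sketch, not a tuned setup, and requires a GPU with sufficient VRAM.

```python
# Sketch: loading an FP8 checkpoint with vLLM's offline API.
# "Qwen/Qwen3.5-27B-FP8" is an assumed model ID -- verify before use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B-FP8",  # assumed Hugging Face repo name
    kv_cache_dtype="fp8",          # FP8 KV cache (use "auto" for BF16 KV)
    max_model_len=32768,           # 32K context, as in the chatbot tests
)
outputs = llm.generate(
    ["Write a Python function that reverses a string."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

From there, pointing Aider's benchmark harness at the resulting OpenAI-compatible endpoint reproduces the setup described in this post.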
🧠 Deep Insight
Web-grounded analysis with 8 cited sources.
🔑 Enhanced Key Takeaways
- Qwen3.5-27B-FP8 uses a hybrid Gated Delta Network + Gated Attention architecture with 64 layers and a hidden dimension of 5120, supporting 201 languages, tool calling, and agentic workflows.[2]
- On an H100 SXM GPU, Qwen3.5-27B-FP8 reaches 312 tokens/s peak throughput and supports 6 concurrent chatbot users at 32K context.[2]
- Official benchmarks show FP8 quantization reduces GPU memory usage versus BF16 (e.g., Qwen3-1.7B: 2726 MB vs 3412 MB at short input lengths) while maintaining competitive speed.[1]
- Qwen3.5-27B scores 86.1 on MMLU-Pro, 72.4 on SWE-bench Verified, and 82.3 on MMMU, competitive with larger models such as the 235B Qwen3.[2]
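To put the memory claim in perspective, here is a back-of-envelope weight-footprint calculation. It is illustrative only: it counts weight bytes alone, real deployments also need KV cache and activations, and FP8 checkpoints often keep some layers in 16-bit, so actual footprints are somewhat higher.

```python
# Back-of-envelope VRAM estimate for weights only:
# FP8 stores 1 byte/param, BF16 stores 2 bytes/param.
PARAMS = 27e9  # 27B parameters

bf16_gb = PARAMS * 2 / 1e9  # BF16 weight bytes, in GB
fp8_gb = PARAMS * 1 / 1e9   # FP8 weight bytes, in GB

print(f"BF16 weights: {bf16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
print(f"Savings: {bf16_gb - fp8_gb:.0f} GB ({1 - fp8_gb / bf16_gb:.0%})")
```

The ~50% weight saving is what lets a 27B model fit on a single consumer or workstation GPU where the BF16 checkpoint would not.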
🛠️ Technical Deep Dive
- The FP8-quantized Qwen3.5-27B is multimodal (image-text-to-text), with shared expert and attention layers kept in 16-bit precision.[2][4]
- On an RTX Pro 6000 Blackwell (96GB VRAM), it supports 1-4 concurrency and 1K-256K context, handling 3 users at 32K context with TTFT <12s and >15 tok/s generation.[2]
- Qwen's official GPTQ INT4 keeps attention layers unquantized, yielding a memory footprint nearly identical to FP8 (30.9GB vs 30.3GB for the 27B model).[4]
- Unsloth Dynamic quants (e.g., UD-Q4_K_XL) show perplexity close to BF16 with speed gains in GGUF benchmarks.[6]
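The concurrency limits above are largely driven by KV-cache size. A rough estimate using the cited architecture numbers (64 layers, 5120 hidden dim) is sketched below. It assumes keys and values are stored at full hidden width, which overestimates for GQA and for hybrid Gated Delta Network layers (which reduce or eliminate per-token KV state), but it shows why FP8 KV cache roughly doubles the number of users a card can hold.

```python
# Rough per-user KV-cache sizing at 32K context (illustrative only;
# assumes full-width K/V per layer, an overestimate for this model).
LAYERS, HIDDEN, CONTEXT = 64, 5120, 32 * 1024

def kv_cache_gb(bytes_per_value: float) -> float:
    # 2 tensors (K and V) per layer, per token in the context window
    return 2 * LAYERS * HIDDEN * CONTEXT * bytes_per_value / 1e9

print(f"BF16 KV @32K: {kv_cache_gb(2):.1f} GB per user")
print(f"FP8  KV @32K: {kv_cache_gb(1):.1f} GB per user")
```

Even as an upper bound, the arithmetic makes clear that at long contexts the KV cache, not the weights, is the binding constraint on concurrency.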
🔮 Future Implications
AI analysis grounded in cited sources.
📎 Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA