
Qwen3.5-27B FP8 vs BF16 Benchmarks


💡 FP8 quantization matches BF16 on coding benchmarks, delivering large VRAM savings for 27B models.

⚡ 30-Second TL;DR

What Changed

Ten runs of the Aider coding benchmark across BF16/FP8 weight and KV-cache combinations.
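The exact run matrix is not spelled out in the summary, so the grid below is an assumption: a minimal sketch of the weight-precision x KV-cache-precision combinations implied by that description.

```python
from itertools import product

# Assumed 2x2 matrix of weight precision x KV-cache precision;
# each combination would be fed through the Aider benchmark runs.
weight_dtypes = ["bf16", "fp8"]
kv_cache_dtypes = ["bf16", "fp8"]

configs = [{"weights": w, "kv_cache": kv} for w, kv in product(weight_dtypes, kv_cache_dtypes)]
for cfg in configs:
    print(cfg)
```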

Why It Matters

FP8 quantization enables VRAM savings for local coding agents without a quality drop. This is useful for resource-constrained setups running full 27B models.

What To Do Next

Benchmark Qwen3.5-27B FP8 on vLLM with Aider for your coding agent setup.
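A minimal sketch of loading an FP8 checkpoint with vLLM's offline Python API, assuming a hypothetical Hugging Face repo id `Qwen/Qwen3.5-27B-FP8` and a GPU with FP8 support; for the Aider setup described in the post you would instead expose the model through vLLM's OpenAI-compatible server and point Aider at that endpoint.

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id; substitute the actual FP8 checkpoint name.
llm = LLM(
    model="Qwen/Qwen3.5-27B-FP8",
    kv_cache_dtype="fp8",    # quantize the KV cache as well, matching the benchmark combos
    max_model_len=32768,     # 32K context, in line with the cited concurrency figures
)

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Write a Python function that parses a CSV header line."], params)
print(outputs[0].outputs[0].text)
```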

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • Qwen3.5-27B-FP8 features a Gated Delta Network + Gated Attention hybrid architecture with 64 layers and a hidden dimension of 5120, supporting 201 languages, tool calling, and agentic workflows.[2]
  • On H100 SXM GPU, Qwen3.5-27B-FP8 achieves 312 tokens/s peak throughput and supports 6 concurrent chatbot users at 32K context.[2]
  • Official benchmarks show FP8 quantization reduces GPU memory usage compared to BF16 (e.g., Qwen3-1.7B: 2726 MB vs 3412 MB at short input) while maintaining competitive speed.[1] Rough weight-memory arithmetic for a 27B model is sketched after this list.
  • Qwen3.5-27B scores 86.1 on MMLU-Pro, 72.4 on SWE-bench Verified, 82.3 on MMMU, competitive with larger models like 235B Qwen3.[2]
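As a rough illustration of where the VRAM savings come from, a back-of-the-envelope estimate of weight memory for a 27B-parameter model; this ignores the KV cache, activations, and any layers kept in 16-bit, so real footprints (such as the ~30 GB figures cited below) will differ.

```python
# Back-of-the-envelope weight memory for a 27B-parameter model.
num_params = 27e9
bf16_gb = num_params * 2 / 1e9   # BF16: 2 bytes per parameter -> ~54 GB
fp8_gb = num_params * 1 / 1e9    # FP8:  1 byte per parameter  -> ~27 GB
print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB, saving ~{1 - fp8_gb / bf16_gb:.0%}")
```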

🛠️ Technical Deep Dive

  • FP8-quantized Qwen3.5-27B is multimodal (image-text-to-text) with shared expert and attention layers kept in 16-bit precision.[2][4]
  • On RTX Pro 6000 Blackwell (96GB VRAM), supports 1-4 concurrency, 1K-256K context, 3 users at 32K context with TTFT <12s and >15 tok/s generation.[2]
  • Qwen’s official GPTQ INT4 keeps attention layers unquantized, yielding a memory footprint nearly identical to FP8’s (30.9 GB vs 30.3 GB for 27B).[4]
  • Unsloth Dynamic quants (e.g., UD-Q4_K_XL) show perplexity close to BF16, with speed gains on GGUF benchmarks; the perplexity metric is sketched after this list.[6]
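The GGUF comparisons in the last point use perplexity as the quality metric. A minimal sketch of how perplexity is computed from per-token log-likelihoods (natural log), independent of any particular runtime:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log likelihoods of a held-out text."""
    avg_neg_ll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_ll)

# Toy illustration: a quantized model whose per-token log-probs track the
# BF16 reference closely lands at nearly the same perplexity.
print(perplexity([-1.2, -0.8, -2.0, -0.5]))  # ~3.08
```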

🔮 Future Implications

AI analysis grounded in cited sources.

  • FP8 will become standard for 27B-scale agentic models on consumer GPUs: benchmarks confirm negligible accuracy loss versus BF16 at reduced memory, enabling broader local deployment as shown on the RTX 6000.[1][2]
  • INT4 quantization outperforms FP8 in memory efficiency for Qwen3.5-27B: official INT4 matches FP8 performance but uses less memory due to unquantized sensitive layers, per quantization studies.[4]
  • Qwen3.5-27B-FP8 supports up to 256K context on a single H100: Millstone benchmarks validate high concurrency and throughput at extended contexts, ideal for agentic coding.[2]

Timeline

  • 2025-09: Qwen3.5 series released, including a 27B dense model supporting multimodal input and 201 languages.
  • 2025-10: Official FP8 and GPTQ INT4 quantized models released by the Qwen team.
  • 2025-11: Speed benchmarks for the Qwen3 series (BF16/FP8) published on Transformers.
  • 2026-01: Millstone AI publishes FP8 inference benchmarks on H100 and RTX Pro 6000.
  • 2026-02: Unsloth releases GGUF benchmarks for Qwen3.5 dynamic quants.
  • 2026-03: Reddit Aider benchmark shows FP8 matching BF16 for Qwen3.5-27B agentic coding.

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA