2000 TPS Qwen 3.5 27B on RTX 5090
Unlock ~2,000 TPS of local inference on the RTX 5090 for high-volume document processing (a game-changer for batch jobs)
30-Second TL;DR
What Changed
Achieved ~2000 TPS classifying 320 docs with 1.2M input tokens
Why It Matters
Demonstrates high-throughput local inference for batch document tasks, enabling cost-effective processing on consumer GPUs without cloud dependency.
What To Do Next
Test unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf with llama.cpp server-cuda13 for batch classification jobs.
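A minimal sketch of how such a batch classification job could be driven against a local llama-server instance (the llama.cpp server exposes an OpenAI-compatible /v1/chat/completions endpoint). The port, label set, document corpus, and concurrency level below are illustrative assumptions, not details from the original post.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Assumed local llama-server endpoint (llama.cpp's OpenAI-compatible API).
URL = "http://localhost:8080/v1/chat/completions"

# Placeholder label set -- the original post does not list its categories.
LABELS = ["invoice", "contract", "report", "other"]

def classify(doc_text: str) -> str:
    """Ask the local model to assign exactly one label to a document."""
    payload = {
        "messages": [
            {"role": "system",
             "content": "Classify the document into exactly one of: "
                        + ", ".join(LABELS) + ". Reply with the label only."},
            {"role": "user", "content": doc_text},
        ],
        "temperature": 0.0,  # deterministic output for classification
        "max_tokens": 8,     # a single label needs only a few tokens
    }
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

# Placeholder corpus; the reported run classified 320 documents (~1.2M input tokens).
documents = [f"placeholder document body {i}" for i in range(320)]

# Keep several requests in flight so the server's continuous batching can merge
# them into shared GPU forward passes; 16 workers is an illustrative choice.
with ThreadPoolExecutor(max_workers=16) as pool:
    labels = list(pool.map(classify, documents))

print(labels[:5])
```

In a run like this nearly all tokens are prompt tokens, so the aggregate TPS figure is likely dominated by prefill throughput rather than generation speed.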
Deep Insight
Web-grounded analysis with 7 cited sources.
Enhanced Key Takeaways
- Qwen 3.5 27B achieves 20 tokens/second on an RTX A6000 with standard settings, but optimized configurations on the RTX 5090 with quantization and batch processing can exceed 1,000 tokens/second, demonstrating significant performance variance based on hardware and inference-engine tuning[1][4]
- The Qwen 3.5 series uses a sparse Mixture-of-Experts (MoE) architecture with 397B total parameters but only 17B active per forward pass, enabling frontier-level performance on consumer hardware by reducing computational overhead compared to dense models[1]
- Quantization strategies (Q4, Q5_K_XL, NVFP4) are critical for consumer-GPU deployment; the 27B model requires 20-24GB VRAM at Q4 quantization but can achieve 40+ tokens/second on an RTX A6000 and 500+ tokens/second on an RTX 5090 with optimized inference engines like vLLM[1][4][5]
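As a rough sanity check on the 20-24GB VRAM figure, the quantized weight footprint can be estimated from parameter count times bits per weight; the ~4.8 bits/weight value below is an approximation for Q4_K-style quants, not a number taken from the cited benchmarks.

```python
# Back-of-envelope VRAM estimate for a 27B-parameter model at Q4-style quantization.
params = 27e9
bits_per_weight = 4.8  # approximate effective size of Q4_K-style quants (assumption)

weights_gb = params * bits_per_weight / 8 / 1e9
print(f"Quantized weights: ~{weights_gb:.1f} GB")  # ~16 GB

# KV cache, activation buffers, and the CUDA context add several more GB,
# which is how the total lands in the 20-24 GB range quoted above.
```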
Competitor Analysis
| Feature | Qwen 3.5 27B (RTX 5090) | Qwen 3 32B (RTX 5090) | Qwen 3.5 397B-A17B (Multi-GPU) |
|---|---|---|---|
| Throughput (tok/s) | 1,000-2,000 (optimized) | 500-600 | 25+ (with MoE offloading) |
| VRAM Required (Q4) | 20-24GB | 24GB+ | ~214GB (Q4) |
| Context Length | 128K (tested) | Standard | Up to 262K (tested) |
| Architecture | Sparse MoE | Dense | Sparse MoE (397B/17B active) |
| Inference Engine | llama.cpp, vLLM | vLLM | vLLM, SGLang |
| Release Date | Feb 2026 | Prior generation | Feb 2026 |
Technical Deep Dive
- Quantization Impact: Q5_K_XL quantization on Qwen3.5-27B enables 2,000 TPS on RTX 5090; Q4 quantization reduces VRAM footprint to 20GB while maintaining 40+ tokens/second on A6000[1][4][5]
- Batch Processing: Batch size 8 with a 128K context window on the RTX 5090 sustains throughput across 1.2M+ input tokens without degradation; MCR (Max Concurrent Requests) tuning at 16 yields 1,157 tok/s with sub-second TTFT (956ms)[4] (a quick arithmetic check follows this list)
- Context Window Scaling: RTX 5090 tested up to 122,880 tokens (~120K) with flat throughput (~555 tok/s at 8K vs ~553 tok/s at 65K); OOM occurs at 131,072+ tokens[4]
- Inference Engines: vLLM outperforms SGLang on RTX 5090 (555.82 tok/s vs 207.93 tok/s); PRO 6000 achieves 988 tok/s with 262K context using vLLM[4]
- Vision Module Overhead: Disabling vision loading and thinking modules reduces memory footprint and increases throughput; llama.cpp server-cuda13 supports these optimizations[1]
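A quick arithmetic sanity check on the throughput claims, using only figures quoted in this post (320 documents, ~1.2M input tokens, ~2,000 TPS aggregate, 1,157 tok/s at MCR 16):

```python
# All inputs below are figures quoted in the text above, not new measurements.
docs = 320
input_tokens = 1_200_000   # total input tokens across the batch
reported_tps = 2_000       # aggregate throughput claimed for the RTX 5090 run
mcr16_tps = 1_157          # throughput reported at 16 concurrent requests

print(f"Average document size: ~{input_tokens / docs:,.0f} tokens")                      # ~3,750
print(f"Wall clock at {reported_tps} TPS: ~{input_tokens / reported_tps / 60:.0f} min")  # ~10 min
print(f"Wall clock at {mcr16_tps} tok/s: ~{input_tokens / mcr16_tps / 60:.0f} min")      # ~17 min
```

In other words, the reported configuration would chew through the whole 320-document batch in roughly ten minutes of wall-clock time.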
Sources (7)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: Reddit r/LocalLLaMA