Qwen3.5 27B Costs 0.83€/1M Output Tokens

💡Exact € costs for Qwen3.5 27B local runs: cheap input, pricey output on 3090—plan your infra.

⚡ 30-Second TL;DR

What Changed

Input uncached: 0.026€ per 1M tokens

Why It Matters

Provides concrete cost benchmarks for local LLM inference on consumer hardware, aiding budget planning.

What To Do Next

Benchmark your Qwen3.5 27B costs with vLLM on similar GPUs using their Python script.

Who should care:Developers & AI Engineers

AI-generated analysis for this event.

•Qwen3.5 27B utilizes a Mixture-of-Experts (MoE) architecture, which significantly optimizes inference costs by activating only a subset of parameters per token, explaining the high throughput relative to power consumption.
•The 0.30€/kWh electricity rate used in the analysis is representative of current average industrial/commercial energy costs in parts of the EU, highlighting the economic viability of self-hosting versus cloud API usage for high-volume workloads.
•The performance disparity between prompt processing (1691 TPS) and generation (53.8 TPS) is characteristic of the memory-bandwidth-bound nature of autoregressive decoding in LLMs, even when utilizing dual-GPU setups.

📊 Competitor Analysis▸ Show

Model	Architecture	Est. Efficiency (Tokens/kWh)	Primary Use Case
Qwen3.5 27B	MoE	High	Local/On-prem Inference
Llama 3.3 70B	Dense	Moderate	Enterprise RAG/Reasoning
Mistral Large 2	Dense	Moderate	Cloud API/High-end Tasks

•Model Architecture: Qwen3.5 27B employs a sparse Mixture-of-Experts (MoE) design, allowing for lower compute requirements per token compared to dense models of similar total parameter counts.
•Inference Engine: The user utilized vLLM, which leverages PagedAttention to optimize KV cache memory management, significantly reducing memory fragmentation and increasing throughput.
•Hardware Configuration: The setup uses a heterogeneous GPU configuration (RTX 3090 + RTX Pro 4000), suggesting the use of model parallelism (likely tensor parallelism) to distribute the model weights across disparate VRAM capacities.
•Power Profile: The 535W draw represents the combined TDP/actual load of the dual-GPU system, including overhead from the host system, which is a critical factor in calculating the 'cost-per-token' metric.

Self-hosted MoE models will become the standard for cost-sensitive enterprise applications.

The ability to achieve sub-1€ per million tokens through hardware optimization significantly undercuts commercial API pricing models.

Hardware-specific optimization will shift from general-purpose to model-architecture-aware scheduling.

As MoE models become more prevalent, inference engines will increasingly prioritize intelligent expert-routing to maximize GPU utilization.

2025-09

Alibaba Cloud releases the Qwen3 series, introducing advanced MoE architectures.

2026-01

Qwen3.5 update released, featuring improved reasoning capabilities and optimized inference efficiency.

Weekly AI Recap

Read this week's curated digest of top AI events →

Same topic

Explore #llm-costs

Same product