
Krasis Hits 3324 tok/s Prefill on RTX 5080


💡 New runtime runs 80B MoE at 3k+ tok/s prefill on one RTX 5080

⚡ 30-Second TL;DR

What Changed

GPU prefill at 3,324 tok/s on RTX 5080 for 80B MoE

Why It Matters

Enables practical local inference of very large MoE models on consumer GPUs, sharply reducing prefill times for IDE and tool-calling workloads.
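At the reported rate, prefill latency is easy to estimate. A quick sketch (the 3,324 tok/s figure is from the post; the prompt sizes are hypothetical examples):

```python
PREFILL_TOK_S = 3324  # reported Krasis prefill throughput on an RTX 5080

def prefill_seconds(prompt_tokens: int, tok_per_s: float = PREFILL_TOK_S) -> float:
    """Time to process the whole prompt before the first output token."""
    return prompt_tokens / tok_per_s

# Typical IDE-sized and long-context prompts (illustrative values only).
for prompt_tokens in (4_096, 32_768):
    print(f"{prompt_tokens:>6}-token prompt -> ~{prefill_seconds(prompt_tokens):.1f} s prefill")
```

Even a 32k-token context clears prefill in roughly ten seconds at that rate, which is what makes IDE-style use plausible on a single consumer card.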

What To Do Next

Build Krasis from source and benchmark Qwen3-Coder-Next at Q4 on your NVIDIA GPU.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 7 cited sources.

🔑 Enhanced Key Takeaways

  • RTX 5080 uses the Blackwell architecture with 10,752 CUDA cores and 16GB of GDDR7 VRAM, enabling high AI inference speeds but restricting it to smaller or quantized models due to VRAM constraints.[1][2]
  • In general LLM benchmarks the RTX 5080 reaches around 135 tok/s with two loaded models and 26.1 tok/s in specific Vulkan tests; note these are decode figures, a different metric from Krasis's 3,324 tok/s prefill.[4]
  • RTX 5080 outperforms the RTX 6000 Ada in Mistral (4,635 vs 4,255) and Llama2 (4,790 vs 3,957) tests but trails the RTX 5090 and RTX 4090 in most AI workloads.[1]

🔮 Future Implications
AI analysis grounded in cited sources

  • Krasis enables consumer GPUs to run 80B MoE models beyond VRAM limits: by offloading decode to the CPU and requiring roughly 2.5× the model size in system RAM, it leverages large system memory to bypass the RTX 5080's 16GB VRAM constraint.
  • Hybrid CPU/GPU execution will drive local inference of massive models on desktops: the RTX 5080's Blackwell Tensor Cores push prefill to 3,324 tok/s, making high-end PCs viable alternatives to data-center hardware for MoE models.
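The memory budget behind the hybrid split can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming Q4 quantization is roughly 0.5 bytes per parameter and using the ~2.5× system-RAM multiplier reported in the post (neither figure is from official Krasis documentation):

```python
def estimate_memory_gb(n_params: float, bytes_per_param: float = 0.5,
                       ram_multiplier: float = 2.5) -> tuple[float, float]:
    """Return (model_size_gb, system_ram_gb) for a quantized model.

    bytes_per_param=0.5 approximates a 4-bit (Q4) quantization;
    ram_multiplier is the post's reported ~2.5x model-size RAM requirement.
    """
    model_gb = n_params * bytes_per_param / 1e9
    return model_gb, model_gb * ram_multiplier

model_gb, ram_gb = estimate_memory_gb(80e9)  # 80B-parameter MoE
print(f"Q4 weights: ~{model_gb:.0f} GB; system RAM needed: ~{ram_gb:.0f} GB")
```

So an 80B model at Q4 is around 40 GB of weights and implies on the order of 100 GB of system RAM, which explains why the approach targets high-memory desktops rather than VRAM alone.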

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA