
Krasis Hits 3324 tok/s Prefill on RTX 5080


💡 New runtime runs 80B MoE at 3k+ tok/s prefill on one RTX 5080

⚡ 30-Second TL;DR

What Changed

GPU prefill at 3,324 tok/s on RTX 5080 for 80B MoE

Why It Matters

Enables practical local inference of very large MoE models on consumer GPUs, sharply reducing prefill times for IDE and tool-calling workloads.
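At the reported rate, prefill latency is easy to estimate. A quick sketch (the 3,324 tok/s figure is from the post; the prompt sizes are hypothetical examples):

```python
PREFILL_TOK_S = 3324  # reported Krasis prefill throughput on an RTX 5080

def prefill_seconds(prompt_tokens: int, tok_per_s: float = PREFILL_TOK_S) -> float:
    """Time to process the whole prompt before the first output token."""
    return prompt_tokens / tok_per_s

# Typical IDE-sized and long-context prompts (illustrative values only).
for prompt_tokens in (4_096, 32_768):
    print(f"{prompt_tokens:>6}-token prompt -> ~{prefill_seconds(prompt_tokens):.1f} s prefill")
```

Even a 32k-token context clears prefill in roughly ten seconds at that rate, which is what makes IDE-style use plausible on a single consumer card.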

What To Do Next

Build Krasis from source and benchmark Qwen3-Coder-Next at Q4 on your NVIDIA GPU.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 7 cited sources.

🔑 Enhanced Key Takeaways

  • RTX 5080 uses the Blackwell architecture with 10,752 CUDA cores and 16GB of GDDR7 VRAM, enabling high AI inference speeds but restricting it to smaller or quantized models due to VRAM constraints.[1][2]
  • In general LLM benchmarks the RTX 5080 reaches around 135 tok/s with two loaded models and 26.1 tok/s in specific Vulkan tests; note these are decode figures, a different metric from Krasis's 3,324 tok/s prefill.[4]
  • RTX 5080 outperforms the RTX 6000 Ada in Mistral (4,635 vs 4,255) and Llama2 (4,790 vs 3,957) tests but trails the RTX 5090 and RTX 4090 in most AI workloads.[1]

🔮 Future Implications
AI analysis grounded in cited sources

  • Krasis enables consumer GPUs to run 80B MoE models beyond VRAM limits: by offloading decode to the CPU and requiring roughly 2.5× the model size in system RAM, it leverages large system memory to bypass the RTX 5080's 16GB VRAM constraint.
  • Hybrid CPU/GPU execution will drive local inference of massive models on desktops: the RTX 5080's Blackwell Tensor Cores push prefill to 3,324 tok/s, making high-end PCs viable alternatives to data-center hardware for MoE models.
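The memory budget behind the hybrid split can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming Q4 quantization is roughly 0.5 bytes per parameter and using the ~2.5× system-RAM multiplier reported in the post (neither figure is from official Krasis documentation):

```python
def estimate_memory_gb(n_params: float, bytes_per_param: float = 0.5,
                       ram_multiplier: float = 2.5) -> tuple[float, float]:
    """Return (model_size_gb, system_ram_gb) for a quantized model.

    bytes_per_param=0.5 approximates a 4-bit (Q4) quantization;
    ram_multiplier is the post's reported ~2.5x model-size RAM requirement.
    """
    model_gb = n_params * bytes_per_param / 1e9
    return model_gb, model_gb * ram_multiplier

model_gb, ram_gb = estimate_memory_gb(80e9)  # 80B-parameter MoE
print(f"Q4 weights: ~{model_gb:.0f} GB; system RAM needed: ~{ram_gb:.0f} GB")
```

So an 80B model at Q4 is around 40 GB of weights and implies on the order of 100 GB of system RAM, which explains why the approach targets high-memory desktops rather than VRAM alone.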

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA