
DeepSeek's DualPath Boosts Agent LLM Inference 1.9x


💡1.9x throughput for agent LLMs via KV-Cache bandwidth pooling

⚡ 30-Second TL;DR

What Changed

Introduces DualPath for agentic LLM workloads with 'long context, short append' characteristics

Why It Matters

DualPath enables scalable multi-turn agent deployments by breaking per-node I/O limits, potentially cutting inference costs for production AI agents.

What To Do Next

Download the DualPath paper from arXiv and prototype the dual-path KV-Cache loading in your LLM serving cluster.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 7 cited sources.

🔑 Enhanced Key Takeaways

  • DualPath was tested on DeepSeek-V3 (660B scale) and Qwen models in both offline rollout and online service scenarios, significantly reducing Time To First Token (TTFT) under high load while keeping Time Between Tokens (TBT) stable[1][2][3].
  • The system uses 80GB DRAM per node for DeepSeek models and is implemented on an in-house framework with ~5K lines of code changes, incorporating FlashMLA, DeepGEMM, and DeepEP kernels, 3FS storage, and io_uring-like kernel bypass[3].
  • Paper submitted to arXiv on February 25, 2026, by authors from DeepSeek, Tsinghua, and Peking University, targeting disaggregated architectures with production agentic workloads featuring high KV-Cache reuse[2].

🛠️ Technical Deep Dive

  • Implements dual-path KV-Cache loading: traditional storage-to-prefill plus novel storage-to-decode path with RDMA transfer over compute network to avoid congestion and interference[2][3].
  • Global scheduler dynamically balances load across prefill and decode engines; uses adaptive scheduling and strict traffic isolation[1][2].
  • Built on in-house inference stack with FlashMLA (Li and Liu, 2025), DeepGEMM (DeepSeek-AI, 2025b), DeepEP (Zhao et al., 2025b) CUDA kernels; ~5K LOC modifications[3].
  • Storage via 3FS (DeepSeek-AI, 2025a) with io_uring-like interface for kernel bypass; evaluated on three models with long contexts[2][3].
  • For DeepSeek models: 80GB DRAM/node; Qwen 32B uses 1.5TB DRAM/node total in comparisons[3].
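The dual-path routing idea above can be illustrated with a minimal sketch. This is not DualPath's actual scheduler logic (the paper's policy is not reproduced here); the `Request`/`Node` types, the `reuse_threshold` parameter, and the routing rule are all hypothetical, chosen only to show how a scheduler might pick between the traditional storage-to-prefill path and the direct storage-to-decode path based on cache reuse and idle NIC bandwidth:

```python
from dataclasses import dataclass

@dataclass
class Request:
    cached_tokens: int   # KV-Cache tokens reusable from storage (the long context)
    new_tokens: int      # fresh tokens to prefill (the short "append")

@dataclass
class Node:
    nic_busy_gbps: float
    nic_capacity_gbps: float

    @property
    def idle_gbps(self) -> float:
        return max(0.0, self.nic_capacity_gbps - self.nic_busy_gbps)

def route_kv_load(req: Request, prefill: Node, decode: Node,
                  reuse_threshold: float = 0.9) -> str:
    """Hypothetical policy: pick a KV-Cache loading path.

    High cache reuse plus spare decode-side NIC bandwidth -> load the
    cached KV directly storage-to-decode over the compute network,
    bypassing the prefill engine's storage path.  Otherwise fall back
    to the traditional storage-to-prefill path.
    """
    total = req.cached_tokens + req.new_tokens
    reuse = req.cached_tokens / total if total else 0.0
    if reuse >= reuse_threshold and decode.idle_gbps > prefill.idle_gbps:
        return "storage-to-decode"
    return "storage-to-prefill"

# A long-context, short-append request with a busy prefill NIC and an
# idle decode NIC takes the direct path:
print(route_kv_load(Request(9500, 500), Node(95, 100), Node(20, 100)))
# -> storage-to-decode
```

The key design point this sketch captures is that agentic workloads with high KV-Cache reuse leave the decode side's NIC mostly idle, so routing cached-context loads there relieves the prefill engine's I/O bottleneck.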

🔮 Future Implications

AI analysis grounded in cited sources.

  • DualPath will likely be integrated into DeepSeek's V4 inference stack: the paper aligns with V4 framework developments, uses kernels like DeepGEMM and DeepEP from DeepSeek's 2025 releases, and was tested on V3 in the lead-up to V4[1][3][4].
  • KV-Cache optimizations like DualPath can cut inference costs roughly 1.9x without hardware upgrades: evaluations show 1.87x offline and 1.96x online throughput gains on production workloads by pooling idle NIC bandwidth[1][2].
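A back-of-envelope model shows why pooling idle NIC bandwidth can roughly double KV-Cache loading throughput. The function and its numbers below are illustrative assumptions, not figures from the paper; it simply models the aggregate load bandwidth when idle decode-side NIC capacity is added to the prefill path:

```python
def kv_load_speedup(prefill_nic_gbps: float, decode_nic_gbps: float,
                    decode_utilization: float) -> float:
    """Illustrative: speedup in effective KV-Cache loading bandwidth
    when idle decode-side NIC capacity is pooled with the prefill path.
    """
    single_path = prefill_nic_gbps
    pooled = prefill_nic_gbps + decode_nic_gbps * (1 - decode_utilization)
    return pooled / single_path

# With equal NICs and the decode side ~10% busy, the pooled path has
# nearly twice the load bandwidth of prefill-only loading:
print(kv_load_speedup(400, 400, 0.10))  # -> 1.9
```

Under these assumed numbers the model lands near the ~1.9x throughput gain reported in the evaluations, which is consistent with the claim that the win comes from bandwidth pooling rather than faster hardware.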

Timeline

2024-12
DeepSeek-V3 model released
2025-02
DeepGEMM kernel released by DeepSeek-AI
2025-02
DeepSeek-AI publishes 3FS distributed storage
2026-02
DeepSeek V4 model launches mid-February
2026-02-25
DualPath paper submitted to arXiv


AI-curated news aggregator. All content rights belong to original publishers.
Original source: 机器之心