
DeepSeek's DualPath Boosts Agent LLM Inference 1.9x


💡1.9x throughput for agent LLMs via KV-Cache bandwidth pooling

⚡ 30-Second TL;DR

What Changed

Introduces DualPath for agentic LLM workloads with 'long context, short append' characteristics

Why It Matters

DualPath enables scalable multi-turn agent deployments by breaking per-node I/O limits, potentially cutting inference costs for production AI agents.

What To Do Next

Download the DualPath paper from arXiv and prototype the dual-path KV-Cache loading in your LLM serving cluster.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 7 cited sources.

🔑 Enhanced Key Takeaways

  • DualPath was tested on DeepSeek-V3 (660B scale) and Qwen models in both offline rollout and online service scenarios, significantly reducing Time To First Token (TTFT) under high load while keeping Time Between Tokens (TBT) stable[1][2][3].
  • The system uses 80GB DRAM per node for DeepSeek models and is implemented on an in-house framework with ~5K lines of code changes, incorporating FlashMLA, DeepGEMM, and DeepEP kernels, 3FS storage, and io_uring-like kernel bypass[3].
  • Paper submitted to arXiv on February 25, 2026, by authors from DeepSeek, Tsinghua, and Peking University, targeting disaggregated architectures with production agentic workloads featuring high KV-Cache reuse[2].

🛠️ Technical Deep Dive

  • Implements dual-path KV-Cache loading: traditional storage-to-prefill plus novel storage-to-decode path with RDMA transfer over compute network to avoid congestion and interference[2][3].
  • Global scheduler dynamically balances load across prefill and decode engines; uses adaptive scheduling and strict traffic isolation[1][2].
  • Built on in-house inference stack with FlashMLA (Li and Liu, 2025), DeepGEMM (DeepSeek-AI, 2025b), DeepEP (Zhao et al., 2025b) CUDA kernels; ~5K LOC modifications[3].
  • Storage via 3FS (DeepSeek-AI, 2025a) with io_uring-like interface for kernel bypass; evaluated on three models with long contexts[2][3].
  • For DeepSeek models: 80GB DRAM/node; Qwen 32B uses 1.5TB DRAM/node total in comparisons[3].
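The dual-path routing idea above can be illustrated with a minimal sketch. This is not DualPath's actual scheduler logic (the paper's policy is not reproduced here); the `Request`/`Node` types, the `reuse_threshold` parameter, and the routing rule are all hypothetical, chosen only to show how a scheduler might pick between the traditional storage-to-prefill path and the direct storage-to-decode path based on cache reuse and idle NIC bandwidth:

```python
from dataclasses import dataclass

@dataclass
class Request:
    cached_tokens: int   # KV-Cache tokens reusable from storage (the long context)
    new_tokens: int      # fresh tokens to prefill (the short "append")

@dataclass
class Node:
    nic_busy_gbps: float
    nic_capacity_gbps: float

    @property
    def idle_gbps(self) -> float:
        return max(0.0, self.nic_capacity_gbps - self.nic_busy_gbps)

def route_kv_load(req: Request, prefill: Node, decode: Node,
                  reuse_threshold: float = 0.9) -> str:
    """Hypothetical policy: pick a KV-Cache loading path.

    High cache reuse plus spare decode-side NIC bandwidth -> load the
    cached KV directly storage-to-decode over the compute network,
    bypassing the prefill engine's storage path.  Otherwise fall back
    to the traditional storage-to-prefill path.
    """
    total = req.cached_tokens + req.new_tokens
    reuse = req.cached_tokens / total if total else 0.0
    if reuse >= reuse_threshold and decode.idle_gbps > prefill.idle_gbps:
        return "storage-to-decode"
    return "storage-to-prefill"

# A long-context, short-append request with a busy prefill NIC and an
# idle decode NIC takes the direct path:
print(route_kv_load(Request(9500, 500), Node(95, 100), Node(20, 100)))
# -> storage-to-decode
```

The key design point this sketch captures is that agentic workloads with high KV-Cache reuse leave the decode side's NIC mostly idle, so routing cached-context loads there relieves the prefill engine's I/O bottleneck.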

🔮 Future Implications

AI analysis grounded in cited sources.

  • DualPath will likely be integrated into DeepSeek's V4 inference stack: the paper aligns with V4 framework developments, uses kernels like DeepGEMM and DeepEP from DeepSeek's 2025 releases, and was tested on V3 in the lead-up to V4[1][3][4].
  • KV-Cache optimizations like DualPath can cut inference costs roughly 1.9x without hardware upgrades: evaluations show 1.87x offline and 1.96x online throughput gains on production workloads by pooling idle NIC bandwidth[1][2].
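A back-of-envelope model shows why pooling idle NIC bandwidth can roughly double KV-Cache loading throughput. The function and its numbers below are illustrative assumptions, not figures from the paper; it simply models the aggregate load bandwidth when idle decode-side NIC capacity is added to the prefill path:

```python
def kv_load_speedup(prefill_nic_gbps: float, decode_nic_gbps: float,
                    decode_utilization: float) -> float:
    """Illustrative: speedup in effective KV-Cache loading bandwidth
    when idle decode-side NIC capacity is pooled with the prefill path.
    """
    single_path = prefill_nic_gbps
    pooled = prefill_nic_gbps + decode_nic_gbps * (1 - decode_utilization)
    return pooled / single_path

# With equal NICs and the decode side ~10% busy, the pooled path has
# nearly twice the load bandwidth of prefill-only loading:
print(kv_load_speedup(400, 400, 0.10))  # -> 1.9
```

Under these assumed numbers the model lands near the ~1.9x throughput gain reported in the evaluations, which is consistent with the claim that the win comes from bandwidth pooling rather than faster hardware.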

Timeline

2024-12
DeepSeek-V3 model released
2025-02
DeepGEMM kernel released by DeepSeek-AI
2025-02
DeepSeek-AI publishes 3FS distributed storage
2026-02
DeepSeek V4 model launches mid-February
2026-02-25
DualPath paper submitted to arXiv


AI-curated news aggregator. All content rights belong to original publishers.
Original source: 机器之心