CPD: 40% Faster Long-Context LLM Serving

🤝Read original on Together AI Blog

💡40% faster long-context LLM serving via CPD—essential for production inference scaling.

⚡ 30-Second TL;DR

What Changed

Introduces cache-aware prefill-decode disaggregation (CPD), an architecture for disaggregated LLM inference

Why It Matters

CPD enables scalable production serving of long-context LLMs, cutting latency and costs for real-world apps. AI builders gain efficiency in handling extended inputs like full documents or conversations.

What To Do Next

Test Together AI's CPD serving endpoint with your own long-context prompts to check whether the reported 35-40% throughput gains hold for your workload.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • CPD's three-level KV-cache hierarchy (GPU memory, host DRAM, and a cluster-wide distributed cache reached over RDMA) lets frequently accessed contexts migrate closer to the GPU over time, reducing latency from seconds to hundreds of milliseconds[3]
  • Together AI reports median time-to-first-token (TTFT) in the sub-second to low-second range, even at load levels where baseline systems saturate and exceed one second; the results were validated on synthetic coding-agent workloads that mirror real AI-assisted development scenarios[2]
  • The architecture directly addresses the economics of long-context serving: 35-40% throughput improvements translate to measurable infrastructure cost savings or increased user capacity on existing hardware for organizations running AI agents and retrieval-augmented generation systems[2]
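
The three-level hierarchy in the first takeaway can be illustrated with a toy model. This is a minimal sketch, not Together AI's implementation: the class `TieredKVCache`, its methods, the slot counts, and the promote-one-tier-per-hit policy are all hypothetical, and plain Python dictionaries stand in for GPU memory, host DRAM, and the RDMA-backed cluster cache.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier cache: GPU memory, host DRAM, cluster-wide store.

    Hypothetical illustration only; real systems move KV tensors, not strings.
    """

    TIERS = ("gpu", "dram", "cluster")

    def __init__(self, gpu_slots=2, dram_slots=4):
        # The cluster tier is treated as unbounded in this sketch.
        self.capacity = {"gpu": gpu_slots, "dram": dram_slots, "cluster": None}
        self.tiers = {name: OrderedDict() for name in self.TIERS}

    def _put(self, tier_index, key, value):
        name = self.TIERS[tier_index]
        tier = self.tiers[name]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        cap = self.capacity[name]
        if cap is not None and len(tier) > cap:
            # Demote the least recently used entry one tier down.
            old_key, old_value = tier.popitem(last=False)
            self._put(tier_index + 1, old_key, old_value)

    def write_cold(self, key, value):
        """Store KV state produced by a cold prefill in the cluster tier."""
        self._put(2, key, value)

    def lookup(self, key):
        """Return (tier_found, value) and promote the entry one tier closer."""
        for index, name in enumerate(self.TIERS):
            tier = self.tiers[name]
            if key in tier:
                value = tier.pop(key)
                self._put(max(index - 1, 0), key, value)
                return name, value
        return None, None


cache = TieredKVCache()
cache.write_cold("shared-codebase", "kv-state")
print([cache.lookup("shared-codebase")[0] for _ in range(3)])
```

Each repeated lookup finds the context one tier closer to the GPU, which is the migration behaviour the takeaway describes; overflow in a bounded tier pushes the least recently used entry back down.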

🛠️ Technical Deep Dive

  • CPD builds on prefill-decode disaggregation by also splitting prefill work according to cache hit rate: heavy prefills with little reusable cached context are isolated on dedicated pre-prefill nodes[3]
  • KV cache state from cold requests is written to distributed cache; subsequent similar requests fetch this state in bulk at high bandwidth, converting seconds of compute into hundreds of milliseconds of transfer and light recomputation[3]
  • System evaluated on two critical dimensions: latency and throughput scaling under increasing load (TTFT p50/p90 and per-GPU throughput), and effective serving capacity under contention (sustainable QPS before prefill-side saturation)[3]
  • Decode-side parallelism (increasing from 1D to 2D configurations) improves overall throughput and delays saturation for both baseline and CPD configurations[3]
  • Tested on NVIDIA B200 GPUs with synthetic workloads designed to mirror real AI-assisted development scenarios featuring large shared codebase context with multi-turn interactions[2]
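
The cache-aware routing described in the first bullet above can be sketched as a small decision function. This is an assumption-laden illustration: the `Request` fields, the pool names `"pre_prefill"` and `"prefill"`, and both threshold values are hypothetical and do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int   # total tokens in the prompt
    cached_tokens: int   # tokens whose KV state is already cached somewhere


def cache_hit_rate(req: Request) -> float:
    """Fraction of the prompt that can be served from cached KV state."""
    if req.prompt_tokens == 0:
        return 1.0
    return req.cached_tokens / req.prompt_tokens


def route(req: Request,
          hit_rate_threshold: float = 0.5,       # hypothetical cutoff
          heavy_prefill_tokens: int = 8192) -> str:  # hypothetical cutoff
    """Send cold, heavy prefills to a dedicated pool; keep the rest together."""
    uncached = req.prompt_tokens - req.cached_tokens
    is_cold = cache_hit_rate(req) < hit_rate_threshold
    is_heavy = uncached >= heavy_prefill_tokens
    if is_cold and is_heavy:
        return "pre_prefill"
    return "prefill"


print(route(Request(prompt_tokens=100_000, cached_tokens=0)))       # cold, heavy
print(route(Request(prompt_tokens=100_000, cached_tokens=95_000)))  # warm
```

Keeping cold, heavy prompts off the regular prefill pool is what prevents them from delaying warm requests that only need a cache fetch and light recomputation.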

🔮 Future Implications
AI analysis grounded in cited sources

Long-context serving will become more cost-competitive with short-context inference, shrinking the current economic penalty for expanded context windows.
Throughput gains of 35-40% directly reduce per-token serving costs, making retrieval-augmented generation and multi-turn agent systems more economically viable at scale[2]

Workload-aware infrastructure separation will become standard practice rather than an edge-case optimization.
As foundation models continue expanding context capabilities, the industry will need increasingly sophisticated cache-aware disaggregation to keep expensive cold prompts from dominating shared resources[2]

Distributed KV cache hierarchies may enable new multi-node deployment patterns for latency-sensitive applications.
RDMA-connected cluster-wide caching lets frequently accessed contexts be shared across inference nodes in a cluster, reducing recomputation costs[3]. Extending this sharing across regions is speculative, since RDMA fabrics are typically confined to a single data center.
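
The per-token cost claim above can be made concrete with a short calculation. It is a sketch under one stated assumption that is not in the source: hardware cost per hour is fixed, so cost per token is inversely proportional to throughput. The function name `cost_reduction` is hypothetical.

```python
def cost_reduction(throughput_gain: float) -> float:
    """Fractional drop in cost per token for a fractional throughput gain.

    Assumes fixed hardware cost per hour, so cost per token scales as
    1 / throughput. A gain of 0.40 means 1.40x the baseline throughput.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


for gain in (0.35, 0.40):
    print(f"{gain:.0%} more throughput -> {cost_reduction(gain):.1%} lower cost per token")
```

Under that assumption, the reported 35-40% throughput gain corresponds to roughly 26% to 29% lower cost per token, so the saving is smaller than the headline throughput figure.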

Timeline

2026-02
Together AI unveils cache-aware prefill-decode disaggregation (CPD) architecture achieving 35-40% throughput improvement on NVIDIA B200 GPUs
2026-02
CPD research published, showing that sub-second TTFT is maintained under load conditions where baseline systems saturate