CPD: 40% Faster Long-Context LLM Serving
💡 CPD serves long-context LLMs up to 40% faster, which matters for scaling production inference.
⚡ 30-Second TL;DR
What Changed
Together AI introduces CPD (cache-aware prefill-decode disaggregation), an architecture for disaggregated LLM inference.
Why It Matters
CPD enables scalable production serving of long-context LLMs, cutting latency and costs for real-world apps. AI builders gain efficiency in handling extended inputs like full documents or conversations.
What To Do Next
Test Together AI's CPD serving endpoint with your own long-context prompts to see whether the reported 35-40% throughput gains hold for your workload.
🧠 Deep Insight
Web-grounded analysis with 8 cited sources.
🔑 Enhanced Key Takeaways
- CPD's three-level KV-cache hierarchy (GPU memory, host DRAM, and a cluster-wide distributed cache reached over RDMA) lets frequently accessed contexts migrate closer to the GPU over time, reducing latency from seconds to hundreds of milliseconds[3]
- Together AI's infrastructure achieves sub-second to low-second median time-to-first-token (TTFT) even under saturation, where baseline systems exceed one second. The result was validated on synthetic coding-agent workloads that mirror real AI-assisted development scenarios[2]
- The architecture directly addresses the economics of long-context serving: for organizations running AI agents and retrieval-augmented generation systems, throughput improvements of 35-40% translate into measurable infrastructure cost savings or more user capacity on existing hardware[2]
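The migration behavior in the first takeaway can be illustrated with a toy model. This is a minimal sketch, not Together AI's implementation: the tier names, the `TieredKVCache` class, the LRU eviction policy, and the one-level-per-hit promotion rule are all assumptions made for illustration, and plain byte strings stand in for real KV-cache blocks.

```python
from collections import OrderedDict
from typing import Optional

# Hypothetical tier names, ordered from closest to the GPU to farthest.
TIERS = ("gpu", "host_dram", "distributed")


class TieredKVCache:
    """Toy three-level cache; values stand in for KV-cache blocks."""

    def __init__(self, gpu_slots: int, host_slots: int) -> None:
        # The two local tiers are bounded LRU maps. The distributed tier is
        # modeled as unbounded, which is a simplifying assumption.
        self.capacity = {"gpu": gpu_slots, "host_dram": host_slots}
        self.tiers = {name: OrderedDict() for name in TIERS}

    def put_cold(self, prefix_id: str, kv_state: bytes) -> None:
        """Store state computed for a cold request in the farthest tier."""
        self.tiers["distributed"][prefix_id] = kv_state

    def _insert(self, tier: str, prefix_id: str, kv_state: bytes) -> None:
        """Insert into a bounded tier, demoting its LRU entry on overflow."""
        entries = self.tiers[tier]
        entries[prefix_id] = kv_state
        entries.move_to_end(prefix_id)
        if len(entries) > self.capacity[tier]:
            victim_id, victim_state = entries.popitem(last=False)
            lower = TIERS[TIERS.index(tier) + 1]
            if lower == "distributed":
                self.tiers[lower][victim_id] = victim_state
            else:
                self._insert(lower, victim_id, victim_state)

    def get(self, prefix_id: str) -> Optional[str]:
        """Return the tier that served the hit and promote the entry one level."""
        for i, tier in enumerate(TIERS):
            if prefix_id in self.tiers[tier]:
                if i == 0:
                    self.tiers[tier].move_to_end(prefix_id)  # refresh LRU order
                else:
                    kv_state = self.tiers[tier].pop(prefix_id)
                    self._insert(TIERS[i - 1], prefix_id, kv_state)
                return tier
        return None  # full miss: the caller would have to run prefill
```

With one slot per local tier, a context stored through `put_cold` is served from `"distributed"` on its first `get`, from `"host_dram"` on its second, and from `"gpu"` after that, which is the "closer to the GPU over time" effect in miniature.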
🛠️ Technical Deep Dive
- CPD builds on prefill-decode disaggregation by routing requests according to cache hit rate, isolating heavy (cold) prefills on dedicated pre-prefill nodes[3]
- KV-cache state from cold requests is written to the distributed cache; subsequent similar requests fetch this state in bulk at high bandwidth, converting seconds of compute into hundreds of milliseconds of transfer and light recomputation[3]
- The system was evaluated on two dimensions: latency and throughput scaling under increasing load (TTFT p50/p90 and per-GPU throughput), and effective serving capacity under contention (sustainable QPS before prefill-side saturation)[3]
- Decode-side parallelism (increasing from 1D to 2D configurations) improves overall throughput and delays saturation for both baseline and CPD configurations[3]
- Tests ran on NVIDIA B200 GPUs with synthetic workloads designed to mirror real AI-assisted development scenarios: a large shared codebase context with multi-turn interactions[2]
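The routing idea in the first two bullets can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the published design: the 0.5 threshold, the `Request` fields, and the `"prefill"` pool name are hypothetical, and only the notion of dedicated pre-prefill nodes for cold requests comes from the source.

```python
from dataclasses import dataclass

# Hypothetical threshold: requests with a smaller cached fraction than this
# are treated as cold. The source does not state an actual value.
COLD_HIT_RATE_THRESHOLD = 0.5


@dataclass
class Request:
    prompt_tokens: int  # total prompt length in tokens
    cached_tokens: int  # prompt tokens whose KV state is already cached

    @property
    def hit_rate(self) -> float:
        """Fraction of the prompt that can be served from cached KV state."""
        if self.prompt_tokens == 0:
            return 1.0  # nothing to prefill
        return self.cached_tokens / self.prompt_tokens


def route(request: Request, threshold: float = COLD_HIT_RATE_THRESHOLD) -> str:
    """Pick a node pool for the prefill phase of a request."""
    if request.hit_rate < threshold:
        # Cold request: heavy prefill, isolated so it cannot stall warm traffic.
        return "pre_prefill"
    # Warm request: fetch cached state and recompute only the uncached tail.
    return "prefill"
```

For example, `route(Request(prompt_tokens=100_000, cached_tokens=0))` selects the pre-prefill pool, while a follow-up turn with 95,000 of those tokens cached goes to the regular prefill pool. A real scheduler would likely also weigh queue depth and transfer cost; those factors are left out here.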
📎 Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: Together AI Blog