🤝 Together AI Blog • Mar 4, 2026

CPD: 40% Faster Long-Context LLM Serving

🤝Read original on Together AI Blog

#llm-inference #long-context #cache-optimization together-ai-cpd

💡 Cache-aware prefill-decode disaggregation (CPD) delivers 35-40% higher throughput for long-context LLM serving, which matters for scaling production inference.

⚡ 30-Second TL;DR

What Changed

Together AI introduces cache-aware prefill-decode disaggregation (CPD), an architecture for disaggregated LLM inference.

Why It Matters

CPD enables scalable production serving of long-context LLMs, cutting latency and costs for real-world apps. AI builders gain efficiency in handling extended inputs like full documents or conversations.

What To Do Next

Test Together AI's CPD serving endpoint with your own long-context prompts, and measure whether the reported 35-40% throughput gain holds for your workload.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

•CPD's three-level KV-cache hierarchy (GPU memory, host DRAM, and a cluster-wide distributed cache reached over RDMA) lets frequently accessed contexts migrate closer to the GPU over time, reducing latency from seconds to hundreds of milliseconds[3]
•Together AI reports sub-second to low-second median time-to-first-token (TTFT) even under saturation conditions where baseline systems exceed one second, validated on synthetic coding-agent workloads that mirror real AI-assisted development scenarios[2]
•The architecture directly addresses the economics of long-context serving: 35-40% throughput improvements translate to measurable infrastructure cost savings or increased user capacity on existing hardware for organizations running AI agents and retrieval-augmented generation systems[2]
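The migration behavior described in the first takeaway can be illustrated with a toy model. This is a minimal sketch, not Together AI's implementation: the class name `TieredKVCache`, the slot counts, the one-tier-per-hit promotion rule, and the LRU eviction policy are all assumptions made for illustration. Only the three tiers themselves come from the source.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy three-tier cache: GPU memory -> host DRAM -> cluster-wide cache.

    Hypothetical sketch; promotion and eviction policies are assumptions.
    """

    TIERS = ("gpu", "dram", "cluster")

    def __init__(self, gpu_slots=2, dram_slots=4):
        # Illustrative capacities; the cluster tier is treated as unbounded.
        self.capacity = {"gpu": gpu_slots, "dram": dram_slots, "cluster": None}
        self.tiers = {name: OrderedDict() for name in self.TIERS}

    def put(self, prefix_key, kv_state):
        """Store a newly computed KV state in the cluster-wide tier."""
        self.tiers["cluster"][prefix_key] = kv_state

    def get(self, prefix_key):
        """Return (tier_name, kv_state); a hit moves the entry one tier closer to the GPU."""
        for level, name in enumerate(self.TIERS):
            tier = self.tiers[name]
            if prefix_key in tier:
                kv_state = tier[prefix_key]
                if level == 0:
                    tier.move_to_end(prefix_key)  # refresh LRU position
                else:
                    del tier[prefix_key]
                    self._insert(level - 1, prefix_key, kv_state)
                return name, kv_state
        return None, None  # miss: the caller would run a full prefill

    def _insert(self, level, prefix_key, kv_state):
        name = self.TIERS[level]
        tier = self.tiers[name]
        tier[prefix_key] = kv_state
        cap = self.capacity[name]
        if cap is not None and len(tier) > cap:
            # Demote the least recently used entry one tier down.
            old_key, old_state = tier.popitem(last=False)
            self._insert(level + 1, old_key, old_state)


cache = TieredKVCache(gpu_slots=1, dram_slots=1)
cache.put("shared-codebase", "kv-blob")
hits = [cache.get("shared-codebase")[0] for _ in range(3)]
print(hits)  # ['cluster', 'dram', 'gpu']
```

Each repeated access finds the context one tier closer to the GPU, which is the effect the takeaway attributes to the hierarchy. A production system would also track sizes in bytes and transfer bandwidth between tiers, which this sketch ignores.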

🛠️ Technical Deep Dive

•CPD builds on prefill-decode disaggregation by also routing requests according to cache hit rate, isolating heavy (cold, low-hit-rate) prefills on dedicated pre-prefill nodes[3]
•KV cache state from cold requests is written to distributed cache; subsequent similar requests fetch this state in bulk at high bandwidth, converting seconds of compute into hundreds of milliseconds of transfer and light recomputation[3]
•The system was evaluated on two dimensions: latency and throughput scaling under increasing load (TTFT p50/p90 and per-GPU throughput), and effective serving capacity under contention (sustainable QPS before prefill-side saturation)[3]
•Decode-side parallelism (increasing from 1D to 2D configurations) improves overall throughput and delays saturation for both baseline and CPD configurations[3]
•Tested on NVIDIA B200 GPUs with synthetic workloads designed to mirror real AI-assisted development scenarios featuring large shared codebase context with multi-turn interactions[2]
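The cache-aware routing idea in the first bullet above can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not the routing logic Together AI ships: the `Request` fields, the `hit_threshold` and `heavy_tokens` parameters, their default values, and the pool names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int  # total tokens in the prompt
    cached_tokens: int  # tokens whose KV state is already cached somewhere


def cache_hit_rate(req: Request) -> float:
    """Fraction of the prompt that can be served from cached KV state."""
    if req.prompt_tokens == 0:
        return 1.0
    return req.cached_tokens / req.prompt_tokens


def route(req: Request, hit_threshold: float = 0.5, heavy_tokens: int = 8192) -> str:
    """Pick a node pool for the prefill phase.

    Cold, heavy prompts go to a dedicated pool so they cannot stall
    warm requests that only need a light prefill. Thresholds are assumptions.
    """
    uncached = req.prompt_tokens - req.cached_tokens
    if cache_hit_rate(req) < hit_threshold and uncached >= heavy_tokens:
        return "pre-prefill"
    return "prefill"


print(route(Request(prompt_tokens=100_000, cached_tokens=0)))       # pre-prefill
print(route(Request(prompt_tokens=100_000, cached_tokens=95_000)))  # prefill
print(route(Request(prompt_tokens=512, cached_tokens=0)))           # prefill
```

The third case shows why the sketch checks absolute uncached size as well as hit rate: a short prompt with no cache hits is still cheap to prefill, so sending it to the dedicated pool would gain nothing.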

🔮 Future Implications

AI analysis grounded in cited sources

Long-context serving will become cost-competitive with short-context inference, eliminating the current economic penalty for expanded context windows

40% throughput gains directly reduce per-token serving costs, making retrieval-augmented generation and multi-turn agent systems economically viable at scale[2]
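To make the cost claim concrete, here is a back-of-the-envelope calculation. It assumes the hardware cost per GPU-hour stays fixed and the GPUs stay fully utilized, so cost per token scales as the inverse of throughput; the function name `cost_reduction` is hypothetical, and real savings would depend on utilization and pricing.

```python
def cost_reduction(throughput_gain: float) -> float:
    """Fractional drop in cost per token when throughput rises by `throughput_gain`.

    Assumes fixed cost per GPU-hour, so cost per token is proportional to
    1 / throughput.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


for gain in (0.35, 0.40):
    print(f"{gain:.0%} more throughput -> {cost_reduction(gain):.1%} lower cost per token")
# 35% more throughput -> 25.9% lower cost per token
# 40% more throughput -> 28.6% lower cost per token
```

Under these assumptions a 35-40% throughput gain corresponds to roughly 26-29% lower per-token cost rather than a 35-40% saving, because the cost is spread over more tokens and does not shrink in absolute terms.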

Workload-aware infrastructure separation will become standard practice rather than an edge-case optimization

As foundation models continue expanding context capabilities, the industry will need increasingly sophisticated cache-aware disaggregation to avoid letting expensive cold prompts dominate shared resources[2]

Distributed KV cache hierarchies will enable new multi-region deployment patterns for latency-sensitive applications

RDMA-connected cluster-wide caching allows frequently accessed contexts to be shared across geographically distributed inference nodes, reducing recomputation costs[3]

⏳ Timeline

2026-02

Together AI unveils cache-aware prefill-decode disaggregation (CPD) architecture achieving 35-40% throughput improvement on NVIDIA B200 GPUs

2026-02

CPD research published demonstrating sub-second TTFT maintenance under load conditions where baseline systems saturate

📎 Sources (8)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
