
Run 35B/120B Models on 5060ti + 1080ti

🦙 Read original on Reddit r/LocalLLaMA

💡 Hack: Combine 5060ti+1080ti for 60 t/s on 35B Qwen via llama.cpp RPC (full guide)

⚡ 30-Second TL;DR

What Changed

Qwen3.5-35B-A3B Q4_K_M: 60 tok/s on 5060ti + 1080ti via RPC

Why It Matters

Enables inference of massive models on consumer and older GPUs, extending hardware lifespan and democratizing local access to 100B+ LLMs.

What To Do Next

Build llama.cpp with CUDA and RPC flags, then test Qwen3.5-35B on mixed GPUs using VM passthrough.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 9 cited sources.

🔑 Enhanced Key Takeaways

  • llama.cpp RPC mode enables distributed inference across heterogeneous GPUs by handling tensor transfers and synchronization, though TCP/IP latency prevents performance gains over single-node setups; tensor parallelism via vLLM/SGLang/TRTLLM with NCCL over RDMA is significantly more efficient for large models[1].
  • Recent llama.cpp optimizations (as of January-February 2026) include WebGPU backend software pipelining for flash attention, RISC-V vector support yielding a 46% speedup on float32 operations, and AMD EPYC tiled flash attention for long-context prompt processing, expanding hardware compatibility beyond NVIDIA[7][8].
  • Mixed-GPU setups that use VM passthrough to bypass driver conflicts are a workaround rather than an optimal solution; native support for heterogeneous GPU inference in llama.cpp remains limited compared to frameworks like vLLM that implement proper tensor splitting across device types[1][3].

๐Ÿ› ๏ธ Technical Deep Dive

llama.cpp RPC Architecture & Limitations:

  • RPC mode does not support mixed CPU and GPU offload; GPU offload only is functional[3]
  • TCP/IP communication kills performance; llama.cpp cannot implement RDMA, limiting distributed speedup to pipeline parallelism (PP) only[1]
  • Tensor parallelism (TP) requires frameworks like vLLM/SGLang/TRTLLM that use NCCL over RDMA with microsecond-level latency for all-reduce operations after each layer[1]
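A two-node RPC launch under the constraints above can be sketched as follows. This is a minimal sketch, not from the source post: the port, IP address, and model filename are placeholders, and binary paths assume a default CMake build of llama.cpp.

```shell
# On the secondary machine (e.g. the 1080ti box): expose its GPU over RPC.
# 0.0.0.0 binds all interfaces; 50052 is an arbitrary free port.
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# On the primary machine (e.g. the 5060ti box): run inference, splitting
# model layers between the local CUDA backend and the remote RPC backend.
./build/bin/llama-cli \
  -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --rpc 192.168.1.20:50052 \
  -ngl 999
```

Because the split here is pipeline-parallel, layer activations cross the TCP link once per token, which is where the latency penalty described above comes from.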

GPU Optimization Flags:

  • -DGGML_CUDA=ON enables NVIDIA CUDA acceleration
  • -DGGML_RPC=ON enables RPC for distributed inference
  • -fa on enables flash attention (rocWMMA on AMD, standard on NVIDIA)
  • -ngl parameter specifies GPU layer offload count (999 ensures full GPU offload)[5]
  • --CUDA_GRAPH_OPT=1 enables concurrent CUDA streams for QKV projections in newer builds[1]
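The build and run flags above combine roughly as follows; a minimal sketch assuming a default CMake layout and a placeholder model file:

```shell
# Configure with the CUDA and RPC backends enabled, then build.
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release -j

# At run time: enable flash attention and offload all layers to the GPU.
./build/bin/llama-cli -m model.gguf -fa on -ngl 999 -p "Hello"
```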

Performance Benchmarks (Recent Data):

  • Blackwell optimizations: gpt-oss-120b prompt processing improved from ~1900 t/s to ~2400 t/s[1]
  • CUDA vs. Vulkan on RTX 3060: CUDA provides ~7% performance boost over Vulkan[4]
  • RPC overhead example: Q9650 CPU + GTX 1070 showed 566.47 t/s prompt processing vs. 601.75 t/s local GPU mode (5.9% degradation)[3]
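The quoted RPC degradation follows directly from the two throughput numbers; a quick check with awk:

```shell
# Relative slowdown of RPC mode vs. local GPU mode, from the data above.
rpc=566.47
base=601.75
awk -v a="$rpc" -v b="$base" \
  'BEGIN { printf "%.1f%% degradation\n", (1 - a/b) * 100 }'
# prints "5.9% degradation"
```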

🔮 Future Implications

AI analysis grounded in cited sources

  • RDMA support in llama.cpp would unlock true distributed inference gains, but the current TCP/IP architecture fundamentally limits multi-GPU scaling beyond pipeline parallelism.
  • vLLM's tensor parallelism with NCCL/RDMA achieves microsecond all-reduce latency, whereas llama.cpp RPC shows measurable slowdowns; architectural changes would be required to match this efficiency.
  • Heterogeneous GPU setups (mixed architectures) will require native framework support rather than VM workarounds as model sizes exceed 100B parameters.
  • VM passthrough solves driver conflicts but introduces virtualization overhead; frameworks like vLLM already support mixed-precision and multi-architecture tensor splitting natively.

โณ Timeline

2024-05
llama.cpp RPC mode stabilized with SO_REUSEADDR socket fix and mixed CPU/GPU offload support improvements
2025-01
RISC-V vector support added for SSM scan operations (46% speedup); AMD EPYC tiled flash attention optimization deployed
2025-08
rocWMMA flash attention library integration for RDNA3+ and CDNA architectures; performance benchmarking on AMD 7900 XTX and ROCm 6.3.4
2025-09
Vulkan backend established as primary AMD GPU inference path; demonstrated parity with ROCm for LLM inference on consumer GPUs
2026-01
WebGPU backend refactored with software pipelining and vectorization for flash attention; multi-GPU CUDA crash issues reported and addressed
2026-03
AMD Ryzen AI Max+ cluster support documented with llama.cpp RPC sharding across nodes; Lemonade SDK nightly builds with ROCm 7 acceleration


AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗