Run 35B/120B Models on 5060ti + 1080ti
💡 Hack: Combine a 5060 Ti and a 1080 Ti for 60 t/s on 35B Qwen via llama.cpp RPC (full guide)
⚡ 30-Second TL;DR
What Changed
Qwen3.5-35B-A3B Q4_K_M: 60 tok/s on a 5060 Ti + 1080 Ti via RPC
Why It Matters
Enables inference of large models on consumer and older GPUs, extending hardware lifespan and democratizing local access to 100B+ LLMs.
What To Do Next
Build llama.cpp with the CUDA and RPC flags enabled, then test Qwen3.5-35B across mixed GPUs using VM passthrough.
🧠 Deep Insight
Web-grounded analysis with 9 cited sources.
📌 Enhanced Key Takeaways
- llama.cpp RPC mode enables distributed inference across heterogeneous GPUs by handling tensor transfers and synchronization, though TCP/IP latency prevents performance gains over single-node setups; tensor parallelism via vLLM/SGLang/TRTLLM with NCCL over RDMA is significantly more efficient for large models[1].
- Recent llama.cpp optimizations (as of January–February 2026) include WebGPU backend software pipelining for flash attention, RISC-V vector support yielding a 46% speedup on float32 operations, and AMD EPYC tiled flash attention for long-context prompt processing, expanding hardware compatibility beyond NVIDIA[7][8].
- Using VM passthrough to bypass driver conflicts between mixed GPU architectures is a workaround rather than an optimal solution; native support for heterogeneous-GPU inference in llama.cpp remains limited compared to frameworks like vLLM that implement proper tensor splitting across device types[1][3].
🛠️ Technical Deep Dive
llama.cpp RPC Architecture & Limitations:
- RPC mode does not support mixed CPU and GPU offload; only GPU offload is functional[3]
- TCP/IP communication kills performance; llama.cpp does not implement RDMA, limiting distributed speedup to pipeline parallelism (PP) only[1]
- Tensor parallelism (TP) requires frameworks like vLLM/SGLang/TRTLLM that use NCCL over RDMA with microsecond-level latency for all-reduce operations after each layer[1]
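The PP-only RPC topology described above can be sketched with llama.cpp's bundled `rpc-server` tool. A minimal sketch, assuming llama.cpp was built with `-DGGML_RPC=ON`; the hostnames, ports, and model path are placeholders:

```shell
# On the worker machine (e.g. the VM holding the 1080 Ti):
# rpc-server exposes the local GPU backend over TCP/IP.
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# On the main machine (5060 Ti): point llama-cli at the worker.
# Layers are pipeline-split between the local GPU and the RPC backend;
# there is no tensor parallelism, matching the PP-only limitation above.
./build/bin/llama-cli -m qwen-35b-q4_k_m.gguf \
  --rpc 192.168.1.20:50052 \
  -ngl 999 -fa on -p "Hello"
```

Every layer handoff between the two stages crosses the TCP link, which is why this setup adds capacity (more VRAM) rather than speed.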
GPU Optimization Flags:
- `-DGGML_CUDA=ON` enables NVIDIA CUDA acceleration
- `-DGGML_RPC=ON` enables the RPC backend for distributed inference
- `-fa on` enables flash attention (rocWMMA on AMD, standard on NVIDIA)
- `-ngl` sets the GPU layer offload count (999 ensures full GPU offload)[5]
- `--CUDA_GRAPH_OPT=1` enables concurrent CUDA streams for QKV projections in newer builds[1]
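The build-time flags above map onto a standard CMake configure step. A sketch, assuming a recent llama.cpp checkout and a working CUDA toolkit; adjust generator and job count for your system:

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Enable both the CUDA and RPC backends at configure time.
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release -j
```

The runtime flags (`-fa`, `-ngl`, `--rpc`) are then passed to the resulting binaries, not to CMake.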
Performance Benchmarks (Recent Data):
- Blackwell optimizations: gpt-oss-120b prompt processing improved from ~1900 t/s to ~2400 t/s[1]
- CUDA vs. Vulkan on RTX 3060: CUDA provides ~7% performance boost over Vulkan[4]
- RPC overhead example: Q9650 CPU + GTX 1070 showed 566.47 t/s prompt processing vs. 601.75 t/s local GPU mode (5.9% degradation)[3]
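Overhead numbers like the 5.9% figure above can be measured with llama.cpp's bundled `llama-bench` tool, which reports prompt-processing (pp) and token-generation (tg) rates. A sketch; the model path and RPC endpoint are placeholders:

```shell
# Local-GPU baseline.
./build/bin/llama-bench -m model.gguf -ngl 999

# Same model routed through the RPC worker, to measure overhead directly.
./build/bin/llama-bench -m model.gguf -ngl 999 --rpc 192.168.1.20:50052
```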
🔮 Future Implications
AI analysis grounded in cited sources.
⏳ Timeline
📚 Sources (9)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- forums.developer.nvidia.com → 355864
- GitHub → 15021
- GitHub → 7293
- youtube.com → Watch
- amd.com → How to Run a One Trillion Parameter LLM Locally on AMD
- wiki.seeedstudio.com → AI Robotics Distributed Llama Cpp Rpc Jetson
- buttondown.com → Weekly Github Report for Llamacpp January 16 2026
- buttondown.com → Weekly Github Report for Llamacpp January 25 2026 7750
- buttondown.com → Weekly Github Report for Llamacpp February 16 5844
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA