
Run 35B/120B Models on 5060ti + 1080ti

🦙 Read original on Reddit r/LocalLLaMA

💡 Hack: Combine 5060ti+1080ti for 60 t/s on 35B Qwen via llama.cpp RPC (full guide)

⚡ 30-Second TL;DR

What Changed

Qwen3.5-35B-A3B Q4_K_M: 60 tok/s on 5060ti + 1080ti via RPC

Why It Matters

Enables inference of massive models on consumer and older GPUs, extending hardware lifespan and democratizing local access to 100B+ LLMs.

What To Do Next

Build llama.cpp with CUDA and RPC flags, then test Qwen3.5-35B on mixed GPUs using VM passthrough.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 9 cited sources.

🔑 Enhanced Key Takeaways

  • llama.cpp RPC mode enables distributed inference across heterogeneous GPUs by handling tensor transfers and synchronization, though TCP/IP latency prevents performance gains over single-node setups; tensor parallelism via vLLM/SGLang/TRTLLM with NCCL over RDMA is significantly more efficient for large models[1].
  • Recent llama.cpp optimizations (as of January-February 2026) include WebGPU backend software pipelining for flash attention, RISC-V vector support yielding a 46% speedup on float32 operations, and AMD EPYC tiled flash attention for long-context prompt processing, expanding hardware compatibility beyond NVIDIA[7][8].
  • Mixed-GPU setups that use VM passthrough to bypass driver conflicts are a workaround rather than an optimal solution; native support for heterogeneous GPU inference in llama.cpp remains limited compared to frameworks like vLLM that implement proper tensor splitting across device types[1][3].

๐Ÿ› ๏ธ Technical Deep Dive

llama.cpp RPC Architecture & Limitations:

  • RPC mode does not support mixed CPU and GPU offload; GPU offload only is functional[3]
  • TCP/IP communication kills performance; llama.cpp cannot implement RDMA, limiting distributed speedup to pipeline parallelism (PP) only[1]
  • Tensor parallelism (TP) requires frameworks like vLLM/SGLang/TRTLLM that use NCCL over RDMA with microsecond-level latency for all-reduce operations after each layer[1]
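A two-node RPC launch under the constraints above can be sketched as follows. This is a minimal sketch, not from the source post: the port, IP address, and model filename are placeholders, and binary paths assume a default CMake build of llama.cpp.

```shell
# On the secondary machine (e.g. the 1080ti box): expose its GPU over RPC.
# 0.0.0.0 binds all interfaces; 50052 is an arbitrary free port.
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# On the primary machine (e.g. the 5060ti box): run inference, splitting
# model layers between the local CUDA backend and the remote RPC backend.
./build/bin/llama-cli \
  -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --rpc 192.168.1.20:50052 \
  -ngl 999
```

Because the split here is pipeline-parallel, layer activations cross the TCP link once per token, which is where the latency penalty described above comes from.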

GPU Optimization Flags:

  • -DGGML_CUDA=ON enables NVIDIA CUDA acceleration
  • -DGGML_RPC=ON enables RPC for distributed inference
  • -fa on enables flash attention (rocWMMA on AMD, standard on NVIDIA)
  • -ngl parameter specifies GPU layer offload count (999 ensures full GPU offload)[5]
  • --CUDA_GRAPH_OPT=1 enables concurrent CUDA streams for QKV projections in newer builds[1]
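The build and run flags above combine roughly as follows; a minimal sketch assuming a default CMake layout and a placeholder model file:

```shell
# Configure with the CUDA and RPC backends enabled, then build.
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release -j

# At run time: enable flash attention and offload all layers to the GPU.
./build/bin/llama-cli -m model.gguf -fa on -ngl 999 -p "Hello"
```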

Performance Benchmarks (Recent Data):

  • Blackwell optimizations: gpt-oss-120b prompt processing improved from ~1900 t/s to ~2400 t/s[1]
  • CUDA vs. Vulkan on RTX 3060: CUDA provides ~7% performance boost over Vulkan[4]
  • RPC overhead example: Q9650 CPU + GTX 1070 showed 566.47 t/s prompt processing vs. 601.75 t/s local GPU mode (5.9% degradation)[3]
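The quoted RPC degradation follows directly from the two throughput numbers; a quick check with awk:

```shell
# Relative slowdown of RPC mode vs. local GPU mode, from the data above.
rpc=566.47
base=601.75
awk -v a="$rpc" -v b="$base" \
  'BEGIN { printf "%.1f%% degradation\n", (1 - a/b) * 100 }'
# prints "5.9% degradation"
```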

🔮 Future Implications

AI analysis grounded in cited sources

  • RDMA support in llama.cpp would unlock true distributed inference gains, but the current TCP/IP architecture fundamentally limits multi-GPU scaling beyond pipeline parallelism.
  • vLLM's tensor parallelism with NCCL/RDMA achieves microsecond all-reduce latency, whereas llama.cpp RPC shows measurable slowdowns; architectural changes would be required to match this efficiency.
  • Heterogeneous GPU setups (mixed architectures) will require native framework support rather than VM workarounds as model sizes exceed 100B parameters.
  • VM passthrough solves driver conflicts but introduces virtualization overhead; frameworks like vLLM already support mixed-precision and multi-architecture tensor splitting natively.

โณ Timeline

2024-05
llama.cpp RPC mode stabilized with SO_REUSEADDR socket fix and mixed CPU/GPU offload support improvements
2025-01
RISC-V vector support added for SSM scan operations (46% speedup); AMD EPYC tiled flash attention optimization deployed
2025-08
rocWMMA flash attention library integration for RDNA3+ and CDNA architectures; performance benchmarking on AMD 7900 XTX and ROCm 6.3.4
2025-09
Vulkan backend established as primary AMD GPU inference path; demonstrated parity with ROCm for LLM inference on consumer GPUs
2026-01
WebGPU backend refactored with software pipelining and vectorization for flash attention; multi-GPU CUDA crash issues reported and addressed
2026-03
AMD Ryzen AI Max+ cluster support documented with llama.cpp RPC sharding across nodes; Lemonade SDK nightly builds with ROCm 7 acceleration


AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗