
Qwen3.5-35B Crashes with Claude Code on llama.cpp

Original source: Reddit r/LocalLLaMA

Qwen3.5-35B + llama.cpp + Claude Code crashes: fix or workaround?

30-Second TL;DR

What Changed

Qwen3.5-35B-A3B on llama.cpp crashes or unloads on the 2nd or 3rd prompt when driven by Claude Code.

Why It Matters

Highlights compatibility issues for Qwen3.5-35B-A3B in code interpreter workflows on llama.cpp, pushing users toward alternatives like OpenCode.

What To Do Next

Switch from Claude Code to OpenCode when running Qwen3.5-35B-A3B on llama.cpp to avoid the crashes.

Who should care: Developers & AI Engineers

Deep Insight

Web-grounded analysis with 9 cited sources.

Enhanced Key Takeaways

  • Qwen3.5 models in llama.cpp force full prompt reprocessing on every generation due to KV cache truncation failures, leading to inefficiency in multi-turn conversations[1][8].
  • Multiple GitHub issues document Qwen3-Coder-Next crashes from empty grammar stacks during long-context (80K+ tokens) tool calls across CUDA, ROCm, and Vulkan backends[1][2].
  • ROCm support for newer Qwen3.5 models is broken in recent llama.cpp builds, causing failures while Vulkan works reliably[9].
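Why a truncation failure turns every turn into a full reprocess can be illustrated with a toy model of prompt caching. This is a minimal sketch under stated assumptions, not llama.cpp's implementation: `ToyKVCache`, `tokens_to_process`, and the `supports_truncation` switch are hypothetical names. The idea is that the server keeps the longest shared prefix between the cached tokens and the new prompt, drops the cached tail past that point (needed when earlier turns are re-rendered differently), and evaluates only the new suffix; if dropping the tail fails, it has to clear the cache and evaluate the whole prompt.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a server-side token cache (not llama.cpp's API)."""
    tokens: list[int] = field(default_factory=list)
    # Assumption for illustration: some caches cannot drop entries from the end
    # and only support a full reset, so a partial truncation can fail.
    supports_truncation: bool = True

    def truncate(self, n: int) -> bool:
        """Try to keep only the first n cached tokens; return False on failure."""
        if n == len(self.tokens):
            return True  # nothing to drop, so this always succeeds
        if not self.supports_truncation:
            return False  # analogous to a 'failed to truncate' error
        del self.tokens[n:]
        return True


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_to_process(cache: ToyKVCache, prompt: list[int]) -> int:
    """Update the cache for a new prompt and return how many tokens must be evaluated."""
    keep = common_prefix_len(cache.tokens, prompt)
    if not cache.truncate(keep):
        # Truncation failed: discard the cache and reprocess the full prompt.
        cache.tokens.clear()
        keep = 0
    cache.tokens.extend(prompt[keep:])
    return len(prompt) - keep
```

With truncation working, `tokens_to_process(ToyKVCache(tokens=[1, 2, 3, 4, 5]), [1, 2, 3, 9, 9])` only has to evaluate the 2 diverging tokens; with `supports_truncation=False`, the same request evaluates all 5. In a long agent session, that difference grows to the entire conversation history on every turn.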

๐Ÿ› ๏ธ Technical Deep Dive

  • A Qwen3.5moe bug triggers a full prompt reprocess after each message because of 'failed to truncate' errors in the token cache, as confirmed in llama.cpp issue #19858[1][8].
  • Long-prompt crashes (20K+ tokens) with Qwen3.5MoE on multi-GPU setups stem from CUDA graph optimizations; the fix replaces in-place tensor ops with non-in-place versions[1].
  • Qwen3-Coder-Next GGUF fails on Windows with unknown-architecture errors and crashes on tool calls that contain invalid JSON with duplicate fields[1].
  • Tool calling requires specific flags such as --tool-call-parser qwen3_coder and higher quantization (Q6_K or above) for proper schema adherence[4].
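One of the reported tool-call crashes involves JSON arguments with duplicate fields. Python's `json` module does not reject these; it silently keeps the last value for a repeated key, so a client that wants to catch such output has to opt in. A minimal sketch, assuming the tool-call arguments arrive as a JSON string; `parse_tool_arguments` and `DuplicateKeyError` are hypothetical names, not part of llama.cpp or any coding agent's API. It only detects the malformed output and does not fix the underlying bug.

```python
import json
from typing import Any


class DuplicateKeyError(ValueError):
    """Hypothetical error raised when a JSON object repeats a key."""


def _reject_duplicates(pairs: list[tuple[str, Any]]) -> dict[str, Any]:
    # json.loads calls this hook for every decoded object, including nested ones,
    # with the (key, value) pairs in document order.
    obj: dict[str, Any] = {}
    for key, value in pairs:
        if key in obj:
            raise DuplicateKeyError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj


def parse_tool_arguments(raw: str) -> dict[str, Any]:
    """Parse tool-call arguments, rejecting duplicate keys instead of
    silently keeping the last value (Python's default behaviour)."""
    return json.loads(raw, object_pairs_hook=_reject_duplicates)
```

For comparison, plain `json.loads('{"path": "a.py", "path": "b.py"}')` returns `{'path': 'b.py'}`, while `parse_tool_arguments` raises `DuplicateKeyError` on the same input, so the agent can retry or report the malformed call instead of acting on an ambiguous argument.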

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

llama.cpp b8179+ will patch Qwen3.5 full reprocessing via improved KV cache truncation
Ongoing GitHub issues such as #19858 identify the truncation failure as the root cause, and community bisecting and patches have already resolved similar bugs in earlier builds[1][8].
Qwen3.5 stability on ROCm will lag Vulkan by 1-2 months
Issue #19880 confirms ROCm breakage for new Qwen models while Vulkan succeeds, mirroring patterns in prior Qwen3-Coder-Next ROCm crashes fixed via targeted updates[1][9].

โณ Timeline

2026-02: llama.cpp weekly report documents the Qwen3.5moe full-reprocess bug and long-prompt multi-GPU crashes
2026-02: Qwen3-Coder-Next GGUF crashes reported on ROCm, on Windows, and during tool calls in llama.cpp issues
2026-02: Issue #19858 opened: Qwen3.5 forces full prompt reprocessing due to cache truncation failure
2026-02: Issue #19860 opened: CUDA errors and crashes with Qwen3.5-27B in llama-bench/server
2026-02: Issue #19880 opened: ROCm support broken for newer Qwen3.5 models in llama.cpp
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.