
Fix Qwen 3.5 Instruct Prompt Reprocessing

🦙 Read original on Reddit r/LocalLLaMA

💡 Stops wasteful prompt reprocessing in Qwen 3.5 instruct mode – faster local chats.

⚡ 30-Second TL;DR

What Changed

Fixes full reprocessing of the last message in instruct mode

Why It Matters

Improves efficiency in long conversations by preventing unnecessary prompt reprocessing, saving compute in local inference setups.

What To Do Next

Save the template as chat_template.jinja and run llama.cpp with --chat-template-file chat_template.jinja.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • Qwen 3.5 employs a hybrid attention + Mamba2/SSM architecture in models like the 27B, contributing to the recurrent nature that triggers full prompt reprocessing in llama.cpp[1].
  • Official Hugging Face documentation specifies disabling thinking mode via "chat_template_kwargs": {"enable_thinking": False} in API calls for direct responses without <think> blocks[3].
  • The issue is corroborated in multiple llama.cpp GitHub issues (#20225 and #19858), with workarounds like temporarily disabling mmproj resolving reprocessing even in vision-enabled setups[1][7].
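The API-side toggle cited above can be exercised by building the request body explicitly. Below is a minimal Python sketch, assuming an OpenAI-compatible endpoint such as llama.cpp's llama-server; the helper name build_chat_payload and the model id are illustrative, not official:

```python
import json

def build_chat_payload(messages, enable_thinking=False):
    """Build a chat-completion request body that disables Qwen 3.5's
    default thinking mode via chat_template_kwargs (per the Hugging Face
    docs referenced above)."""
    return {
        "model": "qwen3.5-instruct",  # placeholder model id
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

payload = build_chat_payload([{"role": "user", "content": "Hello"}])
print(json.dumps(payload, indent=2))
```

POSTing this body to the server's /v1/chat/completions route should yield direct responses without <think> blocks, assuming the server honors chat_template_kwargs.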

🛠️ Technical Deep Dive

  • Qwen3.5 models default to thinking mode, emitting <think>\n...\n\n</think> before the final response; instruct mode requires the explicit 'enable_thinking': False parameter[3].
  • The hybrid architecture in Qwen3.5-27B combines attention with Mamba2/SSM (State Space Model) layers, enabling efficient long-context handling up to 262,144 tokens but causing recurrent-state re-evaluation in llama.cpp[1][3].
  • Unsloth provides optimized inference settings for Qwen3.5: instruct mode uses temperature=0.7, top_p=0.8, presence_penalty=1.5; its chat-template fixes improve tool calling universally across GGUF formats[4].
  • The llama.cpp bug manifests as full prompt re-evaluation on every turn because empty think blocks are stripped from earlier messages, changing the rendered prefix; the custom Jinja template preserves empty <think> tags while stripping only content-filled ones[1][7].
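The prefix-stability idea behind the template fix – keep an empty <think> marker, drop only the reasoning content, so the rendered conversation prefix stays byte-identical across turns – can be illustrated with a small regex sketch. This is an illustration of the concept in Python, not the actual Jinja template:

```python
import re

# Match a whole think block, including its content, across newlines.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def normalize_think(text: str) -> str:
    """Replace any think block (filled or empty) with a canonical empty
    <think>\n\n</think> marker, so earlier assistant turns render the same
    way every time and llama.cpp's prefix cache stays valid."""
    return THINK_RE.sub("<think>\n\n</think>", text)

before = "<think>\nlet me reason...\n</think>The answer is 4."
print(normalize_think(before))  # empty think marker, then the final answer
```

If the template instead deleted the block outright on later turns, the prompt prefix would differ from what was cached, forcing full re-evaluation.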

🔮 Future Implications

AI analysis grounded in cited sources.

llama.cpp will merge an official PR fixing Qwen3.5 reprocessing, reducing the need for custom templates.
Multiple GitHub issues reference ongoing PRs that address the root cause even with vision/mmproj enabled, as confirmed by community testing[1][7].
Qwen3.5 instruct-mode optimizations will standardize across inference engines such as vLLM and SGLang.
Official docs recommend engine parameters for RoPE scaling and disabling thinking, indicating broader ecosystem compatibility beyond llama.cpp[3].

โณ Timeline

2026-01
Qwen3.5 release with native multimodal agents and default thinking mode
2026-02
Qwen3.5-Plus snapshot with 1M context and adaptive tool use in thinking mode
2026-03
llama.cpp issues #19858/#20225 opened, reporting the full prompt reprocessing bug


AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗