
Fix Qwen 3.5 Instruct Prompt Reprocessing

🦙 Read original on Reddit r/LocalLLaMA

💡 Stops wasteful prompt reprocessing in Qwen 3.5 instruct mode – faster local chats.

⚡ 30-Second TL;DR

What Changed

Fixes full reprocessing of the last message in instruct mode

Why It Matters

Improves efficiency in long conversations by preventing unnecessary prompt reprocessing, saving compute in local inference setups.

What To Do Next

Save the template as chat_template.jinja and run llama.cpp with --chat-template-file chat_template.jinja.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • Qwen 3.5 employs a hybrid attention + Mamba2/SSM architecture in models like the 27B, contributing to the recurrent nature that triggers full prompt reprocessing in llama.cpp[1].
  • Official Hugging Face documentation specifies disabling thinking mode via "chat_template_kwargs": {"enable_thinking": False} in API calls for direct responses without <think> blocks[3].
  • The issue is corroborated in multiple llama.cpp GitHub issues (#20225 and #19858), with workarounds like temporarily disabling mmproj resolving reprocessing even in vision-enabled setups[1][7].
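The API-side toggle cited above can be exercised by building the request body explicitly. Below is a minimal Python sketch, assuming an OpenAI-compatible endpoint such as llama.cpp's llama-server; the helper name build_chat_payload and the model id are illustrative, not official:

```python
import json

def build_chat_payload(messages, enable_thinking=False):
    """Build a chat-completion request body that disables Qwen 3.5's
    default thinking mode via chat_template_kwargs (per the Hugging Face
    docs referenced above)."""
    return {
        "model": "qwen3.5-instruct",  # placeholder model id
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

payload = build_chat_payload([{"role": "user", "content": "Hello"}])
print(json.dumps(payload, indent=2))
```

POSTing this body to the server's /v1/chat/completions route should yield direct responses without <think> blocks, assuming the server honors chat_template_kwargs.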

🛠️ Technical Deep Dive

  • Qwen3.5 models default to thinking mode, emitting <think>\n...\n\n</think> before the final response; instruct mode requires the explicit 'enable_thinking': False parameter[3].
  • The hybrid architecture in Qwen3.5-27B combines attention with Mamba2/SSM (State Space Model) layers, enabling efficient long-context handling up to 262,144 tokens but causing recurrent-state re-evaluation in llama.cpp[1][3].
  • Unsloth provides optimized inference settings for Qwen3.5: instruct mode uses temperature=0.7, top_p=0.8, presence_penalty=1.5; its chat-template fixes improve tool calling universally across GGUF formats[4].
  • The llama.cpp bug manifests as full prompt re-evaluation on every turn because empty think blocks are stripped from earlier messages, changing the rendered prefix; the custom Jinja template preserves empty <think> tags while stripping only content-filled ones[1][7].
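The prefix-stability idea behind the template fix – keep an empty <think> marker, drop only the reasoning content, so the rendered conversation prefix stays byte-identical across turns – can be illustrated with a small regex sketch. This is an illustration of the concept in Python, not the actual Jinja template:

```python
import re

# Match a whole think block, including its content, across newlines.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def normalize_think(text: str) -> str:
    """Replace any think block (filled or empty) with a canonical empty
    <think>\n\n</think> marker, so earlier assistant turns render the same
    way every time and llama.cpp's prefix cache stays valid."""
    return THINK_RE.sub("<think>\n\n</think>", text)

before = "<think>\nlet me reason...\n</think>The answer is 4."
print(normalize_think(before))  # empty think marker, then the final answer
```

If the template instead deleted the block outright on later turns, the prompt prefix would differ from what was cached, forcing full re-evaluation.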

🔮 Future Implications

AI analysis grounded in cited sources.

llama.cpp will merge an official PR fixing Qwen3.5 reprocessing, reducing the need for custom templates.
Multiple GitHub issues reference ongoing PRs that address the root cause even with vision/mmproj enabled, as confirmed by community testing[1][7].
Qwen3.5 instruct-mode optimizations will standardize across inference engines such as vLLM and SGLang.
Official docs recommend engine parameters for RoPE scaling and disabling thinking, indicating broader ecosystem compatibility beyond llama.cpp[3].

โณ Timeline

2026-01
Qwen3.5 release with native multimodal agents and default thinking mode
2026-02
Qwen3.5-Plus snapshot with 1M context and adaptive tool use in thinking mode
2026-03
llama.cpp issues #19858/#20225 opened, reporting the full prompt reprocessing bug


AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗