Fix Qwen 3.5 Instruct Prompt Reprocessing
Stops wasteful prompt reprocessing in Qwen 3.5 instruct mode, for faster local chats.
30-Second TL;DR
What Changed
Fixes reprocessing of the last message in instruct mode.
Why It Matters
Improves efficiency in long conversations by preventing unnecessary prompt reprocessing, saving compute in local inference setups.
What To Do Next
Save template as chat_template.jinja and run llama.cpp with --chat-template-file chat_template.jinja.
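A minimal command-line sketch of that step: the model filename below is a placeholder, and this assumes a llama.cpp build that provides the `llama-server` binary and its `--chat-template-file` flag.

```shell
# Placeholder model path; substitute your own Qwen 3.5 GGUF file.
# --chat-template-file overrides the template baked into the GGUF
# with the fixed chat_template.jinja saved in the current directory.
llama-server \
  -m ./Qwen3.5-27B-Instruct-Q4_K_M.gguf \
  --chat-template-file chat_template.jinja
```

The same flag also works with `llama-cli` for one-off interactive sessions.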
Deep Insight
Web-grounded analysis with 8 cited sources.
Enhanced Key Takeaways
- Qwen 3.5 employs a hybrid attention + Mamba2/SSM architecture in models like the 27B, contributing to the recurrent state handling that triggers full prompt reprocessing in llama.cpp[1].
- Official Hugging Face documentation specifies disabling thinking mode via 'chat_template_kwargs': {"enable_thinking": False} in API calls for direct responses without <think> blocks[3].
- The issue is corroborated in multiple llama.cpp GitHub issues (#20225 and #19858), with workarounds like temporarily disabling mmproj resolving the reprocessing even for vision-enabled setups[1][7].
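The `enable_thinking` switch mentioned above can be sketched as an OpenAI-compatible request payload. The endpoint URL and model id are placeholders; only the `chat_template_kwargs` key with `"enable_thinking": False` reflects the documented behavior.

```python
import json

# Sketch of a /v1/chat/completions request body that disables Qwen's
# thinking mode, so the reply contains no <think>...</think> block.
payload = {
    "model": "qwen3.5-instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)
# e.g. requests.post(f"{base_url}/v1/chat/completions", data=body)
```

Without the `chat_template_kwargs` entry, Qwen 3.5 defaults to thinking mode and emits a reasoning block before the final answer.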
Technical Deep Dive
- Qwen3.5 models default to thinking mode, outputting <think>\n...\n\n</think> before final responses; instruct mode requires the explicit 'enable_thinking': False parameter[3].
- The hybrid architecture in Qwen3.5-27B combines attention with Mamba2/SSM (State Space Models), enabling efficient long-context handling up to 262,144 tokens but causing recurrent eval in llama.cpp[1][3].
- Unsloth provides optimized inference settings for Qwen3.5: instruct mode uses temperature=0.7, top_p=0.8, presence_penalty=1.5; chat-template fixes improve tool calling universally across GGUF formats[4].
- The llama.cpp bug manifests as full prompt re-evaluation per turn due to empty think-block stripping; the custom Jinja template preserves empty <think> tags while stripping content-filled ones[1][7].
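The stripping behavior the template fixes can be illustrated in plain Python. This is not the Jinja template itself, just a sketch of its logic: content-filled <think> blocks from earlier turns are removed, while empty <think></think> tags are left intact so the rendered prompt prefix stays byte-identical across turns and the llama.cpp prefix cache is not invalidated.

```python
import re

def strip_think_blocks(text: str) -> str:
    """Drop <think> blocks that contain reasoning content, but keep
    empty <think></think> tags so the prompt prefix does not change
    between turns (which would force full reprocessing)."""
    def repl(m: re.Match) -> str:
        # Keep the tag pair verbatim if the block is empty/whitespace.
        return m.group(0) if m.group(1).strip() == "" else ""
    return re.sub(r"<think>([\s\S]*?)</think>", repl, text)

# Filled block is stripped; empty tags survive untouched.
print(strip_think_blocks("<think>\nreasoning\n</think>Answer"))   # -> Answer
print(strip_think_blocks("<think></think>Answer"))                # -> <think></think>Answer
```

The key design point is that "strip" must be conditional: unconditionally deleting every <think> pair changes the prompt bytes for non-thinking turns too, which is exactly what triggered the reprocessing bug.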
Future Implications
AI analysis grounded in cited sources.
Timeline
Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA →