
Fix Qwen3.5 Overthinking with Budget Flags

🦙 Read original on Reddit r/LocalLLaMA

💡 End Qwen3.5's endless thinking loops with two simple llama.cpp flags

⚡ 30-Second TL;DR

What Changed

llama.cpp's --reasoning-budget flag (e.g. --reasoning-budget 4096) stops reasoning once the token threshold is reached

Why It Matters

This quick hack makes Qwen3.5 far more usable in inference engines by preventing runaway verbose reasoning output, which is especially valuable for real-time local deployments.

What To Do Next

Run llama-server with --reasoning-budget 1024 --reasoning-budget-message '. Okay enough thinking.' for Qwen3.5.
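The invocation above can be sketched as a small helper that assembles the argv list. This is a minimal sketch: llama-server being on PATH and the model filename are assumptions; only the two --reasoning-budget* flags come from the post.

```python
# Minimal sketch: build the llama-server argv from the post's flags.
# "llama-server" on PATH and the GGUF filename are assumptions; only
# --reasoning-budget and --reasoning-budget-message come from the post.
import shlex

def build_server_cmd(model_path: str,
                     budget: int = 1024,
                     budget_message: str = ". Okay enough thinking.") -> list[str]:
    """Return an argv list that caps Qwen3.5's reasoning tokens."""
    return [
        "llama-server",
        "-m", model_path,
        "--reasoning-budget", str(budget),             # stop thinking at N tokens
        "--reasoning-budget-message", budget_message,  # text injected at the cap
    ]

cmd = build_server_cmd("Qwen3.5.gguf")  # hypothetical model file
print(shlex.join(cmd))
```

Print it for copy-paste into a shell, or hand the list directly to `subprocess.run(cmd)`.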

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 6 cited sources.

🔑 Enhanced Key Takeaways

  • Llama.cpp's reasoning budget uses a sampler mechanism to count tokens during reasoning phases and terminates generation upon reaching the set threshold, enhancing control over o1-like chain-of-thought processes[2].
  • The feature sidesteps KV cache inconsistencies seen in MLX on Mac; llama.cpp is recommended over MLX for stable performance during branched conversations[1].
  • Recent llama.cpp updates fixed bugs in Qwen3-Coder-Next GGUF calculations that caused looping issues, requiring model re-downloads for compatibility[3].
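The sampler-based counting described above can be illustrated with a short Python sketch. This is conceptual only, not llama.cpp's actual C++ sampler: the `<think>`/`</think>` tag strings, the function name, and operating on a pre-built token list are all assumptions for illustration.

```python
# Conceptual sketch of a reasoning-budget sampler (NOT llama.cpp's real
# C++ code): count tokens emitted inside the reasoning span and force
# the closing tag once the budget is spent, dropping the overflow.

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"  # assumed tag strings

def apply_reasoning_budget(tokens, budget):
    """Return the token stream with the reasoning span capped at `budget`."""
    out = []
    in_reasoning = False   # are we between <think> and </think>?
    spent = 0              # reasoning tokens emitted so far
    closed_early = False   # did we already force </think>?
    for tok in tokens:
        if tok == THINK_OPEN:
            in_reasoning = True
            out.append(tok)
        elif tok == THINK_CLOSE:
            if not closed_early:       # avoid a duplicate closing tag
                out.append(tok)
            in_reasoning = False
            closed_early = False
        elif in_reasoning:
            if closed_early:
                continue               # drop thinking tokens past the budget
            if spent < budget:
                spent += 1
                out.append(tok)
            else:
                out.append(THINK_CLOSE)  # budget hit: force end of thinking
                closed_early = True
        else:
            out.append(tok)            # answer tokens pass through untouched
    return out
```

In the real implementation the sampler steers generation live rather than filtering a finished stream, but the bookkeeping (count inside the span, terminate at the threshold) is the same idea.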

๐Ÿ› ๏ธ Technical Deep Dive

  • The reasoning budget is implemented via sampler token counting in llama.cpp, replacing a prior stub; it enforces termination at the budget limit, e.g. 4096 tokens[2].
  • Tested effectively on the Qwen3.5-35B MoE variant (A3B active params), where budgets improved decision efficiency without quality loss[2][1].
  • ROCm support is confirmed for AMD GPUs like the Radeon 7900 XTX, though MXFP4 quantization remains limited to NVIDIA[1].
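The --reasoning-budget-message behavior can be roughed out the same way: once the thinking trace exceeds the budget, truncate it and splice in the closing phrase so the model pivots to its answer. A hedged sketch, assuming whitespace-delimited words stand in for tokens (the function is illustrative, not llama.cpp code):

```python
# Illustrative sketch of the budget-message splice (not llama.cpp code).
# "Token" here means a whitespace-delimited word, a deliberate
# simplification of real tokenizer behavior.

def cap_thinking(trace: str, budget: int,
                 message: str = ". Okay enough thinking.") -> str:
    """Truncate a thinking trace at `budget` words and append the message."""
    words = trace.split()
    if len(words) <= budget:
        return trace                      # under budget: leave untouched
    return " ".join(words[:budget]) + message  # cut and splice in the phrase
```

The default message string is the one from the post; the injected sentence gives the model a natural-sounding cue that deliberation is over.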

🔮 Future Implications

AI analysis grounded in cited sources.

  • Llama.cpp reasoning budgets will standardize in local inference frameworks by mid-2026: the recent sampler-based implementation and successful Qwen3.5 tests indicate maturing feature adoption across backends like vLLM[2][3].
  • MoE models like Qwen3.5-A3B will see 20-30% inference speed gains with budget controls: user benchmarks show tuned setups achieving 27-29 t/s on Qwen3.5-122B, with budgets mitigating overthinking overhead[5].

โณ Timeline

2026-02: Llama.cpp fixes Qwen3-Coder-Next GGUF calculation bug causing loops
2026-02: Llama.cpp improves tool-calling parsing for Qwen3 models
2026-03: Llama.cpp introduces full reasoning budget via sampler mechanism

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗