Google: Longer CoT Hurts Accuracy (-0.54)

🦙Read original on Reddit r/LocalLLaMA

#chain-of-thought #deep-thinking #deep-thinking-ratio-(dtr)

💡 Google method cuts CoT compute by 50%+ while improving accuracy, a big win for local runs

⚡ 30-Second TL;DR

What Changed

The correlation between token length and accuracy averages -0.54 on the AIME, HMMT, and GPQA benchmarks.

Why It Matters

Enables early termination of weak reasoning paths during local inference, freeing compute for additional attempts. Multi-agent systems and cloud tools stand to benefit as well.

What To Do Next

Read the arXiv paper and prototype Think@n filtering in your local LLM inference code.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 4 cited sources.

🔑 Enhanced Key Takeaways

•The research is a collaboration between Google AI and the University of Virginia and challenges the conventional 'longer CoT is better' paradigm in LLM reasoning.[1]
•DTR is defined as the proportion of deep-thinking tokens (those whose predictions are revised significantly in deeper layers) and shows an average positive correlation of r=0.683 with accuracy across models such as DeepSeek-R1-70B and Qwen3-30B-Thinking.[1][2]
•DTR consistently outperforms length-based (r=-0.59) and confidence-based baselines on benchmarks including AIME 2024/2025, HMMT 2025, and GPQA-Diamond.[3]
•Settling threshold g (e.g., 0.5 or 0.75) critically impacts DTR-accuracy correlation strength, with stricter thresholds yielding more robust positive slopes than softer ones like g=0.25.[3]
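To make the correlation figures above concrete, here is a minimal sketch of how a Pearson r between a per-sample quantity (token length or DTR) and accuracy could be computed. The helper name `pearson_r` and all numbers are illustrative assumptions, not code or data from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values, chosen only to show the sign of each relationship.
accuracy = [0.8, 0.6, 0.4, 0.2]
token_lengths = [1000, 2000, 3000, 4000]  # longer output, lower accuracy
dtr_values = [0.40, 0.30, 0.20, 0.10]     # higher DTR, higher accuracy

r_length = pearson_r(token_lengths, accuracy)
r_dtr = pearson_r(dtr_values, accuracy)
```

Because the toy data is perfectly linear, `r_length` comes out close to -1 and `r_dtr` close to +1; real measurements would be noisier.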

🛠️ Technical Deep Dive

•Deep-thinking tokens are identified when the model's internal predictions undergo significant revisions in deeper layers before converging, quantified via a settling threshold g (optimal at 0.5-0.75) and a depth fraction ρ that marks the late layers.[3]
•DTR is computed as the percentage of deep-thinking tokens in the full sequence; it is robust to variations in ρ but sensitive to g, with the softer g=0.25 flattening the correlation trend.[3]
•Evaluated on models including GPT-OSS-120B, DeepSeek-R1-70B, Qwen3-30B-Thinking across AIME'24/'25, HMMT'25, GPQA-diamond benchmarks.[3]
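The bullets above can be sketched roughly as follows. This is not the paper's implementation: it assumes each token already comes with a per-layer divergence between that layer's prediction and the final-layer prediction, scaled to [0, 1], and it assumes a token counts as deep-thinking when it first drops to or below g at a layer index of at least ρ times the layer count. The function names, the divergence input, and the ρ value used in the example are all hypothetical.

```python
from typing import Sequence


def settling_layer(divergences: Sequence[float], g: float) -> int:
    """Index of the first layer whose divergence from the final prediction is <= g."""
    for layer, divergence in enumerate(divergences):
        if divergence <= g:
            return layer
    # Never dropped below g: treat the token as settling at the last layer.
    return len(divergences) - 1


def deep_thinking_ratio(
    per_token_divergences: Sequence[Sequence[float]], g: float, rho: float
) -> float:
    """Fraction of tokens that settle only in the late layers (assumed rule)."""
    if not per_token_divergences:
        return 0.0
    deep = 0
    for divergences in per_token_divergences:
        # Assumption: "late" means layer index >= rho * number_of_layers.
        if settling_layer(divergences, g) >= rho * len(divergences):
            deep += 1
    return deep / len(per_token_divergences)


# Toy input: 2 tokens x 4 layers of made-up divergences.
toy_divergences = [
    [0.9, 0.4, 0.1, 0.0],  # settles early (layer 1 at g=0.5)
    [0.9, 0.8, 0.7, 0.0],  # settles only at the last layer
]
dtr = deep_thinking_ratio(toy_divergences, g=0.5, rho=0.75)
```

In a real setup the divergences would have to be derived from the model's intermediate hidden states, which is outside the scope of this sketch.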

🔮 Future Implications

AI analysis grounded in cited sources

Think@n will become standard for LLM inference scaling by reducing costs 50% without accuracy loss

Think@n matches or exceeds Cons@n majority voting by keeping high-DTR reasoning paths and filtering out low-DTR ones early, as validated across multiple benchmarks and models.[1][2]
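A minimal sketch of the difference between the two selection rules, assuming each sampled answer already carries a DTR score. The `keep_fraction` parameter, the function names, and the toy scores are hypothetical. The sketch scores completed samples for simplicity; a real pipeline would presumably estimate DTR from a partial generation so that low-scoring samples can be stopped early.

```python
from collections import Counter
from typing import List, Tuple

Candidate = Tuple[str, float]  # (final answer, DTR score)


def cons_at_n(candidates: List[Candidate]) -> str:
    """Plain majority vote over all sampled answers."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return Counter(answer for answer, _ in candidates).most_common(1)[0][0]


def think_at_n(candidates: List[Candidate], keep_fraction: float = 0.5) -> str:
    """Majority vote restricted to the highest-DTR candidates (assumed rule)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    keep = max(1, int(len(candidates) * keep_fraction))
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:keep]
    return cons_at_n(top)


# Made-up samples: the wrong answer "17" is frequent but has low DTR.
samples = [
    ("17", 0.10), ("17", 0.12), ("17", 0.15), ("17", 0.11),
    ("42", 0.30), ("42", 0.28),
]
majority_answer = cons_at_n(samples)   # votes over all six samples
filtered_answer = think_at_n(samples)  # votes over the top three by DTR
```

On this toy input the plain vote picks "17" while the DTR-filtered vote picks "42", which illustrates how filtering can change the outcome while voting over fewer samples.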

DTR metric will replace token length for evaluating reasoning quality in benchmarks

DTR's r=0.683 correlation with accuracy consistently beats token length (r=-0.59) and the other baselines.[3]

Optimal CoT length follows inverted-U curve due to error accumulation beyond peak

Prior work such as Wu et al. (2025) confirms that performance deteriorates past an optimal length because of overthinking, which aligns with the DTR findings.[3]

⏳ Timeline

2026-02

Google AI and University of Virginia release 'Measuring LLM Reasoning Effort via Deep-Thinking Tokens' paper on arXiv introducing DTR and Think@n.

📎 Sources (4)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
