
Google: Longer CoT Hurts Accuracy (-0.54)

🦙Read original on Reddit r/LocalLLaMA
#chain-of-thought #deep-thinking #deep-thinking-ratio-(dtr)

💡Google's method cuts CoT compute by 50%+ while improving accuracy: a big win for local runs

⚡ 30-Second TL;DR

What Changed

Token length-accuracy correlation averages -0.54 on AIME/HMMT/GPQA benchmarks.

Why It Matters

Lets local inference terminate poor reasoning paths early, freeing compute for more attempts. Multi-agent systems and cloud tools stand to benefit as well.

What To Do Next

Read the arXiv paper and prototype Think@n filtering in your local LLM inference code.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 4 cited sources.

🔑 Enhanced Key Takeaways

  • Research collaboration involves University of Virginia alongside Google AI, challenging the conventional 'longer CoT is better' paradigm in LLM reasoning.[1]
  • DTR is defined as the proportion of deep-thinking tokens (those whose predictions are revised significantly in deeper layers) and shows an average positive correlation of r=0.683 with accuracy across models such as DeepSeek-R1-70B and Qwen3-30B-Thinking.[1][2]
  • DTR metric outperforms length-based (r=-0.59) and confidence-based baselines consistently on benchmarks including AIME 2024/2025, HMMT 2025, and GPQA-Diamond.[3]
  • Settling threshold g (e.g., 0.5 or 0.75) critically impacts DTR-accuracy correlation strength, with stricter thresholds yielding more robust positive slopes than softer ones like g=0.25.[3]
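
To make the correlation figures above concrete, here is a minimal sketch of how a Pearson r between a per-sample metric (DTR or token length) and accuracy could be computed. The per-bin numbers are made up for illustration and are not taken from the paper; grouping samples into bins is also an assumption about the setup.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-bin values, invented purely for illustration.
mean_dtr_per_bin = [0.10, 0.15, 0.20, 0.25, 0.30]
mean_len_per_bin = [9000, 7000, 6500, 5000, 4000]
accuracy_per_bin = [0.35, 0.45, 0.50, 0.65, 0.70]

r_dtr = pearson_r(mean_dtr_per_bin, accuracy_per_bin)  # positive on this toy data
r_len = pearson_r(mean_len_per_bin, accuracy_per_bin)  # negative on this toy data
print(f"r(DTR, accuracy) = {r_dtr:.3f}")
print(f"r(length, accuracy) = {r_len:.3f}")
```

A positive r means accuracy tends to rise with the metric; a negative r, as reported for token length, means longer traces tend to be less accurate.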

🛠️ Technical Deep Dive

  • Deep-thinking tokens identified when internal predictions undergo significant revisions in deeper model layers before convergence, quantified via settling threshold g (optimal at 0.5-0.75) and depth fraction ρ for late layers.[3]
  • DTR computed as percentage of deep-thinking tokens in full sequence; robust to ρ variations but sensitive to g, where softer g=0.25 flattens correlation trends.[3]
  • Evaluated on models including GPT-OSS-120B, DeepSeek-R1-70B, and Qwen3-30B-Thinking across the AIME 2024/2025, HMMT 2025, and GPQA-Diamond benchmarks.[3]
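
The DTR description above can be sketched roughly as follows. This is an illustrative reading, not the paper's implementation: it assumes each token comes with one next-token probability distribution per layer, measures distance to the final layer with Jensen-Shannon divergence, and treats a token as deep-thinking when it first comes within g of the final prediction at or after layer ceil(ρ·L). The function names, the choice of divergence, and the default ρ=0.85 are all assumptions.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (range 0..1) between two distributions."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def settling_layer(layer_dists, g):
    """1-based index of the first layer within g of the final layer's prediction."""
    final = layer_dists[-1]
    for idx, dist in enumerate(layer_dists):
        if js_divergence(dist, final) <= g:
            return idx + 1
    return len(layer_dists)  # fallback; the final layer always matches itself

def is_deep_thinking(layer_dists, g=0.5, rho=0.85):
    """Assumed rule: the token settles only in the late layers (>= ceil(rho * L))."""
    late_start = math.ceil(rho * len(layer_dists))
    return settling_layer(layer_dists, g) >= late_start

def deep_thinking_ratio(tokens, g=0.5, rho=0.85):
    """Fraction of tokens in the sequence flagged as deep-thinking."""
    if not tokens:
        raise ValueError("need at least one token")
    return sum(is_deep_thinking(t, g, rho) for t in tokens) / len(tokens)

# Toy 4-layer model over a 3-word vocabulary (hypothetical data).
A = [0.98, 0.01, 0.01]
B = [0.01, 0.01, 0.98]
early_token = [B, B, B, B]  # prediction already matches the final layer at layer 1
late_token = [A, A, A, B]   # prediction flips only at the last layer

dtr = deep_thinking_ratio([early_token, early_token, early_token, late_token])
print(dtr)
```

In a real model the per-layer distributions would have to come from projecting intermediate hidden states through the output head, which this sketch does not cover.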

🔮 Future Implications

AI analysis grounded in cited sources

Think@n will become standard for LLM inference scaling by reducing costs 50% without accuracy loss
It matches or exceeds Cons@n majority voting by early-filtering high-DTR paths, as validated across multiple benchmarks and models.[1][2]
DTR metric will replace token length for evaluating reasoning quality in benchmarks
DTR's superior r=0.683 correlation with accuracy outperforms length's negative r=-0.59 and other baselines consistently.[3]
Optimal CoT length follows inverted-U curve due to error accumulation beyond peak
Prior work (Wu et al., 2025) confirms that performance deteriorates past the optimal length because of overthinking, which aligns with the DTR findings.[3]
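
As a rough illustration of the Think@n idea described above, the sketch below ranks sampled reasoning paths by a DTR estimate, keeps only the top share, and majority-votes over the survivors, next to a plain Cons@n-style vote over all samples. The function names, the `keep_fraction` parameter, and the sample data are hypothetical; the source does not specify how many paths are kept or how DTR is estimated early.

```python
import math
from collections import Counter

def cons_at_n(answers):
    """Plain majority vote over all sampled answers (Cons@n-style baseline)."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]

def think_at_n(samples, keep_fraction=0.5):
    """Keep the highest-DTR samples, then majority-vote over the survivors.

    samples: list of (dtr_estimate, final_answer) pairs.
    keep_fraction: hypothetical knob for how many paths survive the filter.
    """
    if not samples:
        raise ValueError("need at least one sample")
    keep = max(1, math.ceil(len(samples) * keep_fraction))
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)
    return cons_at_n([answer for _, answer in ranked[:keep]])

# Hypothetical samples: three low-DTR paths agree on a wrong answer.
samples = [
    (0.30, "42"), (0.28, "42"), (0.25, "7"),
    (0.12, "17"), (0.10, "17"), (0.09, "17"),
]

baseline = cons_at_n([answer for _, answer in samples])
filtered = think_at_n(samples, keep_fraction=0.5)
print(baseline, filtered)
```

In an actual inference loop the compute saving would come from stopping the discarded paths early instead of decoding them to completion; this sketch only shows the selection step.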

Timeline

2026-02
Google AI and University of Virginia release 'Measuring LLM Reasoning Effort via Deep-Thinking Tokens' paper on arXiv introducing DTR and Think@n.
Original source: Reddit r/LocalLLaMA