Google: Longer CoT Hurts Accuracy (-0.54)
💡 Google method cuts CoT compute by 50%+ while raising accuracy, a big win for local runs
⚡ 30-Second TL;DR
What Changed
Across the AIME, HMMT, and GPQA benchmarks, the correlation between token length and accuracy averages -0.54.
Why It Matters
Enables early termination of poor reasoning paths during local inference, which frees compute for additional attempts. Multi-agent systems and cloud tools stand to benefit as well.
What To Do Next
Read the arXiv paper and prototype Think@n filtering in your local LLM inference code.
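A starting point for such a prototype is sketched below. This summary does not spell out how Think@n works, so the code is only one plausible reading: sample n candidate answers, score each with a reasoning-quality metric, keep the best-scoring fraction, and majority-vote over the survivors. The function name `think_at_n`, the `keep_fraction` parameter, and the scores in the usage example are all hypothetical placeholders, not taken from the paper.

```python
import math
from collections import Counter


def think_at_n(candidates, keep_fraction=0.5):
    """Pick an answer from scored candidates (hypothetical helper).

    candidates: list of (answer, score) pairs, where a higher score means
    the reasoning trace looked more promising.
    keep_fraction: share of candidates kept before voting (assumed value).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # Rank candidates from most to least promising.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    # Always keep at least one candidate.
    keep = max(1, math.ceil(len(ranked) * keep_fraction))
    # Majority vote over the surviving answers only.
    votes = Counter(answer for answer, _ in ranked[:keep])
    return votes.most_common(1)[0][0]


# Toy usage with made-up scores: a plain majority vote would favor "7",
# but filtering by score first favors "42".
samples = [("42", 0.9), ("42", 0.8), ("7", 0.2), ("7", 0.1), ("7", 0.15)]
print(think_at_n(samples))
```

In a real local setup the score would be computed while a trace is still being generated, so that low-scoring traces can be stopped early instead of filtered after the fact.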
🧠 Deep Insight
Web-grounded analysis with 4 cited sources.
🔑 Enhanced Key Takeaways
- The research is a collaboration between Google AI and the University of Virginia, and it challenges the conventional 'longer CoT is better' paradigm in LLM reasoning.[1]
- DTR (Deep Thinking Ratio) is the proportion of deep-thinking tokens, meaning tokens whose predictions are revised significantly in deeper layers. It shows an average positive correlation of r=0.683 with accuracy across models such as DeepSeek-R1-70B and Qwen3-30B-Thinking.[1][2]
- The DTR metric consistently outperforms length-based (r=-0.59) and confidence-based baselines on benchmarks including AIME 2024/2025, HMMT 2025, and GPQA-Diamond.[3]
- The settling threshold g strongly affects the strength of the DTR-accuracy correlation: stricter thresholds (e.g., 0.5 or 0.75) yield more robust positive slopes than softer ones such as g=0.25.[3]
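The r values quoted above are correlation coefficients. As a rough illustration of how such a number can be obtained, the sketch below computes a Pearson correlation between a per-group metric (token length or DTR) and accuracy. The helper name `pearson_r` and the toy numbers are invented for illustration only; they are not data from the paper, and the paper's exact grouping procedure is not described in this summary.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder values, NOT results from the paper.
toy_metric = [1.0, 2.0, 3.0]
toy_accuracy = [0.2, 0.4, 0.6]
print(pearson_r(toy_metric, toy_accuracy))
```

A value near +1 means accuracy rises with the metric, a value near -1 means accuracy falls as the metric grows, which is the pattern reported here for token length.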
🛠️ Technical Deep Dive
- Deep-thinking tokens are identified when the model's internal predictions undergo significant revisions in deeper layers before converging. This is quantified via a settling threshold g (optimal at 0.5-0.75) and a depth fraction ρ for late layers.[3]
- DTR is computed as the percentage of deep-thinking tokens in the full sequence. It is robust to variations in ρ but sensitive to g: a softer g=0.25 flattens the correlation trends.[3]
- Evaluated on models including GPT-OSS-120B, DeepSeek-R1-70B, and Qwen3-30B-Thinking across the AIME'24/'25, HMMT'25, and GPQA-Diamond benchmarks.[3]
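The definitions above can be turned into a small sketch. It assumes that each token comes with one predicted probability distribution per layer, that the distance between an intermediate layer and the final layer is measured with Jensen-Shannon divergence, and that a token counts as deep-thinking when it first settles within g of the final prediction at a depth fraction of at least ρ. The choice of divergence, the function names, and the `rho=0.75` value used in the example are assumptions for illustration; the summary above does not specify them.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, so the value lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def settling_layer(layer_dists, g):
    """0-based index of the first layer within distance g of the final layer."""
    final = layer_dists[-1]
    for index, dist in enumerate(layer_dists):
        if js_divergence(dist, final) <= g:
            return index
    return len(layer_dists) - 1  # unreachable: the final layer has distance 0


def deep_thinking_ratio(token_layer_dists, g, rho):
    """Fraction of tokens that settle only in the late layers.

    token_layer_dists: one entry per token, each a list of per-layer
    probability distributions (first layer to last layer).
    g: settling threshold; rho: depth fraction marking the late layers.
    """
    if not token_layer_dists:
        return 0.0
    deep = 0
    for layer_dists in token_layer_dists:
        depth = (settling_layer(layer_dists, g) + 1) / len(layer_dists)
        if depth >= rho:
            deep += 1
    return deep / len(token_layer_dists)


# Toy example with a 2-word vocabulary and 4 layers.
shallow_token = [[1.0, 0.0]] * 4                      # settled from layer 1
deep_token = [[1.0, 0.0]] * 3 + [[0.0, 1.0]]          # flips at the last layer
print(deep_thinking_ratio([shallow_token, deep_token], g=0.5, rho=0.75))
```

With a real model, the per-layer distributions would have to come from projecting intermediate hidden states onto the vocabulary, which this sketch leaves out.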
📎 Sources (4)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- marktechpost.com — A New Google AI Research Proposes Deep Thinking Ratio to Improve LLM Accuracy While Cutting Total Inference Costs by Half
- aidevsignals.com — Google AI Research Proposes Deep Thinking Ratio to Improve LLM Accuracy
- arXiv — 2602
- thiqaflow.com — A New Google AI Research Proposes Deep Thinking Ratio to Improve LLM Accuracy While Cutting Total Inference Costs by Half
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA