LLMs Advance in Math but Still Fail Basics

Read original on The Register - AI/ML

ORCA results reveal top LLMs' math gaps, which are critical to address when building reliable quantitative AI tools

30-Second TL;DR

What Changed

LLMs improved on the ORCA math benchmark but have not mastered it.

Why It Matters

The results highlight the need for better reasoning in LLMs, because these gaps affect reliability in quantitative applications. Practitioners should prioritize hybrid systems that combine LLMs with calculators or verifiers.
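As a rough illustration of the hybrid idea, the sketch below recomputes an arithmetic expression with a small whitelisted evaluator and checks a model's claimed answer against it. The names `calculate` and `verify` and the tolerance value are assumptions made for this example; they do not come from ORCA or any cited source.

```python
import ast
import math
import operator

# Whitelisted operators for a tiny arithmetic evaluator (illustrative only).
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _is_number(node: ast.AST) -> bool:
    """True for int/float literals; booleans are rejected."""
    return (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    )


def calculate(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node: ast.AST) -> float:
        if _is_number(node):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


def verify(expression: str, llm_answer: float, rel_tol: float = 1e-6) -> bool:
    """Return True when the model's claimed answer matches the recomputed value."""
    return math.isclose(calculate(expression), llm_answer, rel_tol=rel_tol)
```

In a pipeline of this kind, an answer that fails `verify` could be replaced by the calculator's result or sent back to the model for another attempt.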

What To Do Next

Run your LLM on the ORCA benchmark via Hugging Face to quantify its math weaknesses before production deployment.
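Once per-question results are available, a per-domain breakdown makes weaknesses easier to spot. The sketch below assumes a hypothetical list of `(domain, is_correct)` pairs; it does not load ORCA itself, and the sample records in the usage note are made up purely to show the shape of the input.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_domain(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return accuracy (in percent, one decimal place) for each domain."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for domain, is_correct in results:
        totals[domain] += 1
        correct[domain] += int(is_correct)
    return {d: round(100 * correct[d] / totals[d], 1) for d in totals}
```

For example, passing `[("Math & Conversions", True), ("Math & Conversions", False), ("Biology & Chemistry", True)]` (invented records) yields 50.0 for the first domain and 100.0 for the second.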

Who should care: Researchers & Academics

Deep Insight

Web-grounded analysis with 6 cited sources.

Enhanced Key Takeaways

  • ChatGPT 5.2 improved to 54.0% accuracy on ORCA, a 4.6 percentage point gain from prior tests[1].
  • DeepSeek V3.2 saw dramatic gains in Biology & Chemistry, from 10.5% to 43.9% accuracy on ORCA[1].
  • Calculation errors now represent 39.8% of ORCA mistakes, up from 33.4%, while rounding errors decreased[1].
  • Automated math benchmarks like Math-500 suffer from reliability issues due to exact string matching and benchmark contamination[2].
Competitor Analysis
| Model | ORCA Math Score | Change from Prior | Other Domain Notes |
|---|---|---|---|
| Gemini 3 Flash | Top (C-grade, ~60-70%) | Improved | Math & Conversions: 93.2% (+10.2 pts)[1] |
| ChatGPT 5.2 | 54.0% | +4.6 pts | N/A[1] |
| Grok 4.1 | 60.2% | -2.6 pts | Losses in Health & Sports (-9 pts), Bio/Chem (-5.3 pts)[1] |
| Claude Sonnet 4.5 | Improved (not top; exact % not given) | Gained | N/A[1] |
| DeepSeek V3.2 | Improved (not top) | Gained | Bio/Chem: +33.4 pts[1] |
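The changes above are reported in percentage points, which differ from relative change. The short sketch below shows both calculations using figures stated in this article (DeepSeek V3.2 in Biology & Chemistry, and ChatGPT 5.2 overall); the helper names are illustrative.

```python
def point_change(before: float, after: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(after - before, 1)


def relative_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return round((after - before) / before * 100, 1)


# DeepSeek V3.2, Biology & Chemistry: 10.5% -> 43.9%
print(point_change(10.5, 43.9))     # 33.4 percentage points
print(relative_change(10.5, 43.9))  # 318.1 percent relative increase

# ChatGPT 5.2: 54.0% after a 4.6-point gain implies a prior score of 49.4%
print(round(54.0 - 4.6, 1))         # 49.4
```

A gain of 33.4 points from a 10.5% base more than quadruples the score, which is why the same change can look modest or dramatic depending on which measure is quoted.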

Technical Deep Dive

  • The ORCA benchmark comprises 500 practical math questions across domains such as Math & Conversions, Biology & Chemistry, and Health & Sports[1].
  • Scores on benchmarks like Math-500 vary with the evaluator and drop sharply on perturbed versions (e.g., LLaMA3-70B: 59.80% to 22.22%), because graders rely on string matching rather than mathematical equivalence[2].
  • Instruction tuning boosts small language models (SLMs) more than large ones on math tasks; for LLMs like Kimi-K2 the effect is negligible or slightly negative (94.2% to 93.8%)[2].
  • Fine-tuning Llama-3.1-8B on the Orca-Math dataset (200k problems) with QLoRA/Spectrum yields ~60% on GSM8K[4].
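The gap between string matching and equivalence checking mentioned above can be sketched as follows. This is a minimal illustration for numeric answers only, built on Python's `fractions.Fraction`; it is not the grader used by Math-500 or ORCA, and the function names are assumptions for this example.

```python
from fractions import Fraction


def exact_match(predicted: str, reference: str) -> bool:
    """Naive grader: the answers must be the same string after trimming."""
    return predicted.strip() == reference.strip()


def to_fraction(text: str) -> Fraction:
    """Parse '0.50', '1/2', or '1,000' into an exact rational number."""
    return Fraction(text.strip().replace(",", ""))


def equivalent(predicted: str, reference: str) -> bool:
    """Numeric-equivalence grader that falls back to string comparison."""
    try:
        return to_fraction(predicted) == to_fraction(reference)
    except (ValueError, ZeroDivisionError):
        # Not a plain number (e.g. a symbolic answer): compare as text.
        return exact_match(predicted, reference)
```

With these definitions, `exact_match("0.50", "1/2")` is false while `equivalent("0.50", "1/2")` is true, so a correct answer in a different surface form is no longer marked wrong. Symbolic answers would need a computer algebra system, which this sketch leaves out.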

Future Implications
AI analysis grounded in cited sources

LLM math benchmarks will prioritize semantic equivalence over string matching by 2027
Current automated graders fail on syntactic variations and perturbations, prompting calls for robust evaluation methods[2].
Fine-tuning on math-specific datasets like Orca-Math will close 10-20% gaps on practical benchmarks
Spectrum fine-tuning on Llama-3.1-8B achieved 60% on GSM8K, outperforming QLoRA by 4%[4].
Model regressions such as Grok 4.1's indicate trade-offs in multi-domain updates
Grok lost points in quantitative domains while others gained, suggesting prioritization of non-math capabilities[1].

Timeline

2025-11
Initial ORCA benchmark release showing all tested LLMs ≤63% on math
2026-01
arXiv paper highlights Math-500 reliability issues and instruction tuning limits
2026-02
ORCA second round: Gemini 3 Flash tops, Grok 4.1 regresses, most models improve