LLMs Advance in Math but Still Fail Basics

ORCA results reveal math gaps in top LLMs, a critical concern for building reliable quantitative AI tools
30-Second TL;DR
What Changed
LLMs improved on the ORCA math benchmark but have not mastered it.
Why It Matters
The results highlight the need for better reasoning in LLMs, which affects their reliability in quantitative applications. Practitioners should prioritize hybrid systems that combine LLMs with calculators or verifiers.
What To Do Next
Run your LLM on the ORCA benchmark via Hugging Face to quantify its math weaknesses before production deployment.
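The hybrid approach recommended above can be illustrated with a minimal verifier sketch. It assumes the model's output has already been parsed into an arithmetic expression and a claimed numeric answer; that parsing step is not shown. The names `safe_eval` and `verify_claim` are hypothetical and do not come from ORCA or any library.

```python
import ast
import math
import operator

# Whitelisted operators for the calculator; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


def verify_claim(expression: str, claimed: float, rel_tol: float = 1e-6) -> bool:
    """Recompute the expression and compare it with the model's claimed answer."""
    return math.isclose(safe_eval(expression), claimed, rel_tol=rel_tol)


if __name__ == "__main__":
    # Hypothetical example: a model converts 72 degrees Celsius to Fahrenheit.
    print(verify_claim("72 * 1.8 + 32", 161.6))
    print(verify_claim("72 * 1.8 + 32", 161.0))
```

A check like this only catches calculation slips. It cannot tell whether the model set up the right expression in the first place, so it complements benchmark evaluation but does not replace it.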
Deep Insight
Web-grounded analysis with 6 cited sources.
Enhanced Key Takeaways
- ChatGPT 5.2 improved to 54.0% accuracy on ORCA, a 4.6 percentage point gain from prior tests[1].
- DeepSeek V3.2 saw dramatic gains in Biology & Chemistry, from 10.5% to 43.9% accuracy on ORCA[1].
- Calculation errors now represent 39.8% of ORCA mistakes, up from 33.4%, while rounding errors decreased[1].
- Automated math benchmarks like Math-500 suffer from reliability issues due to exact string matching and benchmark contamination[2].
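The takeaways above report changes in percentage points, which are easy to confuse with relative change. The short sketch below works through that arithmetic using only the figures quoted above. The helper names are hypothetical, and the implied prior score for ChatGPT 5.2 is derived from the two quoted numbers; the source does not state it.

```python
def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return round(after_pct - before_pct, 1)


def relative_change(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting value, in percent."""
    return round((after_pct - before_pct) / before_pct * 100, 1)


# (before, after) pairs taken from the takeaways above.
figures = {
    "DeepSeek V3.2, Biology & Chemistry accuracy": (10.5, 43.9),
    "Calculation errors as share of ORCA mistakes": (33.4, 39.8),
}

for label, (before, after) in figures.items():
    print(
        f"{label}: {point_change(before, after):+.1f} pts, "
        f"{relative_change(before, after):+.1f}% relative"
    )

# Implied earlier ChatGPT score: 54.0% now minus the 4.6-point gain.
chatgpt_prior = round(54.0 - 4.6, 1)
print(f"ChatGPT 5.2 implied prior accuracy: {chatgpt_prior}%")
```

The DeepSeek result shows why the distinction matters: a 33.4-point gain from a 10.5% base means accuracy roughly quadrupled.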
Competitor Analysis
| Model | ORCA Math Score | Change from Prior | Other Domain Notes |
|---|---|---|---|
| Gemini 3 Flash | Top (C-grade, ~60-70%) | Improved | Math & Conversions: 93.2% (+10.2 pts)[1] |
| ChatGPT 5.2 | 54.0% | +4.6 pts | N/A[1] |
| Grok 4.1 | 60.2% | -2.6 pts | Losses in Health & Sports (-9 pts), Bio/Chem (-5.3 pts)[1] |
| Claude Sonnet 4.5 | Improved (not top) | Gained | N/A[1] |
| DeepSeek V3.2 | Improved (not top) | Gained | Bio/Chem: +33.4 pts[1] |
Technical Deep Dive
- The ORCA benchmark comprises 500 practical math questions across domains such as Math & Conversions, Biology & Chemistry, and Health & Sports[1].
- Scores on benchmarks like Math-500 vary with the evaluator and drop sharply on perturbed versions (e.g., LLaMA3-70B: 59.80% to 22.22%), because grading relies on string matching rather than mathematical equivalence[2].
- Instruction tuning boosts small language models (SLMs) more than large ones on math tasks, with little or no benefit for LLMs like Kimi-K2 (94.2% to 93.8%, a 0.4-point drop)[2].
- Fine-tuning Llama-3.1-8B on the Orca-Math dataset (200k problems) with QLoRA/Spectrum yields ~60% on GSM8K[4].
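The string-matching problem noted above can be seen in a small sketch that compares exact string grading with a numeric equivalence check. It covers only plain numbers and simple fractions, and it is not the grader used by Math-500 or ORCA. The function names are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def exact_match(prediction: str, reference: str) -> bool:
    """Grade by comparing the answer strings character for character."""
    return prediction.strip() == reference.strip()


def to_fraction(text: str) -> Optional[Fraction]:
    """Parse '0.5', '1/2' or '1,000' into an exact rational, else None."""
    cleaned = "".join(text.split()).replace(",", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def equivalent(prediction: str, reference: str) -> bool:
    """Grade by numeric value, falling back to string comparison."""
    p, r = to_fraction(prediction), to_fraction(reference)
    if p is None or r is None:
        return exact_match(prediction, reference)
    return p == r


if __name__ == "__main__":
    for pred, ref in [("0.5", "1/2"), ("1,000", "1000.0"), ("0.33", "1/3")]:
        print(pred, ref, exact_match(pred, ref), equivalent(pred, ref))
```

Using `Fraction` keeps the comparison exact, so `0.5` and `1/2` are accepted while a rounded `0.33` is still rejected against `1/3`. Whether rounded answers should pass is a separate policy decision that would need an explicit tolerance.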
Sources (6)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
Original source: The Register - AI/ML