LLMs Advance in Math but Still Fail Basics


🇬🇧Read original on The Register - AI/ML

#math-benchmarks #llm-limitations #reasoning-evaluation #orca-benchmark

💡ORCA results reveal math gaps in top LLMs, which matters for building reliable quantitative AI tools

⚡ 30-Second TL;DR

What Changed

LLMs improved on the ORCA math benchmark but have not mastered it

Why It Matters

The results highlight the need for better reasoning in LLMs, which affects reliability in quantitative applications. Practitioners should prioritize hybrid systems that combine LLMs with calculators or verifiers.
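One way to picture the "LLM plus verifier" idea is a checker that recomputes the arithmetic exactly and compares it with the number the model returned. The sketch below is illustrative only: `safe_eval` and `verify` are hypothetical helper names, the expression is assumed to be plain arithmetic already extracted from the question, and the rounding precision is an assumed parameter, not something ORCA specifies.

```python
import ast
import operator
from fractions import Fraction

# Binary operators the checker is willing to evaluate (assumption: plain arithmetic only).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> Fraction:
    """Evaluate a plain arithmetic expression exactly, without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            # Going through str() keeps 0.1 as 1/10 rather than its binary approximation.
            return Fraction(str(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def verify(expr: str, llm_answer: str, decimals: int = 2) -> bool:
    """Return True if the model's answer matches the exact result after rounding."""
    expected = round(safe_eval(expr), decimals)
    claimed = round(Fraction(llm_answer), decimals)
    return expected == claimed


# Hypothetical example: a Celsius-to-Fahrenheit conversion the model claims to have solved.
print(verify("72 * 1.8 + 32", "161.6"))  # True
print(verify("72 * 1.8 + 32", "161.5"))  # False
```

Exact rational arithmetic is used here so that the checker itself cannot introduce the calculation or rounding errors it is meant to catch.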

What To Do Next

Run your LLM on the ORCA benchmark via Hugging Face to quantify math weaknesses before production deployment.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 6 cited sources.

🔑 Enhanced Key Takeaways

•ChatGPT 5.2 improved to 54.0% accuracy on ORCA, a 4.6 percentage point gain from prior tests[1].
•DeepSeek V3.2 saw dramatic gains in Biology & Chemistry from 10.5% to 43.9% accuracy on ORCA[1].
•Calculation errors now represent 39.8% of ORCA mistakes, up from 33.4%, while rounding errors decreased[1].
•Automated math benchmarks like Math-500 suffer from reliability issues due to exact string matching and benchmark contamination[2].
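The figures above are stated in percentage points, which are easy to confuse with relative change. As a small worked example, the snippet below computes both for the numbers quoted in the takeaways. The helper names are hypothetical, and the 49.4% starting value for ChatGPT 5.2 is derived here as 54.0 minus 4.6 rather than quoted from the source.

```python
def point_change(before: float, after: float) -> float:
    """Absolute change in percentage points."""
    return round(after - before, 1)


def relative_change(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return round((after - before) / before * 100, 1)


# DeepSeek V3.2, Biology & Chemistry: 10.5% -> 43.9%
print(point_change(10.5, 43.9))     # 33.4
print(relative_change(10.5, 43.9))  # 318.1

# Share of mistakes that are calculation errors: 33.4% -> 39.8%
print(point_change(33.4, 39.8))     # 6.4
print(relative_change(33.4, 39.8))  # 19.2

# ChatGPT 5.2: derived prior score of 49.4% -> 54.0%
print(point_change(49.4, 54.0))     # 4.6
print(relative_change(49.4, 54.0))  # 9.3
```

A 33.4-point gain from a 10.5% base is roughly a fourfold score, while a 4.6-point gain from about 49% is under a 10% relative improvement, so the two kinds of figure should not be compared directly.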

📊 Competitor Analysis

Model	ORCA Math Score	Change from Prior	Other Domain Notes
Gemini 3 Flash	Top (C-grade, ~60-70%)	Improved	Math & Conversions: 93.2% (+10.2 pts)[1]
ChatGPT 5.2	54.0%	+4.6 pts	N/A[1]
Grok 4.1	60.2%	-2.6 pts	Losses in Health & Sports (-9 pts), Bio/Chem (-5.3 pts)[1]
Claude Sonnet 4.5	Improved (exact % not stated; not top)	Gained	N/A[1]
DeepSeek V3.2	Improved (not top)	Gained	Bio/Chem: +33.4 pts[1]

🛠️ Technical Deep Dive

•ORCA Benchmark comprises 500 practical math questions across domains like Math & Conversions, Biology & Chemistry, Health & Sports[1].
•Scores on benchmarks like Math-500 vary with the evaluator and drop on perturbed versions (e.g., LLaMA3-70B: 59.80% to 22.22%), because graders rely on string matching rather than mathematical equivalence[2].
•Instruction tuning boosts small language models (SLMs) more than large ones on math tasks, with little or no gain for LLMs like Kimi-K2 (94.2% to 93.8%, a slight decrease)[2].
•Fine-tuning on Orca-Math dataset (200k problems) with QLoRA/Spectrum on Llama-3.1-8B yields ~60% on GSM8K[4].
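To illustrate the string-matching problem noted in the list above, here is a minimal sketch that contrasts an exact-string grader with one that compares numeric values. It is not the grader used by Math-500 or ORCA; the function names, the accepted answer formats, and the tolerance are all assumptions made for the example.

```python
from fractions import Fraction
from typing import Optional


def exact_match(predicted: str, reference: str) -> bool:
    """String-matching grader: only identical text (after trimming) passes."""
    return predicted.strip() == reference.strip()


def _to_number(text: str) -> Optional[Fraction]:
    """Parse integers, decimals, 'a/b' fractions and percentages; None if unparseable."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    scale = Fraction(1)
    if cleaned.endswith("%"):
        cleaned, scale = cleaned[:-1], Fraction(1, 100)
    try:
        return Fraction(cleaned) * scale
    except (ValueError, ZeroDivisionError):
        return None


def equivalent(predicted: str, reference: str,
               tol: Fraction = Fraction(1, 10**6)) -> bool:
    """Numeric-equivalence grader; falls back to string matching for non-numeric text."""
    p, r = _to_number(predicted), _to_number(reference)
    if p is None or r is None:
        return exact_match(predicted, reference)
    return abs(p - r) <= tol


# The same correct answer written three ways.
for answer in ["1/2", "0.5", "50%"]:
    print(answer, exact_match(answer, "1/2"), equivalent(answer, "1/2"))
```

This only covers numeric answers. Judging symbolic answers (for example, two algebraically equal expressions) would need a computer algebra system or a similar tool, which is outside the scope of this sketch.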

🔮 Future Implications

AI analysis grounded in cited sources

LLM math benchmarks will prioritize semantic equivalence over string matching by 2027

Current automated graders fail on syntactic variations and perturbations, prompting calls for robust evaluation methods[2].

Fine-tuning on math-specific datasets like Orca-Math will close 10-20% gaps on practical benchmarks

Spectrum fine-tuning on Llama-3.1-8B achieved 60% on GSM8K, outperforming QLoRA by 4%[4].

Model regressions like Grok 4.1 indicate trade-offs in multi-domain updates

Grok lost points in quantitative domains while others gained, suggesting prioritization of non-math capabilities[1].

⏳ Timeline

2025-11

Initial ORCA benchmark release showing all tested LLMs ≤63% on math

2026-01

arXiv paper highlights Math-500 reliability issues and instruction tuning limits

2026-02

ORCA second round: Gemini 3 Flash tops, Grok 4.1 regresses, most models improve

📎 Sources (6)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
