Benchmarking AI Toward Digital Scientists

💡 New benchmarks like GPQA expose AI science limits; validate your models now
⚡ 30-Second TL;DR
What Changed
GPQA tests novel, "Google-proof" questions that resist web search; o1 excels at the multi-step reasoning they demand.
Why It Matters
Redefines AI eval for science; boosts reliable LLM use in research pipelines.
What To Do Next
Run your LLM on the GPQA-Diamond subset to gauge its scientific-reasoning limits; a minimal evaluation sketch follows below.
Who should care: Researchers & Academics
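A minimal sketch of such a run, assuming the gated Hugging Face dataset Idavidrein/gpqa (you must accept its terms first) and a placeholder `ask_model` function standing in for your actual LLM API call; the column names are as published with the dataset, so verify them against your copy:

```python
# Minimal GPQA-Diamond evaluation loop (a sketch, not a full harness).
# Assumptions: `pip install datasets`, access granted to the gated
# Hugging Face dataset Idavidrein/gpqa, and an `ask_model` placeholder
# that you replace with your real LLM API call.
import random

from datasets import load_dataset

LETTERS = "ABCD"

def ask_model(prompt: str) -> str:
    """Placeholder: swap in your LLM API call (OpenAI, Anthropic, local, ...)."""
    raise NotImplementedError

def build_item(row: dict) -> tuple[str, str]:
    # Shuffle the four options so the correct answer isn't always in one slot.
    options = [row["Correct Answer"], row["Incorrect Answer 1"],
               row["Incorrect Answer 2"], row["Incorrect Answer 3"]]
    random.shuffle(options)
    choices = "\n".join(f"{LETTERS[i]}) {o}" for i, o in enumerate(options))
    prompt = (f"{row['Question']}\n\n{choices}\n\n"
              "Think step by step, then answer with only the letter.")
    return prompt, LETTERS[options.index(row["Correct Answer"])]

diamond = load_dataset("Idavidrein/gpqa", "gpqa_diamond")["train"]
correct = 0
for row in diamond:
    prompt, gold = build_item(row)
    reply = ask_model(prompt).strip().upper()
    correct += reply.startswith(gold)
print(f"GPQA-Diamond accuracy: {correct}/{len(diamond)} = {correct / len(diamond):.1%}")
```

Shuffling the options each run guards against positional bias; a stricter harness would parse the model's final letter with a regex instead of `startswith`.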
🧠 Deep Insight
Web-grounded analysis with 8 cited sources.
🔑 Enhanced Key Takeaways
- GPQA-Diamond consists of exactly 198 multiple-choice questions in biology, physics, and chemistry, introduced in late 2023 by researchers from New York University and Anthropic.[1]
- As of March 8, 2026, GPT-5.3 Codex leads the GPQA leaderboard with 91.5%, followed by Gemini 3 Pro Preview at 90.8% and GPT-5.2 Pro at 90.3%, surpassing PhD experts' 65% average.[6]
- On a separate leaderboard, Gemini 3 Pro achieves 92% on GPQA Diamond, ahead of Grok 4 at 88% and Claude Opus at 87%, highlighting rapid frontier-model convergence.[5]
📊 Competitor Analysis
| Model | GPQA Diamond Score | Provider |
|---|---|---|
| GPT-5.3 Codex | 91.5% | OpenAI |
| Gemini 3 Pro Preview | 90.8% | Google |
| GPT-5.2 Pro | 90.3% | OpenAI |
| Grok 4 | 88% | xAI |
| Claude Opus | 87% | Anthropic |
| Mistral | 57% | Mistral AI |
🛠️ Technical Deep Dive
- GPQA-Diamond uses two evaluation modes: zero-shot chain-of-thought, where models explain their reasoning steps without examples, and few-shot chain-of-thought with 5 worked example questions (both prompt styles are sketched after this list).[4]
- Questions are validated so that both expert validators answered correctly while at most one of three non-expert validators did, ensuring unambiguous answers and high difficulty (the rule reduces to the simple predicate sketched below).[4]
- The benchmark resists search-based solutions: skilled non-experts reached only 34% even with full web access.[2]
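To make the two modes concrete, here is a hedged sketch of how the prompts differ; the exemplar structure is a plausible stand-in, not GPQA's official harness code:

```python
# Sketch of the two GPQA prompt modes described above. The exemplar
# structure is a plausible stand-in, not GPQA's official harness code.

ZERO_SHOT_COT = (
    "{question}\n{choices}\n"
    "Let's think step by step, then answer with a single letter."
)

def few_shot_cot(exemplars: list[dict], question: str, choices: str) -> str:
    """Prepend 5 worked examples (question, reasoning, answer) to the target."""
    blocks = [
        f"{ex['question']}\n{ex['choices']}\n"
        f"Reasoning: {ex['reasoning']}\nAnswer: {ex['answer']}"
        for ex in exemplars[:5]  # few-shot mode uses 5 example questions
    ]
    blocks.append(ZERO_SHOT_COT.format(question=question, choices=choices))
    return "\n\n".join(blocks)
```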
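And the validation rule reduces to a simple predicate over validator votes; this encodes the criterion as described, not the authors' actual pipeline:

```python
# The Diamond admission rule as a predicate over validator votes
# (a sketch of the criterion as described, not the authors' pipeline).
def is_diamond(expert_correct: list[bool], nonexpert_correct: list[bool]) -> bool:
    return all(expert_correct) and sum(nonexpert_correct) <= 1

# Both experts right, 1 of 3 non-experts right: admitted.
assert is_diamond([True, True], [False, True, False])
# Two of three non-experts right: too guessable, rejected.
assert not is_diamond([True, True], [True, True, False])
```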
🔮 Future Implications
AI analysis grounded in cited sources
AI will automate 50%+ of graduate-level scientific Q&A by 2027
Top models already exceed PhD expert averages at 91.5% on GPQA-Diamond, enabling reliable research assistance in biology, physics, and chemistry.[6]
Scalable oversight via process supervision will become standard by 2027
GPQA chain-of-thought outputs expose full reasoning chains, supporting explanation checking beyond accuracy scores (see the sketch after this list).[1]
Frontier physics tasks like CritPt will remain unsolved (<20%) through 2026
State-of-the-art models score only 11.5% on CritPt despite GPQA success, indicating persistent limits in full research-scale challenges.[7]
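What such explanation checking could look like in practice, as a hedged sketch: split a chain-of-thought transcript into steps and have a verifier (here a hypothetical `grade_step`, e.g. another LLM call) label each one, so a run is scored on its reasoning rather than only its final letter:

```python
# Hedged sketch of explanation checking over a chain-of-thought transcript.
# `grade_step` is a hypothetical verifier (e.g., another LLM call or a
# rubric-based check); nothing here is GPQA's official tooling.
def grade_step(step: str) -> bool:
    """Placeholder verifier: replace with an LLM judge or rule-based check."""
    raise NotImplementedError

def process_score(chain_of_thought: str) -> float:
    # Split the transcript into non-empty lines and grade each as a step.
    steps = [s.strip() for s in chain_of_thought.split("\n") if s.strip()]
    if not steps:
        return 0.0
    return sum(grade_step(s) for s in steps) / len(steps)  # fraction of sound steps
```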
⏳ Timeline
2023-10
GPQA benchmark introduced by Rein et al. with 448 graduate-level STEM questions, including the Diamond subset.[1]
2023-12
GPQA-Diamond (198 hardest questions) released by NYU and Anthropic researchers.[1]
2024-12
o1 model achieves 80%+ on GPQA-Diamond, exceeding the experts' 65-70% average, as referenced in the article.[article]
2026-03
GPT-5.3 Codex tops GPQA leaderboard at 91.5%, with Gemini 3 Pro at 90.8%.[6]
📎 Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- [1] intuitionlabs.ai — GPQA Diamond AI Benchmark
- [2] artificialanalysis.ai — GPQA Diamond
- [3] youtube.com — Watch
- [4] vals.ai — GPQA
- [5] bracai.eu — GPQA Benchmark Leaderboard
- [6] pricepertoken.com — GPQA
- [7] vertu.com — AI Model Leaderboard 2026: Intelligence, Speed, Price, Context, a Complete Ranking Guide
- [8] pluralsight.com — Best AI Models 2026 List
AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅 ↗