
Benchmarking AI Toward Digital Scientists


💡 New benchmarks like GPQA expose AI science limits; validate your models now

⚡ 30-Second TL;DR

What Changed

GPQA tests novel questions that can't be answered by web search; o1 excels at multi-step logic.

Why It Matters

Redefines AI eval for science; boosts reliable LLM use in research pipelines.

What To Do Next

Run your LLM on the GPQA Diamond subset to gauge its scientific reasoning limits; a starter harness is sketched below.
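
A minimal sketch of such a run, under two assumptions: the public Hugging Face dataset `Idavidrein/gpqa` (gated; you must accept its terms) exposes a `gpqa_diamond` config with `Question` / `Correct Answer` / `Incorrect Answer 1-3` columns, and you supply a `query_model(prompt) -> str` callable wrapping your own LLM. This is not an official harness.

```python
# Minimal GPQA-Diamond evaluation sketch (hypothetical harness, not the
# official one). Assumes the gated Hugging Face dataset "Idavidrein/gpqa"
# with a "gpqa_diamond" config, and a query_model(prompt) -> str callable
# that wraps your own LLM.
import random

from datasets import load_dataset


def build_prompt(row: dict) -> tuple[str, str]:
    """Shuffle the four answer options and return (prompt, correct_letter)."""
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    random.shuffle(options)
    letters = "ABCD"
    correct = letters[options.index(row["Correct Answer"])]
    body = "\n".join(f"{l}) {o}" for l, o in zip(letters, options))
    prompt = (
        f"{row['Question']}\n\n{body}\n\n"
        "Think step by step, then end with 'Answer: <letter>'."
    )
    return prompt, correct


def evaluate(query_model) -> float:
    """Zero-shot accuracy over the 198 Diamond questions."""
    ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
    hits = 0
    for row in ds:
        prompt, answer = build_prompt(row)
        reply = query_model(prompt)  # your LLM call goes here
        # Crude check: does the reply end with the expected letter?
        hits += reply.strip().upper().rstrip(".").endswith(answer)
    return hits / len(ds)
```

With those pieces in place, `evaluate(query_model)` returns a raw zero-shot accuracy that is loosely comparable to the leaderboard figures cited below.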

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • GPQA-Diamond consists of exactly 198 multiple-choice questions in biology, physics, and chemistry, introduced in late 2023 by researchers from New York University and Anthropic.[1]
  • As of March 8, 2026, GPT-5.3 Codex leads the GPQA leaderboard with 91.5%, followed by Gemini 3 Pro Preview at 90.8% and GPT-5.2 Pro at 90.3%, surpassing PhD experts' 65% average.[6]
  • A separate tracker puts Gemini 3 Pro at 92% on GPQA Diamond, ahead of Grok 4 at 88% and Claude Opus at 87%, highlighting how quickly frontier models are converging.[5]
📊 Competitor Analysis

| Model | GPQA Diamond Score | Provider |
| --- | --- | --- |
| GPT-5.3 Codex | 91.5% | OpenAI |
| Gemini 3 Pro Preview | 90.8% | Google |
| GPT-5.2 Pro | 90.3% | OpenAI |
| Grok 4 | 88% | xAI |
| Claude Opus | 87% | Anthropic |
| Mistral | 57% | Mistral AI |

🛠️ Technical Deep Dive

  • GPQA-Diamond uses two evaluation modes: zero-shot chain-of-thought, where the model explains its reasoning steps without examples, and few-shot chain-of-thought with five worked example questions (see the prompt sketch after this list).[4]
  • Questions are validated such that expert validators answered correctly and no more than one out of three non-experts succeeded, ensuring unambiguous answers and high difficulty.[4]
  • The benchmark resists search-based shortcuts: skilled non-experts reached only 34% even with full web access.[2]
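
As a rough illustration of the two prompting modes above; the exact official templates are not reproduced here, so treat these strings as assumptions:

```python
# Illustrative prompt templates for the two GPQA-Diamond modes; the wording
# is an assumption, not the benchmark authors' exact phrasing.
ZERO_SHOT_COT = (
    "{question}\n\n"
    "Let's think step by step, then end with 'Answer: <letter>'."
)


def few_shot_cot_prompt(question: str,
                        examples: list[tuple[str, str, str]]) -> str:
    """examples: five (question, worked_reasoning, answer_letter) triples."""
    parts = [
        f"Question: {q}\nReasoning: {r}\nAnswer: {a}\n"
        for q, r, a in examples
    ]
    parts.append(f"Question: {question}\nReasoning:")
    return "\n".join(parts)
```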

🔮 Future Implications

AI analysis grounded in cited sources.

AI will automate 50%+ of graduate-level scientific Q&A by 2027
Top models already exceed PhD expert averages at 91.5% on GPQA-Diamond, enabling reliable research assistance in biology, physics, and chemistry.[6]
Scalable oversight via process supervision will become standard by 2027
GPQA chain-of-thought outputs expose full reasoning chains, supporting explanation checking beyond raw accuracy scores (see the parsing sketch after this section).[1]
Frontier physics tasks like CritPt will remain unsolved (<20%) through 2026
State-of-the-art models score only 11.5% on CritPt despite GPQA success, indicating persistent limits in full research-scale challenges.[7]
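
If you want to audit explanations rather than only score answers, a small helper can split a chain-of-thought reply into its reasoning and final letter. The regex and the 'Answer:' convention below are assumptions about the output format, not part of GPQA itself:

```python
# Hypothetical helper: separate the reasoning chain from the final letter so
# the chain itself can be inspected, not just scored.
import re


def split_cot(reply: str) -> tuple[str, str | None]:
    """Return (reasoning_text, final_letter or None if no answer found)."""
    text = reply.strip()
    m = re.search(r"Answer:\s*\(?([ABCD])\)?\s*\.?\s*$", text,
                  flags=re.IGNORECASE)
    if not m:
        return text, None
    return text[: m.start()].rstrip(), m.group(1).upper()
```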

Timeline

2023-10
GPQA benchmark introduced by Rein et al. with 448 graduate-level STEM questions, including Diamond subset.[1]
2023-12
GPQA-Diamond (198 hardest questions) released by NYU and Anthropic researchers.[1]
2025-12
The o1 model achieves 80%+ on GPQA, exceeding the 65-70% expert average, as referenced in the source article.[article]
2026-03
GPT-5.3 Codex tops GPQA leaderboard at 91.5%, with Gemini 3 Pro at 90.8%.[6]

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅