
Benchmarking AI Toward Digital Scientists


💡 New benchmarks like GPQA expose AI science limits; validate your models now

⚡ 30-Second TL;DR

What Changed

GPQA tests novel questions that can't be answered by web search; o1 excels at multi-step logic.

Why It Matters

Redefines AI eval for science; boosts reliable LLM use in research pipelines.

What To Do Next

Run your LLM on the GPQA Diamond subset to gauge its scientific reasoning limits; a starter harness is sketched below.
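
A minimal sketch of such a run, under two assumptions: the public Hugging Face dataset `Idavidrein/gpqa` (gated; you must accept its terms) exposes a `gpqa_diamond` config with `Question` / `Correct Answer` / `Incorrect Answer 1-3` columns, and you supply a `query_model(prompt) -> str` callable wrapping your own LLM. This is not an official harness.

```python
# Minimal GPQA-Diamond evaluation sketch (hypothetical harness, not the
# official one). Assumes the gated Hugging Face dataset "Idavidrein/gpqa"
# with a "gpqa_diamond" config, and a query_model(prompt) -> str callable
# that wraps your own LLM.
import random

from datasets import load_dataset


def build_prompt(row: dict) -> tuple[str, str]:
    """Shuffle the four answer options and return (prompt, correct_letter)."""
    options = [
        row["Correct Answer"],
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    random.shuffle(options)
    letters = "ABCD"
    correct = letters[options.index(row["Correct Answer"])]
    body = "\n".join(f"{l}) {o}" for l, o in zip(letters, options))
    prompt = (
        f"{row['Question']}\n\n{body}\n\n"
        "Think step by step, then end with 'Answer: <letter>'."
    )
    return prompt, correct


def evaluate(query_model) -> float:
    """Zero-shot accuracy over the 198 Diamond questions."""
    ds = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
    hits = 0
    for row in ds:
        prompt, answer = build_prompt(row)
        reply = query_model(prompt)  # your LLM call goes here
        # Crude check: does the reply end with the expected letter?
        hits += reply.strip().upper().rstrip(".").endswith(answer)
    return hits / len(ds)
```

With those pieces in place, `evaluate(query_model)` returns a raw zero-shot accuracy that is loosely comparable to the leaderboard figures cited below.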

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • GPQA-Diamond consists of exactly 198 multiple-choice questions in biology, physics, and chemistry, introduced in late 2023 by researchers from New York University and Anthropic.[1]
  • As of March 8, 2026, GPT-5.3 Codex leads the GPQA leaderboard with 91.5%, followed by Gemini 3 Pro Preview at 90.8% and GPT-5.2 Pro at 90.3%, surpassing PhD experts' 65% average.[6]
  • A separate tracker puts Gemini 3 Pro at 92% on GPQA Diamond, ahead of Grok 4 at 88% and Claude Opus at 87%, highlighting how quickly frontier models are converging.[5]
📊 Competitor Analysis

| Model | GPQA Diamond Score | Provider |
| --- | --- | --- |
| GPT-5.3 Codex | 91.5% | OpenAI |
| Gemini 3 Pro Preview | 90.8% | Google |
| GPT-5.2 Pro | 90.3% | OpenAI |
| Grok 4 | 88% | xAI |
| Claude Opus | 87% | Anthropic |
| Mistral | 57% | Mistral AI |

🛠️ Technical Deep Dive

  • GPQA-Diamond uses two evaluation modes: zero-shot chain-of-thought, where the model explains its reasoning steps without examples, and few-shot chain-of-thought with five worked example questions (see the prompt sketch after this list).[4]
  • Questions are validated such that expert validators answered correctly and no more than one out of three non-experts succeeded, ensuring unambiguous answers and high difficulty.[4]
  • The benchmark resists search-based shortcuts: skilled non-experts reached only 34% even with full web access.[2]
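
As a rough illustration of the two prompting modes above; the exact official templates are not reproduced here, so treat these strings as assumptions:

```python
# Illustrative prompt templates for the two GPQA-Diamond modes; the wording
# is an assumption, not the benchmark authors' exact phrasing.
ZERO_SHOT_COT = (
    "{question}\n\n"
    "Let's think step by step, then end with 'Answer: <letter>'."
)


def few_shot_cot_prompt(question: str,
                        examples: list[tuple[str, str, str]]) -> str:
    """examples: five (question, worked_reasoning, answer_letter) triples."""
    parts = [
        f"Question: {q}\nReasoning: {r}\nAnswer: {a}\n"
        for q, r, a in examples
    ]
    parts.append(f"Question: {question}\nReasoning:")
    return "\n".join(parts)
```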

🔮 Future Implications

AI analysis grounded in cited sources.

AI will automate 50%+ of graduate-level scientific Q&A by 2027
Top models already exceed PhD expert averages at 91.5% on GPQA-Diamond, enabling reliable research assistance in biology, physics, and chemistry.[6]
Scalable oversight via process supervision will become standard by 2027
GPQA chain-of-thought outputs expose full reasoning chains, supporting explanation checking beyond raw accuracy scores (see the parsing sketch after this section).[1]
Frontier physics tasks like CritPt will remain unsolved (<20%) through 2026
State-of-the-art models score only 11.5% on CritPt despite GPQA success, indicating persistent limits in full research-scale challenges.[7]
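
If you want to audit explanations rather than only score answers, a small helper can split a chain-of-thought reply into its reasoning and final letter. The regex and the 'Answer:' convention below are assumptions about the output format, not part of GPQA itself:

```python
# Hypothetical helper: separate the reasoning chain from the final letter so
# the chain itself can be inspected, not just scored.
import re


def split_cot(reply: str) -> tuple[str, str | None]:
    """Return (reasoning_text, final_letter or None if no answer found)."""
    text = reply.strip()
    m = re.search(r"Answer:\s*\(?([ABCD])\)?\s*\.?\s*$", text,
                  flags=re.IGNORECASE)
    if not m:
        return text, None
    return text[: m.start()].rstrip(), m.group(1).upper()
```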

Timeline

2023-10
GPQA benchmark introduced by Rein et al. with 448 graduate-level STEM questions, including Diamond subset.[1]
2023-12
GPQA-Diamond (198 hardest questions) released by NYU and Anthropic researchers.[1]
2025-12
The o1 model achieves 80%+ on GPQA, exceeding the 65-70% expert average, as referenced in the source article.[article]
2026-03
GPT-5.3 Codex tops GPQA leaderboard at 91.5%, with Gemini 3 Pro at 90.8%.[6]

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅