Humans outperform AI in rigorous mathematical research testing

💡See why top mathematicians like Terence Tao are testing AI's limits on truly novel research problems.
⚡ 30-Second TL;DR
What Changed
AI models often plagiarize or fail to cite research sources when solving complex problems.
Why It Matters
This study suggests that current LLMs are better at synthesizing existing knowledge than performing truly novel, high-level mathematical research.
What To Do Next
When using AI for research, verify all mathematical proofs independently, as models may hallucinate or fail to cite sources correctly.
Who should care: Researchers & Academics
🧠 Deep Insight
AI-generated analysis for this event.
🔑 Enhanced Key Takeaways
- The First Proof project utilized a 'blind' evaluation methodology where problems were sourced from pre-publication manuscripts to prevent data contamination from training sets.
- Analysis revealed that AI models frequently exhibit 'hallucinated citations,' where they generate plausible-sounding but non-existent mathematical papers to support incorrect proofs.
- The study identified a specific failure mode termed 'logical drift,' where models maintain correct syntax and formal notation while losing the underlying semantic thread of the proof.
- Human researchers outperformed AI specifically in 'proof synthesis,' the ability to connect disparate mathematical domains that have not been previously linked in the training corpus.
- The ProofCouncil system demonstrated a higher success rate in problems involving standard algebraic geometry compared to those requiring novel combinatorial insights, suggesting a bias toward well-documented mathematical fields.
📊 Competitor Analysis
| Feature | ProofCouncil | Lean/Isabelle (Formal Verification) | AlphaProof (Google DeepMind) |
|---|---|---|---|
| Primary Focus | Research-level discovery | Formal proof checking | Olympiad-level problem solving |
| Reasoning Engine | LLM-based heuristic | Rule-based/Interactive | Neuro-symbolic reinforcement |
| Success Rate (Research) | 60% | N/A (Manual) | ~40-50% (Estimated) |
| Pricing | Research/Academic | Open Source | Proprietary/Internal |
🛠️ Technical Deep Dive
- Architecture: ProofCouncil utilizes a multi-agent framework where a 'Prover' agent generates steps and a 'Critic' agent performs recursive verification against a formal library.
- Training Data: The model was fine-tuned on a curated dataset of LaTeX-formatted research papers from arXiv, filtered to those published in high-impact journals to reduce noise.
- Inference Mechanism: Employs a Monte Carlo Tree Search (MCTS) variant adapted for mathematical logic, allowing the model to backtrack when a proof branch leads to a contradiction.
- Constraint Handling: Integrates a symbolic solver to handle intermediate arithmetic and algebraic simplifications, offloading computation from the transformer backbone.
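The Prover/Critic loop with backtracking described above can be illustrated with a toy sketch. This is not ProofCouncil's implementation: it uses a plain depth-first backtracking search rather than MCTS, the "proof" is just a sequence of arithmetic moves that turns a positive start integer into a target, and every name in it (`STEPS`, `critic_accepts`, `search`, `prove`) is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[int], int]]

# Hypothetical "Prover" moves: each one proposes a next state.
# Both moves strictly increase a positive integer.
STEPS: List[Step] = [
    ("double", lambda x: x * 2),
    ("add3", lambda x: x + 3),
]


def critic_accepts(state: int, target: int) -> bool:
    """Hypothetical 'Critic': reject a state that has overshot the target.

    Because every move increases the state, an overshoot can never be
    repaired, so the branch is a dead end.
    """
    return state <= target


def search(state: int, target: int, path: List[str], depth: int) -> Optional[List[str]]:
    """Depth-first search that backtracks when the Critic rejects a branch."""
    if state == target:
        return path
    if depth == 0:
        return None
    for name, move in STEPS:
        candidate = move(state)
        if not critic_accepts(candidate, target):
            continue  # Critic rejected this step; try the next move
        result = search(candidate, target, path + [name], depth - 1)
        if result is not None:
            return result
    return None  # all moves failed here; the caller backtracks


def prove(start: int, target: int, max_depth: int = 8) -> Optional[List[str]]:
    """Return a list of move names from start to target, or None if none is found."""
    return search(start, target, [], max_depth)


if __name__ == "__main__":
    print(prove(1, 11))  # ['double', 'double', 'double', 'add3']
    print(prove(1, 3))   # None
```

In a real system the Critic would check each step against a formal library, and the search would weigh branches by estimated promise instead of trying them in a fixed order.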
🔮 Future Implications
AI analysis grounded in cited sources
Mathematical research will shift toward 'AI-assisted verification' rather than 'AI-generated discovery' by 2028.
The high rate of hallucinated citations and logical drift makes current AI models unsuitable for autonomous research without human-in-the-loop validation.
Future benchmarks for LLMs will prioritize 'out-of-distribution' mathematical reasoning over standard dataset performance.
The failure of models on unpublished problems suggests that current benchmarks are saturated and susceptible to data leakage.
⏳ Timeline
2025-03
First Proof project launched to establish a benchmark for research-level mathematics.
2025-11
Initial pilot testing reveals high rates of plagiarism in early model iterations.
2026-04
ProofCouncil system finalized and deployed for the rigorous testing phase.
2026-06
Publication of findings comparing AI performance against human researchers.
Original source: 虎嗅 ↗



