January Mirror, an evidence-grounded clinical reasoning system, scored 87.5% on a 120-question 2025 endocrinology board-style exam, outperforming human experts (62.3%) and frontier LLMs such as GPT-5.2 (74.6%). It excelled on the hardest questions (76.7% accuracy) under closed-evidence constraints, with no web access. 74.2% of its outputs included traceable guideline citations, all of which were verified accurate.
Key Points
- 87.5% accuracy (105/120) vs. GPT-5.2's 74.6% and human experts' 62.3%
- 76.7% on the 30 hardest questions (human experts: below 50%)
- 74.2% of outputs cited guideline sources; citation accuracy verified at 100%
- Top-2 accuracy of 92.5% vs. GPT-5.2's 85.25%
- Closed-evidence setup outperformed LLMs with web access
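The headline percentages above follow directly from the question counts. A minimal sketch of the arithmetic (the correct-answer counts for the hardest-question and top-2 figures are back-derived from the reported percentages; the helper name is illustrative, not part of the Mirror system):

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction correct, expressed as a percentage rounded to one decimal."""
    return round(100 * correct / total, 1)

overall = accuracy(105, 120)  # 87.5 — overall score
hardest = accuracy(23, 30)    # 76.7 — the 30 hardest questions (23 correct implied)
top2 = accuracy(111, 120)     # 92.5 — correct answer within the top two choices
```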
Impact Analysis
The results demonstrate that a curated evidence layer can enable subspecialty reasoning superior to general LLMs equipped with web tools, while improving auditability for clinical use. They highlight the potential of specialized medical AI beyond broad general-purpose models.
Technical Details
Mirror integrates a curated endocrinology and cardiometabolic corpus with structured reasoning to produce evidence-linked outputs. Unlike its comparators, it operated without external retrieval, yet achieved high traceability through guideline citations.
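The evidence-linked output described above can be pictured as an answer object that carries citations into the fixed corpus. The sketch below is illustrative only: the class and field names are hypothetical, since Mirror's internals are not described in the source, and it assumes an output counts as "cited" when it links at least one guideline.

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    guideline: str   # identifier of a guideline in the curated corpus
    section: str     # location within that guideline

@dataclass
class EvidenceLinkedAnswer:
    choice: str                      # selected exam option
    rationale: str                   # structured reasoning text
    citations: list[Citation] = field(default_factory=list)

    def is_traceable(self) -> bool:
        """Counts as a cited output if it links at least one guideline source."""
        return len(self.citations) > 0

# Example: an answer traceable to a single guideline section.
ans = EvidenceLinkedAnswer(
    choice="B",
    rationale="Guideline-directed first-line therapy.",
    citations=[Citation("ADA Standards of Care", "Section 9")],
)
```

Under this shape, the reported 74.2% citation rate would be the fraction of answers for which `is_traceable()` holds, and the 100% citation accuracy would mean every linked citation checks out against the corpus.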