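LiveMedBench is a benchmark of real-world clinical cases for LLM evaluation. It is updated weekly, so test cases can be kept temporally separate from a model's training data to avoid contamination. Multi-agent curation safeguards data integrity, and the automated rubric-based evaluation aligns with expert judgment better than alternative methods. In testing, even the best-performing LLM reaches only 39.2%, which highlights gaps in handling clinical context.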
Key Points
- 2,756 cases across 38 specialties
- Rubric decomposes physician responses into individual criteria
- 84% of models degrade on post-cutoff cases
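To make the post-cutoff comparison concrete, here is a minimal sketch of how a temporal split could be computed. It is an illustration under stated assumptions, not the benchmark's own code: the field names `published` and `score`, the helper names, and the toy data are all hypothetical.

```python
from datetime import date
from statistics import mean

def split_by_cutoff(cases, cutoff):
    """Partition cases into those published on or before the cutoff and those after it."""
    pre = [c for c in cases if c["published"] <= cutoff]
    post = [c for c in cases if c["published"] > cutoff]
    return pre, post

def degradation(cases, cutoff):
    """Mean pre-cutoff score minus mean post-cutoff score (positive means the model degraded)."""
    pre, post = split_by_cutoff(cases, cutoff)
    if not pre or not post:
        return None  # the comparison is undefined without cases on both sides
    return mean(c["score"] for c in pre) - mean(c["score"] for c in post)

# Toy data (hypothetical): per-case scores for one model with an assumed training cutoff.
cases = [
    {"published": date(2024, 1, 10), "score": 0.50},
    {"published": date(2024, 3, 5), "score": 0.40},
    {"published": date(2024, 9, 1), "score": 0.30},
    {"published": date(2024, 11, 20), "score": 0.20},
]
gap = degradation(cases, cutoff=date(2024, 6, 1))
print(f"pre-minus-post gap: {gap:.2f}")
```

A positive gap for a given model is the kind of signal the 84% figure summarizes: performance is lower on cases the model could not have seen during training.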
Impact Analysis
Mitigates flaws in existing evaluations and exposes contamination risks, supporting more reliable assessment of medical AI.
Technical Details
Cases are harvested from online communities and paired with 16,702 rubric criteria. Error analysis points to failures in applying knowledge to the case at hand.
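The idea of rubric-based scoring can be sketched as follows. This is a simplified illustration, not the benchmark's implementation: the `Criterion` type, the unweighted fraction-of-criteria-met score, and the keyword-matching judge (a stand-in for whatever automated grader is actually used) are all assumptions, and the example criteria are invented for demonstration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Criterion:
    text: str     # human-readable statement of what the response should do
    keyword: str  # hypothetical field, used only by the toy judge below

def keyword_judge(response: str, criterion: Criterion) -> bool:
    """Toy stand-in for an automated grader: case-insensitive keyword match."""
    return criterion.keyword.lower() in response.lower()

def rubric_score(
    response: str,
    criteria: Sequence[Criterion],
    judge: Callable[[str, Criterion], bool] = keyword_judge,
) -> float:
    """Fraction of rubric criteria the response satisfies (unweighted, by assumption)."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    met = sum(1 for c in criteria if judge(response, c))
    return met / len(criteria)

# Invented example rubric and response, for illustration only.
criteria = [
    Criterion("Recommends an ECG", "ECG"),
    Criterion("Asks about symptom duration", "how long"),
    Criterion("Advises urgent in-person care", "emergency"),
    Criterion("Asks about medication history", "medication"),
]
response = "Please go to the emergency department now; they will run an ECG."
score = rubric_score(response, criteria)
print(f"rubric score: {score:.2f}")
```

Passing the judge in as a parameter keeps the scoring logic independent of how each criterion is checked, so the keyword matcher could be swapped for a model-based grader without changing `rubric_score`.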