🤖 Reddit r/MachineLearning • Mar 27, 2026

LoCoMo Audit: 6.4% Answer-Key Errors, Judge Passes 63% of Wrong Answers

💡 LoCoMo is flawed: 6.4% of the answer key is wrong, and the LLM judge accepts 63% of deliberately wrong answers. Time to rethink memory benchmarks.

⚡ 30-Second TL;DR

What Changed

The audit documents 99 errors in the answer key (6.4%), including hallucinated facts, temporal reasoning mistakes, and speaker misattribution.

Why It Matters

The audit exposes flaws in a popular long-context memory benchmark, which argues for caution in leaderboard comparisons and for better alternatives.

What To Do Next

Download the locomo-audit repo and validate your long-memory model scores against the documented fixes.
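
One way to approach that validation is to compare accuracy on the full question set with accuracy on only the questions whose answer key was not flagged. The sketch below is illustrative only: the question IDs, the `rescore` helper, and the input shapes are hypothetical and do not reflect the actual file layout of the locomo-audit repo.

```python
from typing import Dict, Set


def rescore(per_question_correct: Dict[str, bool], flagged_ids: Set[str]) -> Dict[str, float]:
    """Compare accuracy on all questions with accuracy on unflagged questions only.

    per_question_correct: hypothetical mapping of question ID -> whether the model
        was marked correct against the original answer key.
    flagged_ids: hypothetical set of question IDs whose key an audit marked as wrong.
    """
    total = len(per_question_correct)
    # Keep only the questions whose answer key was not flagged.
    kept = {qid: ok for qid, ok in per_question_correct.items() if qid not in flagged_ids}
    original = sum(per_question_correct.values()) / total if total else 0.0
    cleaned = sum(kept.values()) / len(kept) if kept else 0.0
    return {
        "original_accuracy": original,
        "cleaned_accuracy": cleaned,
        "questions_dropped": total - len(kept),
    }


# Toy data, not taken from LoCoMo or the audit.
results = {"q1": True, "q2": False, "q3": True, "q4": False}
flagged = {"q2"}
summary = rescore(results, flagged)
print(summary)
```

Dropping flagged questions is the simplest option. If the audit supplies corrected answers, re-grading against those would preserve the full question count.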

Who should care: Researchers & Academics

• The answer key contains 99 errors: hallucinated facts, temporal reasoning mistakes, and speaker misattribution
• The gpt-4o-mini judge accepts 62.81% of intentionally wrong but topically adjacent answers
• The lack of a standardized evaluation pipeline makes reported scores irreproducible
• The full audit repo, with documented errors and reproducible scripts, is available
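
To get a feel for how much a lenient judge could inflate a leaderboard number, here is a rough worst-case sketch. It assumes the judge never rejects a correct answer and that every wrong answer a model gives is topically adjacent, so the 62.81% acceptance rate applies to all of them. Neither assumption comes from the audit, and the `reported_accuracy` helper is hypothetical, so treat the output as an upper bound on inflation rather than a measurement.

```python
def reported_accuracy(true_accuracy: float, false_accept_rate: float) -> float:
    """Score a judge would report if it accepts all correct answers
    plus a fixed fraction of the wrong ones (worst-case assumption)."""
    if not (0.0 <= true_accuracy <= 1.0 and 0.0 <= false_accept_rate <= 1.0):
        raise ValueError("rates must be in [0, 1]")
    return true_accuracy + (1.0 - true_accuracy) * false_accept_rate


# From the audit summary: share of intentionally wrong, topically adjacent
# answers that the gpt-4o-mini judge passed.
FALSE_ACCEPT = 0.6281

# Illustrative true accuracies, not results for any real system.
for true_acc in (0.50, 0.70, 0.90):
    inflated = reported_accuracy(true_acc, FALSE_ACCEPT)
    print(f"true={true_acc:.2f} -> reported={inflated:.3f}")
```

Under these assumptions the gap between systems shrinks as the judge gets more lenient, which is one reason close leaderboard scores may not reflect real differences.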

Original source: Reddit r/MachineLearning ↗