🤖 Reddit r/MachineLearning • collected 10h ago
Refusal Alignment Eval Fails on Routing
💡 Flaws in alignment evals exposed via Chinese LLM censorship
⚡ 30-Second TL;DR
What Changed
Refusal probes reach 100% accuracy even on null/random data; only held-out generalization separates real signal from memorization
Why It Matters
Undermines trust in current alignment evals: causal interventions, not refusal benchmarks, are needed for reliable safety assessment across labs.
What To Do Next
Test held-out probes and ablate routing on your aligned LLMs.
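The sanity check above can be sketched in a few lines. This is a hypothetical illustration (synthetic random "activations", not real LLM features): with far more dimensions than examples, a linear probe fits even random labels perfectly in-sample, so only held-out accuracy is informative.

```python
# Minimal sketch: why 100% probe training accuracy can be meaningless.
# With d >> n, a linear probe separates even RANDOM labels in-sample;
# held-out accuracy stays at chance. Shapes are illustrative stand-ins
# for real model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 4096                 # few examples, high-dim "activations"
X = rng.normal(size=(n, d))      # random features (null data)
y = rng.integers(0, 2, size=n)   # random refusal labels

X_tr, X_te, y_tr, y_te = X[:100], X[100:], y[:100], y[100:]
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

train_acc = probe.score(X_tr, y_tr)  # near 1.0: memorizes noise
test_acc = probe.score(X_te, y_te)   # near 0.5: chance on held-out
print(f"train={train_acc:.2f}  held-out={test_acc:.2f}")
```

A probe that passes this null test (chance on held-out random labels, high on held-out real labels) is the kind the post argues evals should require.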
Who should care: Researchers & Academics
Original source: Reddit r/MachineLearning →