🤖 Reddit r/MachineLearning • Mar 23, 2026

Refusal Alignment Eval Fails on Routing

💡Flaws in alignment evals exposed via Chinese LLM censorship

⚡ 30-Second TL;DR

What Changed

Probes reach 100% accuracy even on null or random data, so in-sample probe accuracy says little on its own; held-out generalization is what discriminates.

Why It Matters

The finding challenges current alignment evals and urges causal interventions over refusal benchmarks for reliable safety assessment across labs.

What To Do Next

Test held-out probes and ablate routing on your aligned LLMs.
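
A minimal sketch of the held-out check, assuming synthetic Gaussian vectors as stand-ins for model activations and a plain perceptron as a stand-in probe (neither is taken from the original post; `null_probe_control` and all sizes are hypothetical). Labels are drawn at random, so there is no signal to find, yet the probe can still fit the training set perfectly when the feature dimension exceeds the number of examples. Only the held-out score shows that nothing was learned.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_perceptron(X, y, max_epochs=1000):
    """Fit a bias-free linear probe with the perceptron update rule."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:  # misclassified (or on the boundary)
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return w


def accuracy(w, X, y):
    """Fraction of points whose label sign matches the probe score."""
    return sum(1 for xi, yi in zip(X, y) if yi * dot(w, xi) > 0) / len(y)


def null_probe_control(n_train=40, n_test=200, dim=200, seed=0):
    """Train a probe on random labels; return (train_acc, heldout_acc).

    Assumption: features are i.i.d. Gaussian noise, a placeholder for
    real activations. Labels are independent of the features.
    """
    rng = random.Random(seed)

    def sample(n):
        X = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
        y = [rng.choice([-1, 1]) for _ in range(n)]
        return X, y

    X_train, y_train = sample(n_train)
    X_test, y_test = sample(n_test)
    w = train_perceptron(X_train, y_train)
    return accuracy(w, X_train, y_train), accuracy(w, X_test, y_test)


if __name__ == "__main__":
    train_acc, heldout_acc = null_probe_control()
    print(f"train accuracy:    {train_acc:.2f}")
    print(f"held-out accuracy: {heldout_acc:.2f}")
```

With real activations, the same comparison applies: a probe trained on shuffled labels serves as the null baseline, and a probe is only informative if its held-out accuracy clearly exceeds that baseline.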

Who should care: Researchers & Academics
