
Open Models Fail AI Safety Benchmarks


💡 New safety benchmark: Claude crushes open models like DeepSeek; fix your alignment gaps

⚡ 30-Second TL;DR

What Changed

Claude-4.5 leads across the 94 safety dimensions with near-zero violations.

Why It Matters

Pushes Chinese open-source models to pay an 'alignment tax' to stay competitive; widens gaps in high-stakes AI-for-Science domains.

What To Do Next

Run your LLM on ForesightSafety Bench's 94 dimensions to benchmark safety.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • ForesightSafety Bench was developed by the Beijing Institute of AI Safety and Governance, Beijing Key Laboratory of Safe AI and Superalignment, and Chinese Academy of Sciences[1][2][4].
  • The benchmark evaluated 22 state-of-the-art LLMs, with GPT-5.2 achieving the lowest overall risk rate of 9.02%, excelling in areas like Path Planning Safety and Uncertainty-Aware Safety[2].
  • Frontier models show elevated risks in Risky Agentic Autonomy, AI4Science Safety, Embodied AI Safety, Social AI Safety, and Existential Risks, despite strong fundamental safety performance[2][3].
  • An 'inverse degradation' effect occurs where models optimized for complex reasoning, like DeepSeek-V3.2-Speciale, exhibit heightened vulnerabilities due to capability-safety trade-offs[3].
📊 Competitor Analysis
| Model | Overall Risk Rate | Key Strengths | Key Weaknesses |
| --- | --- | --- | --- |
| Claude-4.5 (Haiku/Sonnet) | Lowest in most categories | Exceptional resilience in Fundamental, Extended, and Industrial Safety | Not specified |
| GPT-5.2 | 9.02% | Path Planning (3.43%), Uncertainty-Aware (4.39%), Equipment Safety (4.64%) | Higher risks in some frontier areas |
| DeepSeek-V3.2-Speciale | Higher baseline vulnerabilities | Strong long-horizon reasoning | Elevated risks across multiple safety metrics |
| Gemini-3-Flash | Competitive, behind Claude | Balance of capability and safety | Lags Claude in sub-categories |

🛠️ Technical Deep Dive

  • Framework structure: 7 Fundamental Safety pillars (Privacy/Data Misuse, Illegal Use, False Information, Physical/Psychological Harm, Hate/Expressive Harm, Sexual Content, Minor-related Harm), 5 Extended Safety pillars (Risky Agentic Autonomy, AI4Science Safety, Embodied AI Safety, Social AI Safety, Catastrophic/Existential Risks), and 8 Industrial Safety domains, totaling 94 risk subcategories[1][2][3][4] (see the taxonomy sketch after this list).
  • The benchmark has accumulated tens of thousands of structured risk data points and assessment results, supporting a data-driven, hierarchically organized evaluation[1][2][4].
  • Risk rates are reported so that lower values indicate better safety (fewer unsafe actions); all 22 LLMs were evaluated via systematic testing[2] (see the risk-rate sketch after this list).
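
The tiered layout lends itself to a simple data representation. Here is a minimal Python sketch of the taxonomy, assuming a plain mapping from tier to pillar names; only the pillar names are quoted from the article, while the `FRAMEWORK` identifier and the mapping layout are illustrative, not the benchmark's actual schema.

```python
# Illustrative sketch of ForesightSafety Bench's three-tier taxonomy.
# Pillar names come from the article; the dict layout and the FRAMEWORK
# name are assumptions, not the benchmark's published schema.
FRAMEWORK: dict[str, list[str]] = {
    "Fundamental Safety": [  # 7 pillars
        "Privacy/Data Misuse", "Illegal Use", "False Information",
        "Physical/Psychological Harm", "Hate/Expressive Harm",
        "Sexual Content", "Minor-related Harm",
    ],
    "Extended Safety": [  # 5 pillars
        "Risky Agentic Autonomy", "AI4Science Safety", "Embodied AI Safety",
        "Social AI Safety", "Catastrophic/Existential Risks",
    ],
    "Industrial Safety": [],  # 8 domains, not enumerated in the article
}

# The 94 risk subcategories sit one level below these pillars; the
# article does not list them individually, so they are omitted here.
```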
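
The article defines the metric only directionally (lower is better); a natural reading is the fraction of test prompts on which a model produced an unsafe response, computed per risk dimension. A minimal sketch under that assumption, where the `risk_rates` helper and the judged `(dimension, is_unsafe)` input format are hypothetical:

```python
from collections import defaultdict

def risk_rates(results):
    """Per-dimension risk rate: fraction of prompts judged unsafe.

    `results` is assumed to be an iterable of (dimension, is_unsafe)
    pairs, one per test prompt. The actual judging protocol is not
    specified in the article; this is an illustrative reading only.
    """
    unsafe = defaultdict(int)
    total = defaultdict(int)
    for dimension, is_unsafe in results:
        total[dimension] += 1
        unsafe[dimension] += int(is_unsafe)
    # Lower is better: fewer unsafe responses per prompt tested.
    return {d: unsafe[d] / total[d] for d in total}

# Example: ~34 unsafe responses out of 1,000 prompts gives 3.4%,
# in the same range as GPT-5.2's reported 3.43% on Path Planning.
sample = ([("Path Planning Safety", False)] * 966
          + [("Path Planning Safety", True)] * 34)
print(risk_rates(sample))  # {'Path Planning Safety': 0.034}
```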

🔮 Future Implications

AI analysis grounded in cited sources.

  • Open-source models will require dedicated alignment scaling to match closed models' safety thresholds by 2027: the benchmark shows that capability gains alone do not improve safety, imposing an 'alignment tax' that demands targeted investment in open models[1][3].
  • Global AI safety benchmarks will increasingly converge on shared East-West standards: ForesightSafety Bench mirrors Western frameworks in its risk categories, fostering common evaluation practices despite geopolitical differences[1][4].
  • Existential-risk evaluations will become standard in LLM benchmarks: high vulnerabilities in Loss of Human Agency and Power Seeking across models highlight the need for routine frontier-risk testing[2][3].

Timeline

  • 2026-02: ForesightSafety Bench paper published on arXiv, introducing the 94-risk framework and evaluating 22 LLMs
  • 2026-02-15: ForesightSafety Bench listed in an adversarial-AI-papers GitHub repository
  • 2026-02: Leaderboard released on the official ForesightSafety Bench site, with Claude-4.5 topping the safety rankings

AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅