Top AIs Score <1% on ARC-AGI-3, Humans 100%

💡 Why GPT-5.5 and Claude Opus 4.7 score under 1% on an AGI benchmark that humans ace, and what that means for reasoning research
⚡ 30-Second TL;DR
What Changed
GPT-5.5 scores 0.43%, Claude Opus 4.7 scores 0.18% on ARC-AGI-3
Why It Matters
Exposes the limits of current frontier LLMs in core reasoning, suggesting a shift from pure scaling toward novel architectures. The results may redirect research toward better abstraction and test-time adaptation, and benchmarks like this make the gap between LLM and human-like intelligence measurable.
What To Do Next
Download the ARC-AGI-3 public dataset from GitHub and benchmark your model's performance on novel-task reasoning.
🧠 Deep Insight
AI-generated analysis for this event.
🔑 Enhanced Key Takeaways
- The ARC-AGI-3 benchmark was specifically designed by François Chollet to address the 'generalization gap' by requiring models to solve tasks with zero-shot adaptation to novel, non-language-based visual logic puzzles.
- Researchers have identified that current Transformer-based architectures rely heavily on pattern matching from massive pre-training corpora, which actively hinders the 'System 2' reasoning required to derive new rules from minimal examples.
- The 100% human performance metric is based on a control group of participants who demonstrated the ability to construct internal mental models of physical constraints, a capability currently absent in the latent space representations of LLMs.
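To make the "rule induction from minimal examples" idea concrete, here is a toy sketch in the spirit of an ARC-style puzzle. It does not use the official ARC-AGI-3 format or API; it simply shows a solver inducing a hidden rule (here, a color mapping) from a single demo pair and applying it zero-shot to a new grid.

```python
# Toy illustration (not the official ARC-AGI-3 format): induce a rule
# from one demonstration pair, then apply it to an unseen grid.
# Grids are lists of lists of ints (each int is a cell color).

def induce_color_map(demo_in, demo_out):
    """Infer a cell-wise color mapping from a single demo pair."""
    mapping = {}
    for row_in, row_out in zip(demo_in, demo_out):
        for a, b in zip(row_in, row_out):
            if a in mapping and mapping[a] != b:
                raise ValueError("demo is not a pure color mapping")
            mapping[a] = b
    return mapping

def apply_color_map(grid, mapping):
    """Apply the induced mapping cell-by-cell; unseen colors pass through."""
    return [[mapping.get(c, c) for c in row] for row in grid]

demo_in = [[0, 1], [1, 2]]
demo_out = [[0, 3], [3, 5]]  # hidden rule: 1 -> 3, 2 -> 5
rule = induce_color_map(demo_in, demo_out)
print(apply_color_map([[2, 1], [0, 2]], rule))  # [[5, 3], [0, 5]]
```

A human solves such puzzles by spotting the mapping immediately; the benchmark's claim is that frontier LLMs instead try to match the grid to patterns seen in pre-training.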
📊 Competitor Analysis
| Model | ARC-AGI-3 Score | Primary Architecture | Reasoning Approach |
|---|---|---|---|
| GPT-5.5 | 0.43% | Mixture-of-Experts (MoE) | Probabilistic Token Prediction |
| Claude Opus 4.7 | 0.18% | Dense Transformer | Chain-of-Thought (CoT) |
| Human Baseline | 100% | Biological Neural Network | Abstract Rule Induction |
🛠️ Technical Deep Dive
- ARC-AGI-3 utilizes a grid-based environment where the state space is defined by discrete object transformations (rotation, translation, color mapping).
- The benchmark forces a 'test-time adaptation' constraint, prohibiting the model from accessing external training data or fine-tuning during the evaluation phase.
- Failure analysis indicates that models suffer from over-fitting to the training distribution: they attempt to map novel grid puzzles to known algorithmic patterns (e.g., pathfinding or sorting) rather than inferring the unique local rules of the specific environment.
- The evaluation protocol requires the model to output the final grid state after observing only 1-3 demonstration examples, effectively testing for few-shot inductive reasoning.
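The transformations and scoring described above can be sketched in a few lines. This is an assumed minimal model of the task format, not the official ARC-AGI-3 harness: grids are integer matrices, the state space is built from discrete operations like rotation and translation, and scoring is all-or-nothing exact match on the final grid.

```python
# Minimal sketch (assumed task format, not the official ARC-AGI-3 API):
# discrete grid transformations plus exact-match scoring of a
# predicted final grid against the target.

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def translate(grid, dy, dx, fill=0):
    """Shift cells by (dy, dx); cells moved off-grid are dropped."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = grid[y][x]
    return out

def exact_match(predicted, target):
    """ARC-style scoring: all-or-nothing equality of the final grid."""
    return predicted == target

g = [[1, 0], [0, 2]]
print(rotate90(g))        # [[0, 1], [2, 0]]
print(translate(g, 0, 1)) # [[0, 1], [0, 0]]
```

The all-or-nothing metric is what makes sub-1% scores meaningful: a model that gets most cells right but misapplies the inferred rule anywhere still scores zero on that task.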
AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅 (Huxiu)
