Top AIs Score <1% on ARC-AGI-3, Humans 100%

💡 Why GPT-5.5 and Claude Opus 4.7 score under 1% on an AGI benchmark that humans ace, and what that means for reasoning research
⚡ 30-Second TL;DR
What Changed
GPT-5.5 scores 0.43%, Claude Opus 4.7 scores 0.18% on ARC-AGI-3
Why It Matters
Exposes the limits of current frontier LLMs in core reasoning, suggesting a shift from pure scaling toward novel architectures. The results may redirect research toward better abstraction and test-time adaptation, and benchmarks like this make the gap between LLM and human-like intelligence measurable.
What To Do Next
Download the ARC-AGI-3 public dataset from GitHub and benchmark your model's performance on novel-task reasoning.
🧠 Deep Insight
AI-generated analysis for this event.
🔑 Enhanced Key Takeaways
- The ARC-AGI-3 benchmark was specifically designed by François Chollet to address the 'generalization gap' by requiring models to solve tasks with zero-shot adaptation to novel, non-language-based visual logic puzzles.
- Researchers have identified that current Transformer-based architectures rely heavily on pattern matching from massive pre-training corpora, which actively hinders the 'System 2' reasoning required to derive new rules from minimal examples.
- The 100% human performance metric is based on a control group of participants who demonstrated the ability to construct internal mental models of physical constraints, a capability currently absent in the latent space representations of LLMs.
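To make the "rule induction from minimal examples" idea concrete, here is a toy sketch in the spirit of an ARC-style puzzle. It does not use the official ARC-AGI-3 format or API; it simply shows a solver inducing a hidden rule (here, a color mapping) from a single demo pair and applying it zero-shot to a new grid.

```python
# Toy illustration (not the official ARC-AGI-3 format): induce a rule
# from one demonstration pair, then apply it to an unseen grid.
# Grids are lists of lists of ints (each int is a cell color).

def induce_color_map(demo_in, demo_out):
    """Infer a cell-wise color mapping from a single demo pair."""
    mapping = {}
    for row_in, row_out in zip(demo_in, demo_out):
        for a, b in zip(row_in, row_out):
            if a in mapping and mapping[a] != b:
                raise ValueError("demo is not a pure color mapping")
            mapping[a] = b
    return mapping

def apply_color_map(grid, mapping):
    """Apply the induced mapping cell-by-cell; unseen colors pass through."""
    return [[mapping.get(c, c) for c in row] for row in grid]

demo_in = [[0, 1], [1, 2]]
demo_out = [[0, 3], [3, 5]]  # hidden rule: 1 -> 3, 2 -> 5
rule = induce_color_map(demo_in, demo_out)
print(apply_color_map([[2, 1], [0, 2]], rule))  # [[5, 3], [0, 5]]
```

A human solves such puzzles by spotting the mapping immediately; the benchmark's claim is that frontier LLMs instead try to match the grid to patterns seen in pre-training.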
📊 Competitor Analysis
| Model | ARC-AGI-3 Score | Primary Architecture | Reasoning Approach |
|---|---|---|---|
| GPT-5.5 | 0.43% | Mixture-of-Experts (MoE) | Probabilistic Token Prediction |
| Claude Opus 4.7 | 0.18% | Dense Transformer | Chain-of-Thought (CoT) |
| Human Baseline | 100% | Biological Neural Network | Abstract Rule Induction |
🛠️ Technical Deep Dive
- ARC-AGI-3 utilizes a grid-based environment where the state space is defined by discrete object transformations (rotation, translation, color mapping).
- The benchmark forces a 'test-time adaptation' constraint, prohibiting the model from accessing external training data or fine-tuning during the evaluation phase.
- Failure analysis indicates that models suffer from over-fitting to the training distribution: they attempt to map novel grid puzzles to known algorithmic patterns (e.g., pathfinding or sorting) rather than inferring the unique local rules of the specific environment.
- The evaluation protocol requires the model to output the final grid state after observing only 1-3 demonstration examples, effectively testing for few-shot inductive reasoning.
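The transformations and scoring described above can be sketched in a few lines. This is an assumed minimal model of the task format, not the official ARC-AGI-3 harness: grids are integer matrices, the state space is built from discrete operations like rotation and translation, and scoring is all-or-nothing exact match on the final grid.

```python
# Minimal sketch (assumed task format, not the official ARC-AGI-3 API):
# discrete grid transformations plus exact-match scoring of a
# predicted final grid against the target.

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def translate(grid, dy, dx, fill=0):
    """Shift cells by (dy, dx); cells moved off-grid are dropped."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = grid[y][x]
    return out

def exact_match(predicted, target):
    """ARC-style scoring: all-or-nothing equality of the final grid."""
    return predicted == target

g = [[1, 0], [0, 2]]
print(rotate90(g))        # [[0, 1], [2, 0]]
print(translate(g, 0, 1)) # [[0, 1], [0, 0]]
```

The all-or-nothing metric is what makes sub-1% scores meaningful: a model that gets most cells right but misapplies the inferred rule anywhere still scores zero on that task.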
AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅 (Huxiu)
