
Top AIs Score <1% on ARC-AGI-3, Humans 100%


💡 Why GPT-5.5 and Claude Opus 4.7 score below 1% on an AGI benchmark that humans solve perfectly, and what that means for reasoning research

⚡ 30-Second TL;DR

What Changed

GPT-5.5 scores 0.43% and Claude Opus 4.7 scores 0.18% on ARC-AGI-3, against a 100% human baseline.

Why It Matters

Exposes the limits of current frontier LLMs in core AGI reasoning, urging a shift from scaling to novel architectures. May redirect research toward better abstraction and adaptation. Benchmarks like this quantify the gap between statistical pattern matching and human-like reasoning.

What To Do Next

Download the ARC-AGI-3 public dataset from GitHub and benchmark your model's novel-reasoning performance, for example with a harness like the sketch below.
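
A minimal sketch of such a harness, assuming the dataset follows the JSON layout of the original ARC release (each task file holds 'train' demonstration pairs and 'test' pairs, with grids as 2D lists of color integers). The `solve` function is a hypothetical placeholder for your model, and the `data/evaluation` path is illustrative:

```python
import json
from pathlib import Path

def load_tasks(task_dir: str) -> dict:
    """Load ARC-style tasks: each JSON file holds 'train' demonstration
    pairs and 'test' pairs, with grids as 2D lists of color integers."""
    return {p.stem: json.loads(p.read_text())
            for p in sorted(Path(task_dir).glob("*.json"))}

def solve(train_pairs: list, test_input: list) -> list:
    """Hypothetical placeholder: given the demonstration pairs, return
    the predicted output grid for test_input. Swap in your model here."""
    return test_input  # trivial identity baseline

def exact_match_score(tasks: dict) -> float:
    """All-or-nothing scoring: a task counts as solved only if every
    test output grid is reproduced cell for cell."""
    solved = sum(
        all(solve(task["train"], pair["input"]) == pair["output"]
            for pair in task["test"])
        for task in tasks.values()
    )
    return solved / max(len(tasks), 1)

if __name__ == "__main__":
    tasks = load_tasks("data/evaluation")  # illustrative local path
    print(f"Exact-match accuracy: {exact_match_score(tasks):.2%}")
```

With the trivial identity baseline this prints an accuracy near 0%, roughly where the article places today's frontier models.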

Who should care: Researchers & Academics

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • The ARC-AGI-3 benchmark was designed by François Chollet to address the 'generalization gap': models must adapt to novel, non-language-based visual logic puzzles never seen in training, using only the few demonstrations each task provides.
  • Researchers have identified that current Transformer-based architectures rely heavily on pattern matching from massive pre-training corpora, which actively hinders the 'system 2' reasoning required to derive new rules from minimal examples.
  • The 100% human performance metric is based on a control group of participants who demonstrated the ability to construct internal mental models of physical constraints, a capability currently absent in the latent space representations of LLMs.
📊 Competitor Analysis
| Model | ARC-AGI-3 Score | Primary Architecture | Reasoning Approach |
|---|---|---|---|
| GPT-5.5 | 0.43% | Mixture-of-Experts (MoE) | Probabilistic Token Prediction |
| Claude Opus 4.7 | 0.18% | Dense Transformer | Chain-of-Thought (CoT) |
| Human Baseline | 100% | Biological Neural Network | Abstract Rule Induction |

🛠️ Technical Deep Dive

  • ARC-AGI-3 utilizes a grid-based environment where the state space is defined by discrete object transformations (rotation, translation, color mapping); a minimal sketch of these primitives follows this list.
  • The benchmark forces a 'test-time adaptation' constraint, prohibiting the model from accessing external training data or fine-tuning during the evaluation phase.
  • Failure analysis indicates that models suffer from 'over-fitting to the training distribution,' where they attempt to map novel grid puzzles to known algorithmic patterns (e.g., pathfinding or sorting) rather than inferring the unique local rules of the specific environment.
  • The evaluation protocol requires the model to output the final grid state after observing only 1-3 demonstration examples, effectively testing for few-shot inductive reasoning; a toy version of this induction loop appears after the list.
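
To make that state space concrete, here is a minimal sketch of the three transformation families the first bullet names, using plain Python lists as grids. The actual primitive set of ARC-AGI-3 is not specified in the article, so these definitions are illustrative assumptions:

```python
Grid = list[list[int]]  # rows of color integers, as in ARC grids

def rotate90(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def translate(grid: Grid, dr: int, dc: int, fill: int = 0) -> Grid:
    """Shift all cells by (dr, dc), padding vacated cells with `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = grid[r][c]
    return out

def recolor(grid: Grid, mapping: dict[int, int]) -> Grid:
    """Apply a color mapping, leaving unmapped colors unchanged."""
    return [[mapping.get(v, v) for v in row] for row in grid]
```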
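Building on the primitives above, this toy solver illustrates the few-shot induction the protocol demands: enumerate short compositions of candidate transformations, keep the first program consistent with every demonstration pair, and apply it to the test input. This is a sketch of the search problem under assumed primitives, not the method used by any model cited here:

```python
from itertools import product

# Candidate single-step operations built from the primitives above.
PRIMITIVES = [
    ("rot90", rotate90),
    ("shift_down", lambda g: translate(g, 1, 0)),
    ("shift_right", lambda g: translate(g, 0, 1)),
    ("swap_1_2", lambda g: recolor(g, {1: 2, 2: 1})),
]

def induce_program(demos: list, max_depth: int = 2):
    """Brute-force search over compositions of up to max_depth
    primitives; return the first program matching every demo pair."""
    for depth in range(1, max_depth + 1):
        for combo in product(PRIMITIVES, repeat=depth):
            def run(grid, combo=combo):
                for _name, fn in combo:
                    grid = fn(grid)
                return grid
            if all(run(d["input"]) == d["output"] for d in demos):
                return [name for name, _fn in combo], run
    return None, None

# One demo whose hidden rule is a clockwise rotation. A single demo
# can be ambiguous (a right shift also fits this one); the protocol's
# 1-3 demonstrations exist precisely to disambiguate the rule.
demos = [{"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]}]
names, program = induce_program(demos)
if program:
    print(names, "->", program([[0, 0], [2, 0]]))
```

Real solvers replace the brute-force loop with learned priors over a far richer transformation language; the article's point is that frontier LLMs currently fail to perform even this kind of rule induction reliably on unseen grids.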

🔮 Future Implications

AI analysis grounded in cited sources.

  • LLM-based architectures will be abandoned for AGI research in favor of neuro-symbolic systems: the persistent failure of Transformers on ARC-AGI-3 suggests that statistical token prediction is fundamentally incompatible with the abstract rule-based reasoning required for general intelligence.
  • Benchmark saturation will shift from language-based tests to physical-world simulation environments: as language benchmarks reach human-level performance, the industry will pivot to ARC-style benchmarks to measure true cognitive adaptability rather than linguistic fluency.

Timeline

  • 2019-11: François Chollet releases the original ARC (Abstraction and Reasoning Corpus) benchmark.
  • 2024-06: The ARC Prize competition is launched to incentivize progress on AGI-level reasoning.
  • 2026-03: ARC-AGI-3 is finalized, introducing significantly more complex, unseen environments to challenge frontier models.


AI-curated news aggregator. All content rights belong to original publishers.
Original source: 虎嗅