
Claude Fails Elden Ring: No AGI Yet

🦙 Read original on Reddit r/LocalLLaMA

💡 Debunks AGI hype with a real Claude gaming failure; key reading for benchmark realists

⚡ 30-Second TL;DR

What Changed

A Claude agent's failure to navigate a single Elden Ring room is used to critique AGI claims by Jensen Huang and Marc Andreessen.

Why It Matters

Sparks debate on AGI benchmarks, urging practitioners to test LLMs on novel tasks beyond standard evals.

What To Do Next

Test your LLM on zero-shot gaming tasks like Elden Ring navigation.
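One way to act on this suggestion is a minimal screenshot-in, keypress-out evaluation loop. Everything here is a hypothetical sketch: `capture_frame()`, `send_key()`, `room_exited()`, and the `llm.complete()` call stand in for whatever screen-capture, input-injection, and model-API stack you actually use.

```python
# Minimal sketch of a zero-shot game-navigation eval harness.
# All environment and LLM interfaces below are hypothetical placeholders.
import base64

VALID_KEYS = {"w", "a", "s", "d", "space"}

def choose_action(llm, frame_png: bytes) -> str:
    """Ask the model for exactly one keypress given a game screenshot."""
    prompt = (
        "You are playing Elden Ring. Goal: exit the current room. "
        "Reply with exactly one key from: w, a, s, d, space."
    )
    image_b64 = base64.b64encode(frame_png).decode()
    reply = llm.complete(prompt=prompt, image_b64=image_b64)  # hypothetical API
    action = reply.strip().lower()
    return action if action in VALID_KEYS else "w"  # fall back on invalid output

def run_episode(llm, env, max_steps: int = 50) -> bool:
    """Return True if the model escapes the room within max_steps."""
    for _ in range(max_steps):
        frame = env.capture_frame()        # hypothetical screen grab
        env.send_key(choose_action(llm, frame))
        if env.room_exited():              # hypothetical success check
            return True
    return False
```

The step cap matters: an open-loop model that repeatedly walks into a wall will otherwise loop forever, which is exactly the failure mode the post describes.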

Who should care: Researchers & Academics

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • The failure of LLMs in complex, real-time environments like Elden Ring highlights the 'embodiment gap': models struggle with high-latency, non-deterministic visual feedback loops compared to static text-based reasoning.
  • Industry researchers distinguish between 'System 1' (fast, intuitive) and 'System 2' (slow, deliberative) reasoning; current architectures like Claude's struggle to maintain long-horizon planning in dynamic game environments without explicit neuro-symbolic integration.
  • The Reddit discourse reflects a broader shift in the AI community toward 'benchmarking by frustration,' where users test models against complex, multimodal tasks to expose the limitations of current scaling laws.
📊 Competitor Analysis
| Feature | Claude 3 Opus | GPT-4o | Gemini 1.5 Pro |
|---|---|---|---|
| Reasoning architecture | Transformer-based (CoT) | Multimodal Transformer | Mixture-of-Experts |
| Context window | 200k tokens | 128k tokens | 2M tokens |
| Game/real-time task capability | Low (text-heavy) | Low (vision-limited) | Moderate (long-context) |
| Pricing | $15/million input tokens | $5/million input tokens | $3.50/million input tokens |

๐Ÿ› ๏ธ Technical Deep Dive

  • Current LLM architectures lack a persistent 'world model' state, preventing them from maintaining spatial awareness in 3D environments like Elden Ring.
  • The failure to exit the room is attributed to the lack of a closed-loop feedback mechanism: the model receives a frame, but cannot predict the consequences of its actions (e.g., 'press W') on the game state.
  • Claude Opus uses a standard Transformer decoder architecture optimized for text and code, which lacks the temporal memory required for continuous, real-time decision-making in game engines.
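The closed-loop point above can be made concrete with a toy example: the agent predicts where 'w' should move it, acts, observes, and compares. The 2D "room" and its wall are invented purely for illustration; the mismatch signal (`surprised`) is what an open-loop LLM never receives.

```python
# Toy illustration of the closed-loop feedback gap: prediction vs. observation.
# The grid world and wall layout are invented for this sketch.

WALLS = {(1, 0)}  # the cell directly "north" of the start is a wall
MOVES = {"w": (0, -1), "s": (0, 1), "a": (-1, 0), "d": (1, 0)}

def predict_next(pos, action):
    """Naive open-loop prediction: assume the move always succeeds."""
    dx, dy = MOVES[action]
    return (pos[0] + dx, pos[1] + dy)

def step(pos, action):
    """Actual environment dynamics: moves into walls do nothing."""
    nxt = predict_next(pos, action)
    return pos if nxt in WALLS else nxt

def closed_loop_step(pos, action):
    """Act, then compare prediction to observation to detect failure."""
    predicted = predict_next(pos, action)
    observed = step(pos, action)
    surprised = predicted != observed  # the signal an open-loop agent never sees
    return observed, surprised
```

Pressing 'w' from (1, 1) predicts (1, 0) but observes no movement, so `surprised` is True; without that comparison, the agent keeps pressing W into the wall indefinitely.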

🔮 Future Implications
AI analysis grounded in cited sources

  • LLM-based agents will require dedicated 'World Model' layers to succeed in interactive gaming. Without internal representations of physics and spatial constraints, models cannot perform the multi-step planning required for complex game navigation.
  • AGI definitions will shift from 'passing benchmarks' to 'demonstrating autonomous task completion in open-world environments.' The failure in Elden Ring serves as a public litmus test, exposing the gap between high-scoring benchmarks and real-world utility.
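To make the 'World Model' idea less abstract, here is a minimal sketch of the kind of persistent spatial state such a layer would maintain across frames, instead of re-deriving everything from each screenshot. The data structure and its fields are invented for illustration, not any vendor's actual design.

```python
# Minimal sketch of persistent spatial memory for a game-playing agent.
# Structure and field names are illustrative assumptions.
from dataclasses import dataclass, field

MOVES = {"w": (0, -1), "s": (0, 1), "a": (-1, 0), "d": (1, 0)}

@dataclass
class WorldModel:
    position: tuple = (0, 0)
    walls: set = field(default_factory=set)     # cells known to block movement
    visited: set = field(default_factory=set)   # cells already explored

    def update(self, action: str, moved: bool) -> None:
        """Fold one action/observation pair into persistent memory."""
        dx, dy = MOVES[action]
        target = (self.position[0] + dx, self.position[1] + dy)
        if moved:
            self.position = target
            self.visited.add(target)
        else:
            self.walls.add(target)  # remember the wall; stop pressing into it

    def unexplored_moves(self) -> list:
        """Actions whose target cell is neither a known wall nor visited."""
        return [a for a, (dx, dy) in MOVES.items()
                if (self.position[0] + dx, self.position[1] + dy)
                not in self.walls | self.visited]
```

Even this trivial memory changes behavior: after one failed 'w', the agent stops proposing it, which is the multi-step planning capability the frame-by-frame setup lacks.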

โณ Timeline

2024-03
Anthropic releases Claude 3 Opus, setting new industry benchmarks for reasoning.
2024-06
Anthropic releases Claude 3.5 Sonnet, focusing on improved agentic capabilities.
2024-10
Anthropic introduces 'Computer Use' capabilities, allowing models to interact with desktop interfaces.

