🔥 36氪
AI Video Models Fail Counting 1-10 Test
💡 Core limits of video AI exposed: essential reading for developers who need reliable generation.
⚡ 30-Second TL;DR
What Changed
All major video models fail 'count 1-10 with fingers' benchmark
Why It Matters
Reveals the gap between visual realism and true comprehension, pushing a shift toward world models. Delays AI from replacing human creators in logic-heavy tasks.
What To Do Next
Benchmark your video model on fofr's '1-10 fingers' test from X; a minimal reproduction sketch follows this section.
Who should care: Researchers & Academics
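A minimal sketch of the reproduction, assuming you run models through the `replicate` Python client (any SDK that accepts a text prompt works the same way); the model slug below is a placeholder, not a reference to a specific model:

```python
# Sketch: run the '1-10 fingers' counting test against a hosted video model.
# Assumes `pip install replicate` and a REPLICATE_API_TOKEN in the environment;
# the model slug is a placeholder; substitute the model you want to benchmark.
import replicate

PROMPT = (
    "A person facing the camera counts from 1 to 10, "
    "raising one finger at a time, both hands clearly visible"
)

output = replicate.run(
    "your-org/your-video-model",  # placeholder slug, swap in a real model
    input={"prompt": PROMPT},
)

# Most hosted video models return a URL (or list of URLs) to the rendered clip.
print(output)
# Scoring is manual: step through the clip and check, at each count, that the
# number of raised fingers matches the expected value (10 checks in total).
```

Per the article's framing, the typical failure is not a missing hand but a finger count that drifts, repeats, or jumps mid-sequence.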
🧠 Deep Insight
Web-grounded analysis with 7 cited sources.
🔑 Enhanced Key Takeaways
- Recent benchmarking (VBench-2.0) quantifies that video models depict human actions at only about 50% accuracy, validating the counting-with-fingers failure as a systemic, industry-wide limitation rather than an isolated bug[4].
- Seedance 1.0 and 1.5 Pro have emerged as market leaders specifically because they excel at 'multi-agent interactions, complex action sequences, and dynamic camera movements while accurately following detailed prompts', which suggests hand-gesture sequencing remains a differentiation point even among top performers[2][6].
- The market has bifurcated into two competing approaches: Google Veo 3 and OpenAI Sora 2 pursue 'General World Models' that aim to truly simulate physics and temporal consistency, while Runway optimizes for creator speed. This split indicates that solving the counting test requires fundamental architectural shifts toward physics-aware generation, not incremental improvements[4].
📊 Competitor Analysis
| Model | Strength | Known Limitation | Benchmark Status |
|---|---|---|---|
| Seedance 1.5 Pro | Complex action sequences, prompt adherence | Not explicitly tested on counting task | Ranked #1 on leading video AI benchmark[6] |
| Sora 2 | Action & dynamic scenes | Hand/finger precision not detailed | Top-tier contender, capital-intensive development[4] |
| Google Veo 3 | Cinematic realism, character consistency | Struggles with human action accuracy (~50%)[4] | High-fidelity output leader[1] |
| Kling 2.6 | Realistic human faces, lip-sync, dialogue | Hand rendering weakness implied | Fast generation times[1] |
| Runway Gen-3 | Creator/editing speed | Not positioned for physics-heavy tasks | Best for creator workflow speed[4] |
🛠️ Technical Deep Dive
- Root Cause Analysis: Pixel-prediction architectures lack 3D priors and persistent frame memory, forcing models to predict hand positions frame-by-frame without understanding skeletal constraints or finger anatomy[article summary]
- Hand Anatomy Challenge: Complex hand anatomy with 27 degrees of freedom, combined with sparse training data (hands appear in <5% of video frames), creates a data-scarcity problem that statistical prediction cannot overcome (see the degrees-of-freedom sketch after this list)[article summary]
- World Model Approach: Emerging solutions like World Labs' Marble pursue structural world understanding by building explicit 3D representations and physics simulators, moving beyond pixel-space prediction[article summary]
- Temporal Consistency Gap: Current models lack frame-memory mechanisms to enforce consistency across sequences, causing finger positions and gestures to drift or contradict across frames (see the drift-check sketch after this list)[article summary]
- Benchmark Quantification: VBench-2.0 reveals that human action accuracy across all leading models hovers around 50%, indicating this is a fundamental architectural limitation rather than a tuning issue[4]
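To make the 27-degrees-of-freedom figure concrete, here is one common kinematic breakdown from the robotics literature (the per-joint split is an illustrative assumption; exact counts vary by model):

```python
# One common kinematic model of the human hand, totaling 27 degrees of freedom.
# The per-joint split varies slightly across sources; this version is illustrative.
HAND_DOF = {
    "four fingers (2 MCP + 1 PIP + 1 DIP each)": 4 * 4,  # 16
    "thumb (2 CMC + 2 MCP + 1 IP)": 5,
    "wrist (3 rotation + 3 translation)": 6,
}

total = sum(HAND_DOF.values())
assert total == 27
print(f"total hand DOF: {total}")
# Every transition in the 1-10 test is a trajectory through this 27-dimensional
# joint space; structure that pixel-space predictors never represent explicitly.
```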
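The temporal-consistency gap can also be measured directly. A sketch, assuming some hand detector already gives you a raised-finger count per frame (the detector itself is out of scope here): flag every frame where the count contradicts the expected monotone 1-to-10 sequence.

```python
# Sketch: flag temporal drift in a counting clip, given per-frame finger counts
# produced by any hand detector (e.g., landmark-based finger counting).
# The expected signal for the 1-10 test steps up by one finger at a time.

def drift_frames(counts: list[int]) -> list[int]:
    """Return frame indices where the raised-finger count contradicts
    the monotone 1..10 counting sequence (drops, skips, or jumps)."""
    bad = []
    for i in range(1, len(counts)):
        step = counts[i] - counts[i - 1]
        if step not in (0, 1):  # holding a count or adding one finger is valid
            bad.append(i)
    return bad

# Example: a model that flickers between 3 and 5 fingers mid-sequence.
per_frame = [1, 1, 2, 3, 5, 3, 4, 4, 5, 6]
print(drift_frames(per_frame))  # -> [4, 5] (the 3->5 jump and 5->3 drop)
```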
🔮 Future Implications
AI analysis grounded in cited sources.
World models will become the primary differentiator in video generation by 2027, as pixel-prediction approaches hit a hard ceiling on hand/action tasks.
The counting test exposes that statistical prediction without physics understanding cannot solve sequential hand gestures; companies investing in 3D-aware architectures (World Labs, etc.) will capture premium creator segments.
Hand-gesture benchmarks will emerge as a standard evaluation metric, similar to how image models adopted COCO and ImageNet.
The counting-1-10 test is simple, reproducible, and exposes a critical gap; industry standardization around hand-action benchmarks will accelerate model development and create accountability.
Specialized hand-rendering modules will be integrated into general video models within 12 months, rather than solving the problem end-to-end.
Given the sparse training data and anatomical complexity, hybrid approaches that combine general video generation with dedicated hand-synthesis networks (similar to face-swapping pipelines) will likely emerge as the pragmatic near-term solution; a toy sketch of such a pipeline follows.
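A toy sketch of that hybrid idea, with stand-in components: a general model produces frames, a detector localizes hand regions, and a dedicated hand network re-renders just those crops before compositing. Every component here is a hypothetical stand-in; no such packaged pipeline exists today.

```python
# Sketch of a hybrid pipeline: composite dedicated hand renders back into
# frames from a general video model. All components below are stand-ins.
import numpy as np

def hybrid_generate(frames, poses, locate_hands, resynthesize_hand):
    """Re-render hand regions in each frame with a dedicated hand module."""
    out = []
    for frame, pose in zip(frames, poses):
        frame = frame.copy()
        for (y0, y1, x0, x1) in locate_hands(frame):
            crop = frame[y0:y1, x0:x1]
            frame[y0:y1, x0:x1] = resynthesize_hand(crop, pose)
        out.append(frame)
    return out

# Toy demo: the 'detector' returns one fixed box and the 'hand network' just
# brightens the region, standing in for a real pose-conditioned re-render.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(3)]
poses = [1, 2, 3]  # target raised-finger count per frame
result = hybrid_generate(
    frames, poses,
    locate_hands=lambda f: [(16, 48, 16, 48)],
    resynthesize_hand=lambda crop, pose: np.clip(crop + 10 * pose, 0, 255),
)
print(len(result), result[0].shape)  # 3 (64, 64, 3)
```

The design mirrors face-swapping pipelines: the general model handles scene layout and motion, while a narrow specialist model owns the region it was trained densely on.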
⏳ Timeline
2025-Q4
VBench-2.0 benchmark released, quantifying human action accuracy at ~50% across leading models, establishing baseline for counting-task failures
2026-01
Seedance 1.5 Pro achieves #1 ranking on leading video AI benchmark, demonstrating that prompt-adherence and action-sequence handling are differentiators
2026-02
Multiple independent testers (YouTube creators, DataCamp, Substack analysts) publish comprehensive rankings confirming Sora 2, Veo 3, and Seedance as top-tier, with hand-rendering gaps noted across all
2026-03
36氪 publishes 'AI Video Models Fail Counting 1-10 Test' article, formalizing the hand-gesture limitation as a critical industry-wide benchmark failure
📎 Sources (7)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: 36氪