Alibaba's Gaode team introduces SpatialGenEval, a comprehensive benchmark for spatial intelligence in text-to-image (T2I) models, accepted to ICLR 2026. It uses long, information-dense prompts to evaluate 10 sub-dimensions spanning spatial perception, reasoning, and interaction across 25 real-world scenarios. Tests on 23 state-of-the-art (SOTA) models reveal significant deficiencies in current T2I spatial capabilities.
Key Points
- SpatialGenEval benchmarks 4 major dimensions and 10 sub-dimensions of spatial intelligence
- Covers 25 real-world scenarios with long, information-dense prompts for complex spatial tasks
- Evaluates 23 SOTA T2I models, exposing gaps in perception, reasoning, and interaction
- Open-source code is available on GitHub; the paper is on arXiv
Impact Analysis
This benchmark exposes the shallow spatial cognition of leading T2I models and points to concrete areas for improvement in real-world applications such as navigation and AR. It sets a new standard for evaluating spatial logic and could accelerate progress in multimodal AI.
Technical Details
The benchmark divides spatial intelligence into perception (attributes, geometry), reasoning (relations, counts), and interaction (physics, occlusion). It employs high-fidelity, information-dense prompts to probe the 'What', 'Where', 'How', and 'Why' of spatial scenes. Results show attribute drift, geometric biases, and spatial-logic failures across the evaluated models.
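To make the structure above concrete, here is a minimal sketch of how a per-dimension evaluation harness of this kind could be organized. This is not the project's actual code: the taxonomy mapping, the `PromptCase`/`score_model` names, and the simple mean-accuracy aggregation are all illustrative assumptions, not SpatialGenEval's released API or metric.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical taxonomy mirroring the description above: three capability
# groups, each with the sub-dimensions named in the article.
TAXONOMY = {
    "perception": ["attributes", "geometry"],
    "reasoning": ["relations", "counts"],
    "interaction": ["physics", "occlusion"],
}

@dataclass
class PromptCase:
    """One long, information-dense prompt plus the sub-dimensions it probes."""
    prompt: str
    scenario: str  # e.g. one of the 25 real-world scenarios
    checks: dict = field(default_factory=dict)  # sub-dimension -> pass (1) / fail (0)

def score_model(cases: list[PromptCase]) -> dict:
    """Aggregate per-case pass/fail judgments into per-sub-dimension scores.

    Assumed aggregation (simple mean accuracy); the actual SpatialGenEval
    metric may differ.
    """
    per_dim: dict[str, list[int]] = {}
    for case in cases:
        for dim, passed in case.checks.items():
            per_dim.setdefault(dim, []).append(int(passed))
    return {dim: mean(vals) for dim, vals in per_dim.items()}

# Toy usage: judgments for a single model on two illustrative prompts.
cases = [
    PromptCase(
        prompt="A red mug left of a blue kettle on a cluttered kitchen counter...",
        scenario="kitchen",
        checks={"attributes": 1, "relations": 0, "counts": 1},
    ),
    PromptCase(
        prompt="Three cyclists partially occluded by a parked van at a crosswalk...",
        scenario="street",
        checks={"occlusion": 0, "counts": 1, "geometry": 1},
    ),
]
print(score_model(cases))  # e.g. {'attributes': 1.0, 'relations': 0.0, ...}
```

A harness structured this way makes the reported failure modes (attribute drift, geometric biases, logic failures) directly traceable to individual sub-dimension scores rather than a single aggregate number.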