Introduces ScratchWorld, a benchmark of 83 tasks for multimodal GUI agents working in Scratch. Offers primitive and composite interaction modes and uses execution-based evaluation. Exposes a gap between reasoning and acting in state-of-the-art agents.
Key Points
- 1. Four task categories: Create, Debug, Extend, Compute
- 2. Visuomotor control assessment
- 3. Runtime program validation
Impact Analysis
Advances evaluation of AI agents in block-based programming education.
Technical Details
Primitive mode uses drag-and-drop interaction; composite mode uses semantic APIs.
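The contrast between the two modes can be illustrated with a minimal sketch. Everything named here (`DragAndDrop`, `AttachBlock`, `parent_id`, `describe`) is hypothetical and chosen for illustration only; it is not ScratchWorld's actual action schema. The sketch only assumes that a primitive action is expressed in screen coordinates, while a composite action names the block and its target directly.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class DragAndDrop:
    """Primitive-style action: a raw pointer drag between screen coordinates."""
    start: Tuple[int, int]
    end: Tuple[int, int]


@dataclass(frozen=True)
class AttachBlock:
    """Composite-style action: a semantic call naming the block and its parent.

    The field names are assumptions made for this sketch.
    """
    opcode: str
    parent_id: str


Action = Union[DragAndDrop, AttachBlock]


def describe(action: Action) -> str:
    """Render an action as a short human-readable string."""
    if isinstance(action, DragAndDrop):
        return f"drag {action.start} -> {action.end}"
    return f"attach {action.opcode} under {action.parent_id}"


if __name__ == "__main__":
    # The same intent expressed in each mode.
    print(describe(DragAndDrop(start=(40, 210), end=(380, 160))))
    print(describe(AttachBlock(opcode="motion_movesteps", parent_id="hat_1")))
```

Under this framing, the primitive action succeeds only if the agent locates the right pixels, whereas the composite action removes that visuomotor step and leaves only the choice of block and attachment point.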