MLLMs excel at perception but fail at mathematical spatial reasoning, scoring under 60% on tasks humans solve with 95% accuracy. MathSpatial addresses this gap with three components: MathSpatial-Bench (a 2K-problem evaluation set), MathSpatial-Corpus (8K training samples), and MathSpatial-SRT for structured reasoning. Fine-tuning Qwen2.5-VL-7B on this data achieves strong results while using 25% fewer inference tokens.
Key Points
1. MLLMs score under 60% on mathematical spatial tasks that humans solve with 95% accuracy
2. The MathSpatial framework comprises MathSpatial-Bench (2K problems), MathSpatial-Corpus (8K training samples), and MathSpatial-SRT
3. Fine-tuning Qwen2.5-VL-7B yields strong results while using 25% fewer inference tokens
Impact Analysis
AI researchers and MLLM developers gain a new benchmark and training corpus for addressing spatial reasoning weaknesses. The work matters because it exposes a key limitation of vision-language models that is critical for applications such as robotics and navigation, and it could accelerate progress toward human-level spatial intelligence in AI.
Technical Details
MathSpatial provides MathSpatial-Bench for evaluation (2K problems), MathSpatial-Corpus for training (8K samples), and MathSpatial-SRT for generating structured reasoning traces. Fine-tuning Qwen2.5-VL-7B on this data improves performance on spatial math tasks while reducing inference tokens by 25%. The framework targets the perception-reasoning gap in multimodal large language models.
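To make the pipeline concrete, here is a minimal Python sketch of how a structured-reasoning-trace (SRT) training sample might be represented and how exact-match accuracy could be scored on MathSpatial-Bench-style items. The schema and field names (`image`, `question`, `trace`, `answer`), the `<step>`/`<answer>` tags, and the helper functions are all assumptions for illustration, not the paper's actual format.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one MathSpatial-Corpus training sample.
# Field names are assumptions; the paper's actual schema may differ.
@dataclass
class SRTSample:
    image: str                                       # path to the diagram/scene image
    question: str                                    # spatial-math question text
    trace: list[str] = field(default_factory=list)   # structured reasoning steps
    answer: str = ""                                 # gold final answer


def format_srt_target(sample: SRTSample) -> str:
    """Render a sample as a supervised fine-tuning target: the model learns
    to emit compact, step-tagged reasoning before its answer, which is one
    plausible way structured traces could cut inference tokens versus
    free-form chain-of-thought."""
    steps = "\n".join(f"<step{i + 1}> {s}" for i, s in enumerate(sample.trace))
    return f"Question: {sample.question}\n{steps}\n<answer> {sample.answer}"


def bench_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Exact-match accuracy over (problem_id -> answer) dicts, in the style
    of a 2K-problem MathSpatial-Bench evaluation."""
    correct = sum(
        predictions.get(pid, "").strip() == ans.strip()
        for pid, ans in gold.items()
    )
    return correct / len(gold)


if __name__ == "__main__":
    sample = SRTSample(
        image="scene_0042.png",
        question="What is the distance between the cube and the sphere?",
        trace=[
            "Locate cube at (1, 2).",
            "Locate sphere at (4, 6).",
            "Apply distance formula: sqrt(3^2 + 4^2) = 5.",
        ],
        answer="5",
    )
    print(format_srt_target(sample))
    print(bench_accuracy({"q1": "5"}, {"q1": "5"}))  # -> 1.0
```

In this sketch, the tagged steps keep each reasoning move short and machine-checkable, which is consistent with (though not confirmed as) how structured traces could yield the reported token savings over unstructured reasoning.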