Frontier AI models excel at advanced math yet consistently fail at multi-digit integer addition. Two interpretable error classes, operand misalignment and carry failures, account for most mistakes in top models such as Claude, GPT, and Gemini. Misalignment is linked to tokenization, while carry errors appear largely random.
Key Points
1. Accuracy degrades as digit count increases
2. Misalignment and carry errors dominate (87-92% of cases)
3. Tokenization contributes to misalignment
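The tokenization point can be made concrete with a sketch. Many modern BPE vocabularies chunk runs of digits into groups of up to three; the greedy left-to-right chunking assumed below is illustrative, not the documented behavior of any specific model's tokenizer:

```python
def chunk_digits(s: str, size: int = 3) -> list[str]:
    """Greedily chunk a digit string left-to-right, as many
    BPE tokenizers do for digit runs (an assumption here)."""
    return [s[i:i + size] for i in range(0, len(s), size)]

# Operands of different lengths get chunk boundaries at different
# place values, so digits in the same decimal position land in
# different token positions:
print(chunk_digits("1234567"))  # ['123', '456', '7']
print(chunk_digits("234567"))   # ['234', '567']
```

Under this chunking, the hundreds digit of one operand can sit at a token boundary while the other operand's hundreds digit sits mid-token, which is one plausible route to misaligned addition.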
Impact Analysis
The findings highlight fundamental limitations in AI arithmetic and motivate fixes before these models can be relied on for basic computation. Addressing them could improve AI tools for math research and education, and points to the need for better tokenization of numerical input.
Technical Details
Empirical tests were run on Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro. The two interpretable error classes cover the vast majority of failures. arXiv:2602.10416v1.
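The two error classes are mechanical enough to check programmatically. The sketch below is a hypothetical classifier, not the paper's method: it tests whether a wrong answer matches digit-wise addition with every carry dropped (carry failure) or addition with one operand shifted by a decimal place (misalignment):

```python
def digitwise_no_carry(a: int, b: int) -> int:
    """Add a and b digit by digit, discarding every carry."""
    result, place = 0, 1
    while a or b:
        result += ((a % 10 + b % 10) % 10) * place
        a, b, place = a // 10, b // 10, place * 10
    return result

def misaligned_sum(a: int, b: int, shift: int = 1) -> int:
    """Sum with operand b shifted left by `shift` decimal places."""
    return a + b * 10 ** shift

def classify_error(a: int, b: int, answer: int) -> str:
    """Label a model's answer to a + b with one error class."""
    if answer == a + b:
        return "correct"
    if answer == digitwise_no_carry(a, b):
        return "carry failure"
    if answer in (misaligned_sum(a, b), misaligned_sum(b, a)):
        return "misalignment"
    return "other"

print(classify_error(58, 67, 125))  # correct: 58 + 67 = 125
print(classify_error(58, 67, 15))   # carry failure: 8+7→5, 5+6→1
print(classify_error(58, 67, 728))  # misalignment: 58 + 670
```

A real harness would need a richer taxonomy (multiple dropped carries, partial shifts), but even this crude check illustrates why the reported error classes are called interpretable.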