Study evaluates 17 LLMs on generating Python code from an ODD (Overview, Design concepts, Details) specification of a predator-prey model. Assesses executability, fidelity, and efficiency against a NetLogo baseline. GPT-4.1 excels, but reliability varies.
Key Points
- 1. ODD specification translation task
- 2. Statistical validation against baseline
- 3. Limits in scientific reproducibility
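The summary does not say which statistical test the study uses for validation against the baseline. As a rough illustration of the idea only, the sketch below compares replicate outputs of a generated Python model with replicate outputs of the NetLogo baseline using a two-sided permutation test on the difference of means. The function name, the choice of test, and the sample values are all assumptions made for this example, not details from the study.

```python
import random
from statistics import mean


def permutation_test(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Hypothetical helper: returns the estimated probability of seeing a
    mean difference at least as large as the observed one if both
    samples came from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(mean(baseline) - mean(candidate))
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign run labels at random
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Made-up final prey counts from replicate runs, for illustration only.
    netlogo_runs = [412, 398, 405, 420, 391, 409, 415, 402]
    python_runs = [408, 401, 399, 417, 395, 411, 406, 404]
    p_value = permutation_test(netlogo_runs, python_runs)
    print(f"p-value for difference in means: {p_value:.3f}")
```

A large p-value here only means the test found no evidence of a difference in this one summary statistic; it does not by itself establish that the generated model is behaviorally faithful.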
Impact Analysis
Clarifies LLMs' role in model engineering. Aids reproducible environmental modeling.
Technical Details
Staged checks for executability and maintainability. Focuses on behavioral faithfulness.
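A minimal sketch of how staged checks could be wired together, assuming each stage only runs if the previous one passed. Stage 1 executes the generated script in a subprocess. Stage 2 is a crude stand-in for a maintainability check (every function must have a docstring). All names, the timeout, and the choice of proxy are hypothetical; the summary does not describe the study's actual pipeline.

```python
import ast
import subprocess
import sys
import tempfile
from pathlib import Path


def check_executable(source, timeout_s=30):
    """Stage 1: does the generated script run to completion without error?"""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "model.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
    if result.returncode != 0:
        return False, result.stderr.strip()
    return True, result.stdout


def check_documented(source):
    """Stage 2 (assumed proxy): every function definition has a docstring."""
    tree = ast.parse(source)
    missing = [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and ast.get_docstring(node) is None
    ]
    return (not missing), ", ".join(missing)


# Order matters: later stages assume earlier ones passed.
STAGES = [("executable", check_executable), ("documented", check_documented)]


def run_stages(source, stages=STAGES):
    """Run (name, check) pairs in order and stop at the first failure."""
    report = []
    for name, check in stages:
        passed, _detail = check(source)
        report.append((name, passed))
        if not passed:
            break
    return report
```

A behavioral-fidelity stage, such as a statistical comparison with baseline output, would slot in as a further entry in `STAGES`. Stopping at the first failure keeps later checks from running on code that cannot execute.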