Presents a structured capabilities model for extracting interpretable LLM capabilities from benchmark results, addressing the construct validity of those benchmarks. Fitted to OpenLLM Leaderboard data, it fits better than latent factor models and predicts better than scaling laws. It combines the two approaches by separating model scale from capabilities.
Key Points
- 1. Introduces a structured capabilities model that combines scaling laws with latent factor models
- 2. Outperforms both alternatives on parsimonious fit and out-of-distribution prediction
- 3. Is fitted on a large sample of OpenLLM Leaderboard results
- 4. Separates model scale, which informs capabilities, from observed scores, which capabilities determine up to measurement error
Impact Analysis
Makes LLM evaluation more reliable by supporting model selection that does not rest solely on raw scores from possibly contaminated benchmarks. Helps researchers predict underlying capabilities on unseen tasks.
Technical Details
The model has two parts. A scaling-law component lets model scale inform latent capabilities, and a measurement component lets those capabilities predict benchmark scores up to measurement error. On OpenLLM Leaderboard data it beats latent factor models, which ignore scale, and scaling laws, which ignore measurement error.
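
The two-part structure described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the linear link from log scale to capabilities, the Gaussian noise terms, and all names (`simulate_scores`, `slopes`, `loadings`, `cap_sd`, `err_sd`) are hypothetical choices made for illustration.

```python
import random


def simulate_scores(log_scales, slopes, loadings, cap_sd=0.3, err_sd=0.05, seed=0):
    """Simulate benchmark scores under an assumed structured-capabilities layout.

    log_scales: one log-scale value per model (e.g. log parameter count)
    slopes:     slopes[k] links scale to the mean of latent capability k (assumed linear)
    loadings:   loadings[b][k] is the weight of capability k on benchmark b
    """
    rng = random.Random(seed)
    all_scores = []
    for s in log_scales:
        # Structural part: scale informs each latent capability,
        # with model-specific variation that scale does not explain.
        caps = [slope * s + rng.gauss(0.0, cap_sd) for slope in slopes]
        # Measurement part: each benchmark reflects the capabilities
        # only up to measurement error.
        row = []
        for weights in loadings:
            signal = sum(w * c for w, c in zip(weights, caps))
            row.append(signal + rng.gauss(0.0, err_sd))
        all_scores.append(row)
    return all_scores


# Three hypothetical models, two latent capabilities, three benchmarks.
scores = simulate_scores(
    log_scales=[0.0, 1.0, 2.0],
    slopes=[1.0, 0.5],
    loadings=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
)
```

In this sketch, setting `cap_sd=0` collapses the model to a pure scaling law (scores depend on scale alone), while dropping the `slopes` term leaves a plain latent factor model that ignores scale. Keeping both terms is what separates scale from capabilities.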