Proposes a principled method for upscaling model width, inspired by μP (maximal update parametrization), with theory guaranteeing that the upscaled model is equivalent to its widened version. Extends μTransfer so that hyperparameters carry over to the larger model, avoiding costly retuning at larger sizes. Applies to a range of architectures and optimizers, supported by infinite-width analysis.
Key Points
1. General upscaling method
2. Hyperparameter transfer technique
3. Theoretical infinite-width guarantees
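To make "equivalence to widened versions" concrete, here is a minimal sketch of one function-preserving way to widen a network. It is an illustration under stated assumptions, not the paper's procedure: it uses a simple neuron-duplication scheme on a two-layer tanh MLP, and the names `mlp_forward` and `widen` are hypothetical.

```python
import math
import random


def mlp_forward(w1, w2, x):
    """Two-layer MLP without biases: y = w2 @ tanh(w1 @ x), on plain lists."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def widen(w1, w2, k):
    """Assumed scheme (not from the source): copy each hidden unit k times
    and divide its outgoing weights by k, so the k copies sum back to the
    original unit's contribution."""
    # Each input-to-hidden row is repeated k times, in place.
    new_w1 = [list(row) for row in w1 for _ in range(k)]
    # Each hidden-to-output weight is repeated k times and scaled by 1/k,
    # in the same order as the duplicated hidden units above.
    new_w2 = [[w / k for w in row for _ in range(k)] for row in w2]
    return new_w1, new_w2


if __name__ == "__main__":
    rng = random.Random(0)
    w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 -> 4
    w2 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 4 -> 2
    x = [0.5, -1.0, 2.0]

    big_w1, big_w2 = widen(w1, w2, k=2)  # hidden width 4 -> 8
    small_out = mlp_forward(w1, w2, x)
    big_out = mlp_forward(big_w1, big_w2, x)
    same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(small_out, big_out))
    print("outputs match:", same)
```

The wider network computes the same function as the small one at initialization. In this toy scheme the duplicated units are exact copies of each other, so a practical method would also need to break that symmetry and rescale hyperparameters for the new width, which is the part the summary attributes to the μP-inspired theory.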
Impact Analysis
Speeds up the training of large models and makes it more efficient to produce models for different inference budgets. Enables practical knowledge transfer from small models to large ones.