🤖 Reddit r/MachineLearning • collected in 2h
Why Is Muon Only for Transformers?
💡 Decode why LLM optimizer Muon skips ConvNets despite speed records
⚡ 30-Second TL;DR
What Changed
Rapid adoption of Muon specifically in LLM training workflows
Why It Matters
Highlights potential limitations of promising optimizers, guiding researchers to explore broader applications or alternatives.
What To Do Next
Search arXiv for 'Muon ConvNet' papers to investigate scalability claims.
Who should care: Researchers & Academics
🧠 Deep Insight
AI-generated analysis for this event.
📌 Enhanced Key Takeaways
- Muon (MomentUm Orthogonalized by Newton-Schulz) orthogonalizes the momentum-based weight update, conceptually replacing the update matrix with the UVᵀ factor of its singular value decomposition (SVD); because that operation is only defined for 2-D weight matrices, it does not carry over cleanly to non-Transformer architectures (a conceptual sketch follows this list).
- The optimizer's effectiveness is tied to the large, dense 2-D weight matrices that dominate Transformer layers; those properties do not translate directly to the hierarchical, spatial-feature-focused structure of convolutional neural networks, whose kernels are small 4-D tensors.
- Recent research suggests that Muon's performance gains are sensitive to the specific learning-rate schedules and batch sizes used in LLM pre-training, making it difficult to tune for the diverse training dynamics of vision models.
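To make the orthogonalization concrete, here is a minimal conceptual sketch using an explicit SVD. The function name and tensor shapes are illustrative, and real Muon implementations approximate this step rather than computing the SVD directly.

```python
import torch

def orthogonalize_update(M: torch.Tensor) -> torch.Tensor:
    """Replace M = U @ diag(S) @ Vh with U @ Vh, keeping only the
    'rotation' part of the update and discarding the singular values.
    Conceptual reference only; production Muon code uses a Newton-Schulz
    approximation instead of an explicit SVD."""
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

# A Transformer hidden layer is naturally a 2-D matrix, so the operation is
# well-defined there; a conv kernel of shape (out, in, kh, kw) is not,
# without first reshaping it.
momentum = torch.randn(4096, 1024)        # hypothetical hidden-layer momentum
update = orthogonalize_update(momentum)   # same shape, semi-orthogonal
```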
📊 Competitor Analysis
| Optimizer | Primary Use Case | Key Mechanism | Scalability |
|---|---|---|---|
| AdamW | General Purpose | First/Second Moment Estimation | High |
| Muon | Transformers | Orthogonalized Momentum (Newton-Schulz) | Moderate |
| Lion | LLMs/Vision | Sign-based Update | High |
| Sophia | LLMs | Second-order Hessian Approximation | Moderate |
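To ground the "Key Mechanism" column, the sketch below contrasts a single AdamW step (first/second moment estimation) with a single Lion step (sign-based update). Hyperparameter values are illustrative defaults, not recommendations, and Sophia's Hessian-based step is omitted for brevity.

```python
import torch

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """AdamW: first/second moment estimation with decoupled weight decay."""
    m.mul_(b1).add_(g, alpha=1 - b1)              # first moment (mean of grads)
    v.mul_(b2).addcmul_(g, g, value=1 - b2)       # second moment (mean of grad^2)
    m_hat = m / (1 - b1 ** t)                     # bias correction
    v_hat = v / (1 - b2 ** t)
    w.mul_(1 - lr * wd)                           # decoupled weight decay
    w.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)

def lion_step(w, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.01):
    """Lion: take the sign of an interpolated momentum, so every coordinate
    moves by the same magnitude; the momentum buffer is updated separately."""
    update = (b1 * m + (1 - b1) * g).sign()
    w.mul_(1 - lr * wd)
    w.add_(update, alpha=-lr)
    m.mul_(b2).add_(g, alpha=1 - b2)
```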
🛠️ Technical Deep Dive
- Muon performs a momentum-based update followed by an orthogonalization step that replaces the update matrix with its nearest semi-orthogonal matrix.
- The core operation takes the momentum matrix M with SVD M = UΣVᵀ and applies UVᵀ instead, so the update acts like a rotation/reflection with all singular values equalized; in practice this is approximated with a few Newton-Schulz iterations rather than an explicit SVD (a runnable sketch follows this list).
- Its optimizer state is a single momentum buffer, so it holds less state than AdamW's two moment buffers; the real overhead is the matrix-matrix multiplies of the orthogonalization, which are only worthwhile for weight matrices large enough to justify them.
- The optimizer is typically applied only to the 2-D hidden-layer matrices of Transformers (embeddings, the output head, biases, and norms still use AdamW); the smaller, heterogeneous 4-D kernels found in ConvNets make the orthogonalization step neither natural nor efficient without reshaping.
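Below is a minimal, hedged sketch of that step: a Newton-Schulz orthogonalization followed by a Muon-style parameter update. The quintic coefficients follow the widely circulated open-source implementation and should be treated as an assumption here; the function names, learning rate, and momentum constant are illustrative.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the UVᵀ factor of G's SVD with a few Newton-Schulz
    iterations, avoiding an explicit SVD. Coefficients follow the widely
    circulated open-source Muon implementation (an assumption, not a spec)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)           # scale so the spectral norm is <= ~1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon-style update for a single 2-D weight matrix: accumulate
    momentum, orthogonalize it, then take a plain SGD-like step."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    weight.add_(update, alpha=-lr)
```

The intuition often cited for equalizing the singular values is that the update's scale becomes largely independent of how well-conditioned the momentum is, which is why a single learning rate tends to transfer across the dense layers it is applied to.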
🔮 Future Implications
AI analysis grounded in cited sources.
Muon will remain niche to Transformer-based architectures.
The reliance on matrix orthogonalization ties the update to large, dense 2-D weight matrices, a structure that fits Transformer blocks but not the layer structures of most non-Transformer models.
Future iterations will likely focus on cheaper or shape-agnostic orthogonalization schemes, since the Newton-Schulz approximation already avoids a full SVD but still assumes a 2-D matrix.
To expand beyond Transformers, developers must find ways to apply the orthogonalized update to convolutional and other non-matrix parameters, for example by reshaping kernels into 2-D matrices.
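One workaround discussed in the community (not an official Muon feature) is to flatten each conv kernel to 2-D, orthogonalize that view, and reshape back. The sketch below is hypothetical and uses an explicit SVD for clarity.

```python
import torch

def orthogonalize_conv_kernel(kernel: torch.Tensor) -> torch.Tensor:
    """Hypothetical adaptation for ConvNets: flatten an (out, in, kh, kw)
    kernel to (out, in*kh*kw), orthogonalize the 2-D view, and restore the
    original shape. Whether this preserves Muon's benefits is exactly the
    open question raised in the thread, not an established result."""
    out_channels = kernel.shape[0]
    flat = kernel.reshape(out_channels, -1)
    U, S, Vh = torch.linalg.svd(flat, full_matrices=False)
    return (U @ Vh).reshape(kernel.shape)

# Example on a hypothetical 3x3 conv kernel with 64 output / 32 input channels.
k = torch.randn(64, 32, 3, 3)
k_update = orthogonalize_conv_kernel(k)   # same shape as the original kernel
```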
⏳ Timeline
2024-10
Muon optimizer introduced by Keller Jordan and collaborators, gaining attention through NanoGPT speedrun training records.
2024-12
Initial benchmarks demonstrate Muon outperforming AdamW in training efficiency for large-scale Transformer models.
2025-05
Community reports highlight Muon's difficulty in converging on non-Transformer architectures like ResNet or ConvNeXt.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning →