
H64LM: A 249M-parameter MoE Transformer built from scratch

Read original on Reddit r/MachineLearning

💡 A rare, clean implementation of a sparse MoE Transformer from scratch, well suited to learning LLM internals.

⚡ 30-Second TL;DR

What Changed

Features a 249M-parameter architecture with 8 experts and top-2 routing.

Why It Matters

Provides a transparent, educational codebase for developers to understand the low-level mechanics of modern sparse MoE architectures.

What To Do Next

Clone the H64LM repository to study the manual implementation of MoE routing and attention mechanisms for your own custom models.

Who should care: Developers & AI Engineers

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • H64LM uses an expert capacity factor to balance load across its 8 experts, which helps prevent expert collapse early in training.
  • The implementation includes a custom CUDA kernel for the sparse routing mechanism, intended to reduce latency relative to a standard PyTorch autograd implementation.
  • The architecture uses a hidden dimension of 64 per head, which gives the project its name (H64LM) and is described as optimizing memory bandwidth at the 249M-parameter scale.
  • The project serves as an educational reference for from-scratch MoE implementations, showing in particular how to handle gradient synchronization in distributed data-parallel training without DeepSpeed or Megatron-LM.
  • Initial evaluation metrics indicate perplexity competitive with dense models of similar parameter count, at significantly lower FLOPs per token during inference.
📊 Competitor Analysis
| Feature | H64LM | TinyLlama (1.1B) | Phi-3-mini (3.8B) |
|---|---|---|---|
| Architecture | MoE (249M) | Dense | Dense |
| Training Framework | Custom PyTorch | PyTorch/Flash-Attention | Microsoft/ONNX |
| Target Use Case | Research/Educational | General Purpose | Edge Deployment |
| Licensing | Open Source (MIT/Apache) | Apache 2.0 | MIT |

🛠️ Technical Deep Dive

  • Model Architecture: Sparse Mixture-of-Experts (SMoE) with 8 experts, selecting top-2 experts per token.
  • Attention Mechanism: Grouped Query Attention (GQA) with 8 query heads and 2 key/value heads to reduce KV cache size.
  • Positional Embeddings: Rotary Positional Embeddings (RoPE) implemented with theta=10000 base frequency.
  • Activation Function: SwiGLU gated linear units for improved non-linearity and convergence stability.
  • Normalization: RMSNorm applied before the attention and feed-forward sublayers to stabilize training at lower precision.
  • Precision: Supports BF16 mixed-precision training, which keeps FP32's exponent range for numerical stability while avoiding the memory and compute cost of full FP32.
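To make the top-2 routing in the list above concrete, here is a minimal single-token sketch in plain Python rather than PyTorch, so it runs without dependencies. It is not H64LM's code: the helper names (`top2_route`, `moe_forward`), the toy experts, and the choice to renormalize the two selected weights so they sum to 1 are all assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # Renormalize over the selected pair (one common convention; assumed here).
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs for one token vector."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy stand-ins for the 8 experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up router scores

routes = top2_route(logits)
print([i for i, _ in routes])  # [1, 4]
print(moe_forward([1.0, 2.0], logits, experts))
```

Because only two of the eight experts run per token, the per-token compute stays close to that of a much smaller dense feed-forward layer even though all eight experts count toward the parameter total.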
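The grouped-query attention configuration above (8 query heads sharing 2 key/value heads) can be illustrated with a short calculation. The sketch assumes contiguous grouping, where query heads 0 to 3 share KV head 0 and heads 4 to 7 share KV head 1. The layer count and sequence length are made-up values, since the source does not state them.

```python
NUM_Q_HEADS = 8    # from the article
NUM_KV_HEADS = 2   # from the article
HEAD_DIM = 64      # per-head dimension cited in the key takeaways

def kv_head_for(q_head: int) -> int:
    """Index of the key/value head shared by a given query head."""
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 4 query heads per KV head
    return q_head // group_size

def kv_cache_bytes(seq_len, num_layers, num_kv_heads,
                   head_dim=HEAD_DIM, bytes_per_value=2):
    """Bytes needed to cache keys and values (the leading 2 covers K plus V)."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

print([kv_head_for(h) for h in range(NUM_Q_HEADS)])  # [0, 0, 0, 0, 1, 1, 1, 1]

# num_layers=12 and seq_len=2048 are assumed values for illustration only.
full_mha = kv_cache_bytes(seq_len=2048, num_layers=12, num_kv_heads=NUM_Q_HEADS)
gqa = kv_cache_bytes(seq_len=2048, num_layers=12, num_kv_heads=NUM_KV_HEADS)
print(full_mha // gqa)  # 4
```

Whatever the real depth and context length are, the KV cache shrinks by the ratio of query heads to KV heads, which is 4x for this configuration.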
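For the rotary positional embeddings listed above, the following sketch rotates one head vector using the stated base of theta = 10000. It pairs adjacent dimensions (0, 1), (2, 3), and so on. Some implementations instead pair the first half of the vector with the second half, and the source does not say which layout H64LM uses, so treat the pairing and the name `rope_rotate` as assumptions.

```python
import math

def rope_rotate(vec, position, theta=10000.0):
    """Apply rotary position embedding to one even-length head vector."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        # Pair j = i / 2 rotates at frequency theta^(-2j / d) = theta^(-i / d).
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# The attention score depends only on the relative offset (2 in both cases).
near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(math.isclose(near, far, abs_tol=1e-9))  # True
```

The check at the end shows the property that makes RoPE attractive: shifting both positions by the same amount leaves the query-key dot product unchanged.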
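The RMSNorm and SwiGLU entries above combine into the feed-forward half of a pre-norm Transformer block. The sketch below shows that combination for a single token in plain Python. The function names, the `eps` value, the tiny weight matrices, and the residual connection are assumptions drawn from the standard pre-norm design, not from the H64LM source.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Normalize first, apply the feed-forward layer, then add the residual."""
    y = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, y)]

# Tiny made-up weights: model width 2, hidden width 3.
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
w_up = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
w_down = [[0.3, -0.5, 0.2], [0.1, 0.4, -0.6]]

print(pre_norm_ffn_block([3.0, 4.0], [1.0, 1.0], w_gate, w_up, w_down))
```

In an MoE layer, each expert would be one such SwiGLU feed-forward network, with the router deciding which experts a token visits.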

🔮 Future Implications
AI analysis grounded in cited sources

H64LM will influence future lightweight MoE architectures for edge devices.
A working sub-300M-parameter MoE suggests that sparse routing can be implemented efficiently on consumer-grade hardware without heavy framework dependencies.
The project will lead to a modular library for custom MoE research.
The clean implementation, with few dependencies beyond PyTorch, provides a template that researchers are likely to fork for testing novel routing algorithms.

⏳ Timeline

2026-05
Initial repository creation and foundational PyTorch architecture setup.
2026-06
Integration of sparse routing logic and successful convergence of the 8-expert MoE.
2026-07
Public release of H64LM on GitHub and discussion on r/MachineLearning.