Layer Duplication Tops Open LLM Leaderboard
Simple layer trick beat leaderboards on 2x4090s: replicate for your open LLM tweaks
30-Second TL;DR
What Changed
Duplicating one specific block of 7 middle layers in Qwen2-72B improves scores on all benchmarks
Why It Matters
Shows that modest compute can yield a leaderboard-topping open model, which lowers the barrier to state-of-the-art improvements. It also suggests that LLMs contain discrete functional circuits that survive duplication.
What To Do Next
Replicate the 7-layer duplication on Qwen2-72B using the code from the blog, then evaluate the result on the Open LLM Leaderboard.
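To make the idea concrete, here is a minimal sketch of block duplication on a stand-in layer list. It is not the blog's code: `duplicate_block` and `HYPOTHETICAL_START` are names invented for illustration, integers stand in for real transformer blocks, and the start index is a placeholder because this summary does not say which 7 layers were duplicated.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def duplicate_block(layers: Sequence[T], start: int, count: int) -> List[T]:
    """Return a new layer order in which layers[start:start+count] runs twice in a row."""
    if start < 0 or count <= 0 or start + count > len(layers):
        raise ValueError("block must lie inside the layer stack")
    block = list(layers[start:start + count])
    # Original stack up to the end of the block, then the block again, then the rest.
    return list(layers[:start + count]) + block + list(layers[start + count:])


# Stand-in for an 80-layer stack: integers instead of real transformer blocks.
original = list(range(80))

# Assumption: placeholder index only; the source does not state which layers were used.
HYPOTHETICAL_START = 36

expanded = duplicate_block(original, HYPOTHETICAL_START, 7)
print(len(expanded))  # 87 layers: 80 original plus 7 repeated
```

With a real checkpoint the same reordering would be applied to the model's list of decoder blocks, and the duplicated blocks would reuse the original weights, so no training is involved in the duplication step itself.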
Deep Insight
Web-grounded analysis with 7 cited sources.
Enhanced Key Takeaways
- The layer duplication technique was first shared publicly by researcher dnhkng on Hacker News, who described developing it on two consumer-grade RTX 4090 GPUs in a home setup.[4]
- A variant model, Solshine/Qwen2.5-142B-Doubled72B-Math-Instruct, applies full-layer doubling by alternating layers from Qwen2.5-72B-Instruct and Qwen2.5-Math-72B, including MLP adjustments.[1]
- The Qwen2-72B base architecture has 80 transformer layers, a hidden dimension of 8,192, 64 query heads with 8 KV heads (GQA), SwiGLU activation, RMSNorm, and RoPE embeddings.[2][3]
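The full-layer doubling variant mentioned above can be pictured as interleaving two equally deep stacks. The sketch below is only an illustration under assumptions: `interleave_layers` is a hypothetical helper, the tuples stand in for real layers, and the ordering (instruct layer first, then math layer) is a guess, since the source does not specify it or describe the MLP adjustments.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[str, int]  # (source model label, layer index), a stand-in for a real block


def interleave_layers(a: Sequence[Layer], b: Sequence[Layer]) -> List[Layer]:
    """Alternate layers from two stacks of equal depth: a0, b0, a1, b1, ..."""
    if len(a) != len(b):
        raise ValueError("both stacks must have the same number of layers")
    merged: List[Layer] = []
    for layer_a, layer_b in zip(a, b):
        merged.extend([layer_a, layer_b])
    return merged


# Two 80-layer stand-in stacks, labeled by the model they would come from.
instruct = [("instruct", i) for i in range(80)]
math_model = [("math", i) for i in range(80)]

merged = interleave_layers(instruct, math_model)
print(len(merged))  # 160 layers in the doubled stack
```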
Technical Deep Dive
- Qwen2.5-72B is a decoder-only Transformer with 80 layers, hidden size 8,192, Grouped Query Attention (64 query / 8 key-value heads, per-head dimension 128), and an MLP inner dimension of 29,568.[2][3]
- Training used phased sequence lengths: 4,096 tokens for 80% of steps, then 32,768 tokens for the remaining 20%; 5% linear warmup, cosine decay to zero learning rate, and a batch of 1M tokens (~80 seqs of 32k).[2]
- The model uses pre-normalization with RMSNorm for stability, QKV bias for length extrapolation, and RoPE with a tunable base (10k to 1M in later stages), for a total of ~72B parameters (70B non-embedding).[2][3]
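The schedule described in the training bullet (linear warmup over the first 5% of steps, cosine decay to zero, and a switch from 4,096 to 32,768 tokens after 80% of steps) can be sketched as follows. This is an illustrative reading of that bullet, not the actual training code: the function names, the step count, and the peak learning rate are assumptions chosen for the example.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.05) -> float:
    """Linear warmup to peak_lr, then cosine decay that reaches zero at total_steps."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


def seq_len_at(step: int, total_steps: int, short: int = 4096,
               long: int = 32768, short_frac: float = 0.8) -> int:
    """Short sequences for the first short_frac of steps, long sequences afterwards."""
    return short if step < round(total_steps * short_frac) else long


# Assumptions for illustration only: 1,000 steps and a peak learning rate of 1.0.
TOTAL_STEPS = 1000
PEAK_LR = 1.0

for step in (0, 49, 500, 799, 800, 1000):
    print(step, round(lr_at(step, TOTAL_STEPS, PEAK_LR), 4), seq_len_at(step, TOTAL_STEPS))
```

With these example values the warmup ends at step 50 and the sequence length switches at step 800.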
Future Implications
AI analysis grounded in cited sources
Timeline
Sources (7)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/MachineLearning