
Layer Duplication Tops Open LLM Leaderboard

Read original on Reddit r/MachineLearning

Simple layer trick topped leaderboards on 2x RTX 4090s; replicate it for your own open LLM tweaks

30-Second TL;DR

What Changed

Duplicating one specific block of 7 middle layers in Qwen2-72B boosts all benchmarks

Why It Matters

Shows that modest compute can yield leaderboard-topping open models, making SOTA improvements more widely accessible. Suggests that LLMs contain discrete functional circuits that can be preserved through duplication.

What To Do Next

Replicate the 7-layer duplication on Qwen2-72B using code from the blog to test on Open LLM Leaderboard.
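As a rough illustration of what "duplicating a block of middle layers" means for the layer stack, the sketch below builds the execution order for an 80-layer model in which one contiguous 7-layer block runs twice. The function name `duplicated_layer_order` and the `start=40` index are hypothetical; the summary does not state which 7 layers were duplicated, so treat the start index as a placeholder rather than the published choice.

```python
from typing import List


def duplicated_layer_order(num_layers: int, start: int, block_len: int) -> List[int]:
    """Return layer indices for a stack in which one contiguous block runs twice."""
    if start < 0 or block_len <= 0 or start + block_len > num_layers:
        raise ValueError("block must lie inside the layer stack")
    end = start + block_len
    block = list(range(start, end))
    # Original layers up to the end of the block, the block again, then the rest.
    return list(range(end)) + block + list(range(end, num_layers))


# 80 layers matches the summary; start=40 is an assumed placeholder.
order = duplicated_layer_order(num_layers=80, start=40, block_len=7)
print(len(order))       # 87 layers after duplication
print(order[38:56])     # window around the repeated block
```

In a real checkpoint, the same index list could drive a rebuild of the decoder's layer list, assuming the model exposes its blocks as an ordered sequence; the blog's own code remains the reference for the actual procedure.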

Who should care: Researchers & Academics

Deep Insight

Web-grounded analysis with 7 cited sources.

Enhanced Key Takeaways

  • The layer duplication technique was first publicly shared by researcher dnhkng on Hacker News, revealing development on consumer-grade 2x RTX 4090 GPUs in a home setup.[4]
  • A variant model, Solshine/Qwen2.5-142B-Doubled72B-Math-Instruct, applies full-layer doubling by alternating layers from Qwen2.5-72B-Instruct and Qwen2.5-Math-72B, including MLP adjustments.[1]
  • The Qwen2-72B base architecture features 80 transformer layers, a hidden dimension of 8,192, 64 query heads with 8 KV heads using GQA, SwiGLU nonlinearity, RMSNorm, and RoPE embeddings.[2][3]
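The architecture numbers can be sanity-checked against the roughly 70B non-embedding parameter figure reported for the model. The sketch below counts decoder weights for a GQA + SwiGLU transformer. It assumes a hidden size of 8,192, a per-head dimension of 128, an MLP inner dimension of 29,568, and biases on the Q/K/V projections only; these values are taken as assumptions from the publicly released model configuration, not from the summary itself, and `non_embedding_params` is a hypothetical helper.

```python
def non_embedding_params(layers, hidden, q_heads, kv_heads, head_dim, mlp_inner):
    """Count weights in the decoder stack of a GQA + SwiGLU transformer."""
    q_dim = q_heads * head_dim
    kv_dim = kv_heads * head_dim
    q = hidden * q_dim + q_dim              # Q projection, weight + bias
    kv = 2 * (hidden * kv_dim + kv_dim)     # K and V projections, weight + bias
    o = q_dim * hidden                      # output projection, no bias (assumed)
    mlp = 3 * hidden * mlp_inner            # gate, up, and down projections
    norms = 2 * hidden                      # two RMSNorm scale vectors per layer
    per_layer = q + kv + o + mlp + norms
    return layers * per_layer + hidden, per_layer   # + final RMSNorm


# Assumed configuration values (see lead-in).
total, per_layer = non_embedding_params(
    layers=80, hidden=8192, q_heads=64, kv_heads=8, head_dim=128, mlp_inner=29568
)
print(f"non-embedding params: {total / 1e9:.2f}B")
print(f"7 duplicated layers add: {7 * per_layer / 1e9:.2f}B")
```

Under these assumptions the count lands close to 70B, and duplicating 7 layers adds roughly 6B parameters on top of that.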

๐Ÿ› ๏ธ Technical Deep Dive

  • Qwen2.5-72B employs a decoder-only Transformer with 80 layers, hidden size 8,192, Grouped Query Attention (64 query / 8 key-value heads, per-head dimension 128), and MLP inner dimension 29,568.[2][3]
  • Training used phased sequence lengths: 4,096 tokens for 80% of steps, then 32,768 tokens for the remaining 20%. The learning rate used a linear warmup over 5% of steps followed by cosine decay to zero, with a batch of 1M tokens (about 32 sequences of 32k).[2]
  • The model uses pre-normalization with RMSNorm for stability, QKV bias for length extrapolation, and RoPE with a tunable base (raised from 10k to 1M in later stages), for a total of ~72B parameters (~70B non-embedding).[2][3]
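The learning-rate schedule described above (linear warmup over 5% of steps, then cosine decay to zero) can be sketched as follows. This is a minimal illustration, not the training code: `lr_at`, `peak_lr`, and the 1,000-step horizon are hypothetical, since the summary gives neither the peak learning rate nor the step count.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.05) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical values chosen only to show the shape of the curve.
PEAK_LR = 1e-4
TOTAL_STEPS = 1000
for s in (0, 25, 50, 525, 1000):
    print(s, lr_at(s, TOTAL_STEPS, PEAK_LR))
```

The rate reaches its peak at the end of warmup (step 50 here), passes through half the peak midway through the decay phase, and ends at zero.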

Future Implications

AI analysis grounded in cited sources

Layer duplication is likely to reach Qwen3.5 27B/35A3B as RYS versions
Researcher dnhkng reports running current models such as Qwen3.5 on dual GH200 hardware and plans to release code and new models soon, which indicates the work is being actively extended.[4]
Technique suggests discrete 7-layer functional circuits in pretrained models
Only blocks of about 7 layers improve performance, while duplicating single layers or mismatched blocks degrades it, which implies that pretraining forms modular circuits that need to be kept intact.[4]

โณ Timeline

2023-11
Qwen-72B and Qwen-72B-Chat released, trained on 3T tokens with 32k context support.
2024-09
Qwen2.5-72B-Instruct released with 32.8k context window and advanced features like function calling.
2026-03
Researcher dnhkng shares on Hacker News the Qwen2-72B 7-middle-layer duplication that topped the Open LLM Leaderboard.