Layer Duplication Tops Open LLM Leaderboard

🤖Read original on Reddit r/MachineLearning

#model-hacking #layer-duplication #open-weights #open-llm-leaderboard

💡A simple layer trick topped the leaderboard using 2x RTX 4090s; replicate it for your own open LLM tweaks

⚡ 30-Second TL;DR

What Changed

Duplicating one specific block of 7 middle layers in Qwen2-72B boosts all benchmarks

Why It Matters

Shows that modest compute can yield leaderboard-topping open models, democratizing SOTA improvements. Suggests that LLMs contain discrete functional circuits that stay intact when duplicated as a block.

What To Do Next

Replicate the 7-layer duplication on Qwen2-72B using code from the blog to test on Open LLM Leaderboard.
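
As a rough illustration of what "duplicating a block of middle layers" means, the sketch below computes the order in which layers would run once a 7-layer block is repeated. It is not the blog's code: the function name and the start index are hypothetical placeholders, and the summary here does not say which 7 layers were actually duplicated.

```python
def duplicated_layer_order(num_layers: int, start: int, block_size: int) -> list[int]:
    """Return the execution order of layer indices after repeating one block.

    The block [start, start + block_size) runs twice in a row; every other
    layer runs once, in its original position.
    """
    end = start + block_size
    if start < 0 or block_size <= 0 or end > num_layers:
        raise ValueError("block must lie inside the layer stack")
    original = list(range(num_layers))
    # Layers up to the end of the block, then the block again, then the rest.
    return original[:end] + original[start:end] + original[end:]


NUM_LAYERS = 80          # Qwen2-72B decoder depth
HYPOTHETICAL_START = 40  # placeholder only, NOT the block from the original post
order = duplicated_layer_order(NUM_LAYERS, HYPOTHETICAL_START, block_size=7)
print(len(order))  # 87 layer executions instead of 80
```

With Hugging Face Transformers, Qwen2 checkpoints expose the decoder stack as `model.model.layers`, so one possible route is to rebuild that list in this order and update the layer count in the config. Details such as per-layer cache indices would still need care, so treat this as a starting point rather than a recipe.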

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 7 cited sources.

🔑 Enhanced Key Takeaways

•The layer duplication technique was first publicly shared on Hacker News by researcher dnhkng, who described developing it on two consumer-grade RTX 4090 GPUs in a home setup.[4]
•A variant model, Solshine/Qwen2.5-142B-Doubled72B-Math-Instruct, applies full-layer doubling by alternating layers from Qwen2.5-72B-Instruct and Qwen2.5-Math-72B, including MLP adjustments.[1]
•The Qwen2-72B base architecture has 80 transformer layers, a hidden dimension of 8,192, and 64 query heads sharing 8 KV heads via GQA, with SwiGLU activations, RMSNorm, and RoPE embeddings.[2][3]
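
The alternating-layer variant mentioned in the takeaways above can be pictured as an interleaving of two 80-layer stacks. The sketch below only builds the layer plan; the source labels, the function name, and the choice of which model contributes the first layer are assumptions for illustration, and the MLP adjustments mentioned for that model are not covered.

```python
def interleaved_order(num_layers: int, sources: tuple[str, str]) -> list[tuple[str, int]]:
    """Alternate layer i from the first source with layer i from the second."""
    plan = []
    for layer_idx in range(num_layers):
        for source in sources:
            plan.append((source, layer_idx))
    return plan


# Hypothetical labels standing in for the two 72B checkpoints.
plan = interleaved_order(80, ("instruct", "math"))
print(len(plan))  # 160 layers, twice the original depth
print(plan[:4])
```

Unlike the 7-layer duplication, this doubles the depth of the whole stack, which is consistent with the much larger parameter count in the variant's name.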

🛠️ Technical Deep Dive

•Qwen2.5-72B is a decoder-only Transformer with 80 layers, hidden size 8,192, Grouped Query Attention (64 query / 8 key-value heads, per-head dim 128), and MLP inner dim 29,568.[2][3]
•Training used phased sequence lengths: 4,096 tokens for 80% of steps, then 32,768 tokens for 20%; warmup 5% linear, cosine decay to zero LR, batch 1M tokens (~80 seqs of 32k).[2]
•The model uses pre-normalization with RMSNorm for stability, QKV bias for length extrapolation, and RoPE with a tunable base (10k to 1M in later stages); total ~72B params (70B non-embedding).[2][3]
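
The training schedule described in the bullets above (5% linear warmup, cosine decay to zero, and a switch from 4,096 to 32,768 tokens after 80% of steps) can be written down in a few lines. This is a minimal sketch of that description only: the function names and the peak learning rate used in the example are assumptions, not values from the cited sources.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_frac: float = 0.05) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def seq_len_at(step: int, total_steps: int) -> int:
    """4,096-token sequences for the first 80% of steps, 32,768 afterwards."""
    # Integer comparison avoids floating-point edge cases at the boundary.
    return 4096 if step * 5 < total_steps * 4 else 32768


TOTAL_STEPS = 1000  # illustrative value
PEAK_LR = 1.0       # hypothetical peak, chosen so outputs read as fractions
for step in (0, 25, 50, 800, 1000):
    print(step, round(lr_at(step, TOTAL_STEPS, PEAK_LR), 4), seq_len_at(step, TOTAL_STEPS))
```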

🔮 Future Implications

AI analysis grounded in cited sources

Layer duplication is expected to reach Qwen3.5 27B/35A3B as RYS versions

Researcher dnhkng reports running current models such as Qwen3.5 on dual GH200 hardware and plans to release code and new models soon, which indicates the work is being actively extended.[4]

Technique points to discrete ~7-layer functional circuits in pretrained models

Only blocks of about 7 layers improve performance, while duplicating single layers or misaligned blocks degrades it. This implies that pretraining forms modular circuits that must be duplicated intact to help.[4]

⏳ Timeline

2023-11

Qwen-72B and Qwen-72B-Chat released, trained on 3T tokens and supporting 32k context.

2024-09

Qwen2.5-72B-Instruct released with 32.8k context window and advanced features like function calling.

2026-03

Researcher dnhkng shares on Hacker News the Qwen2-72B 7-middle-layer duplication that topped the Open LLM Leaderboard.

📎 Sources (7)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
