📦 Reddit r/LocalLLaMA • collected 9h ago
LFM2-24B-A2B Hits 2x Speed on Strix Halo

💡 24B MoE model runs almost 2x faster than peers on AMD Strix Halo; benchmark your setup now.
⚡ 30-Second TL;DR
What Changed
Almost 2x faster than gpt-oss-20b on Strix Halo
Why It Matters
Highlights AMD hardware's potential for efficient large-model inference and could shift where local LLM deployments land.
What To Do Next
Benchmark LFM2-24B-A2B under ROCm on your Strix Halo and compare decode speeds.
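A minimal way to run that benchmark, sketched with the llama-cpp-python bindings on top of a ROCm/HIP build of llama.cpp; the GGUF filename, quantization, and prompt are placeholder assumptions, not confirmed release artifacts:

```python
# Coarse decode-throughput probe via llama-cpp-python
# (pip install llama-cpp-python, compiled against a ROCm/HIP-enabled llama.cpp).
import time
from llama_cpp import Llama

MODEL_PATH = "LFM2-24B-A2B-Q4_K_M.gguf"  # hypothetical quant filename; point at your download

llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=4096, verbose=False)

start = time.perf_counter()
out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=256)
elapsed = time.perf_counter() - start  # includes prefill, fine for a coarse comparison

n_gen = out["usage"]["completion_tokens"]
print(f"{n_gen} tokens in {elapsed:.2f}s -> {n_gen / elapsed:.1f} tok/s")
```

Running the same script against gpt-oss-20b at a comparable quantization gives the apples-to-apples ratio the thread is claiming.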
Who should care: Developers & AI Engineers
🧠 Deep Insight
Web-grounded analysis with 4 cited sources.
📊 Enhanced Key Takeaways
- LFM2 models use a hybrid layout that replaces most quadratic-scaling softmax-attention blocks with short convolutions, at roughly a 1:3 attention-to-convolution ratio, sharply reducing KV cache memory overhead[1]
- The A2B suffix marks the sparse Mixture of Experts (MoE) design: only 2.3 billion of the 24 billion total parameters activate per token, delivering reasoning depth comparable to much larger models at 2B-parameter-class latency (see the routing sketch after this list)[1]
- Liquid AI reports a 3x improvement in training efficiency for LFM2 over the previous LFM generation, making it a cost-effective foundation for building general-purpose AI systems[2]
- AMD's Ryzen AI Halo mini-PC pairs a unified memory architecture (256-bit interface, up to 96GB addressable by the GPU) with full ROCm support, positioning it as a compact alternative to NVIDIA's DGX Spark at significantly lower pricing[4]
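To make the sparse-activation takeaway concrete, here is a toy top-k MoE forward pass in plain NumPy; the expert count, hidden size, and k are illustrative assumptions, not LFM2's published configuration:

```python
# Toy sparse-MoE forward pass: only the top-k experts run for each token,
# so per-token compute scales with active parameters, not total parameters.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2  # illustrative sizes only

W_gate = rng.standard_normal((d, n_experts))                       # router weights
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # one matrix per expert

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ W_gate                            # router scores, shape (n_experts,)
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                   # softmax over the selected experts only
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))
print(f"output shape {y.shape}; ran {k} of {n_experts} experts for this token")
```

Because only k experts execute, weight traffic per token tracks the ~2.3B active parameters rather than the 24B total, which is where the 2B-class latency comes from.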
📈 Competitor Analysis
| Feature | LFM2-24B-A2B | Qwen3-30B-A3B | gpt-oss-20b |
|---|---|---|---|
| Throughput (single H100) | 26.8K tokens/sec | Lower (outperformed) | Lower (outperformed) |
| Active Parameters | 2.3B / 24B total | ~3.3B / 30B total | ~3.6B / 21B total |
| Architecture | Hybrid attention-convolution MoE | MoE Transformer | MoE Transformer |
| CPU Performance | 2x faster decode/prefill vs. Qwen3 | Baseline | Baseline |
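The throughput gap is easiest to see with a back-of-envelope model: single-token decode on memory-bound hardware is roughly capped at bandwidth divided by the bytes of weights read per token. The ~256 GB/s Strix Halo figure, the Q4-class bytes-per-parameter, and the rivals' active-parameter counts below are assumptions for illustration:

```python
# Rough decode ceiling on bandwidth-bound hardware:
# tokens/s ~= memory_bandwidth / (active_params * bytes_per_param).
BANDWIDTH = 256e9        # bytes/s: approx. 256-bit LPDDR5X on Strix Halo (assumed)
BYTES_PER_PARAM = 0.57   # approx. 4.5 bits/weight, Q4_K_M-like quantization (assumed)

active_params = {        # active parameters per token (approximate public figures)
    "LFM2-24B-A2B":  2.3e9,
    "Qwen3-30B-A3B": 3.3e9,
    "gpt-oss-20b":   3.6e9,
}

for name, active in active_params.items():
    ceiling = BANDWIDTH / (active * BYTES_PER_PARAM)
    print(f"{name:14s} ~{ceiling:5.0f} tok/s decode ceiling")
```

The ratios between the ceilings, not the absolute numbers, are the meaningful output, and they point in the same direction as the ~2x advantage the benchmarks report.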
🛠️ Technical Deep Dive
- Hybrid Architecture: LFM2 combines 10 double-gated short-range LIV (linear input-varying) convolution blocks with 6 attention blocks (16 blocks total), replacing the all-attention stack of traditional Transformers[2]
- Attention-to-Convolution Ratio: at 1:3, the quadratic O(N²) cost of softmax attention applies to only a quarter of the blocks, dramatically reducing KV cache memory requirements (sized in the sketch after this list)[1]
- Sparse MoE Design: only 2.3B of the 24B parameters activate per token, giving inference latency and energy efficiency comparable to 2B-class models while retaining the reasoning capability of larger ones[1]
- Context Window: 32k tokens, with near-linear scaling on long-context tasks[1]
- Training Infrastructure: 3x improvement in training efficiency over LFM v1, reducing computational cost for model development[2]
- CPU Optimization: dominates the Pareto frontier for both prefill and decode speed on CPU via ExecuTorch and llama.cpp; LFM2-700M outperforms Qwen3-0.6B despite being ~16% larger[2]
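The KV-cache saving can be sized with simple arithmetic. The sketch below compares a hypothetical all-attention 24-block stack against an LFM2-style hybrid that caches K/V in only 6 blocks; the head count and head dimension are illustrative assumptions, since the sources don't publish LFM2-24B-A2B's exact attention shape:

```python
# KV cache bytes = 2 (K and V) * attn_layers * kv_heads * head_dim * seq_len * bytes.
# Convolution blocks keep only a short fixed-size state, so they drop out of this term.
def kv_cache_gib(attn_layers: int, kv_heads: int = 8, head_dim: int = 128,
                 seq_len: int = 32_768, bytes_per: int = 2) -> float:
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per / 2**30

full = kv_cache_gib(attn_layers=24)   # hypothetical all-attention baseline
hybrid = kv_cache_gib(attn_layers=6)  # hybrid: only attention blocks cache K/V
print(f"all-attention: {full:.2f} GiB, hybrid: {hybrid:.2f} GiB "
      f"({hybrid / full:.0%} of the baseline at 32k context)")
```

In this toy configuration the hybrid holds a quarter of the baseline's cache at 32k context, which is the memory mechanism behind the near-linear long-context behavior cited above.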
🔮 Future Implications
AI analysis grounded in cited sources.
Sparse MoE + hybrid architectures will become standard for edge AI deployment
LFM2's 2.3B active parameters delivering near-24B-class reasoning suggests the industry will shift from dense models to conditional computation for on-device inference.
AMD Ryzen AI Halo will compete directly with NVIDIA's GPU-centric edge AI strategy
Full ROCm support and significantly lower pricing than the DGX Spark position AMD to capture enterprise and developer markets that prioritize cost-efficiency over raw throughput.
Convolution-attention hybrids will reduce memory bottlenecks in long-context applications
LFM2's sharp reduction of quadratic KV-cache scaling makes 32k+ token windows practical on consumer hardware, unlocking document-processing and multi-turn agent use cases.
⏳ Timeline
2026-02
Liquid AI releases the LFM2 family with a hybrid attention-convolution architecture and sparse MoE design
2026-02
LFM2-24B-A2B achieves 26.8K tokens/sec throughput on a single H100, outperforming Qwen3-30B-A3B and gpt-oss-20b
2026-02
AMD Ryzen AI Halo mini-PC announced with 256-bit unified memory interface, full ROCm support, and compact form factor
📚 Sources (4)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- [1] marktechpost.com – "Liquid AI's New LFM2-24B-A2B: Hybrid Architecture Blends Attention with Convolutions to Solve the Scaling Bottlenecks of Modern LLMs"
- [2] liquid.ai – "Liquid Foundation Models v2: Our Second Series of Generative AI Models"
- [3] GitHub – "Strix Halo LLM Perf"
- [4] hothardware.com – "AMD Ryzen AI Halo Mini PC"
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA →
