📦 Reddit r/LocalLLaMA • collected 9h ago
LFM2-24B-A2B Hits 2x Speed on Strix Halo

💡 24B MoE model runs almost 2x faster than peers on AMD Strix Halo; benchmark your setup now.
⚡ 30-Second TL;DR
What Changed
Almost 2x faster than gpt-oss-20b on Strix Halo
Why It Matters
Highlights AMD hardware's potential for efficient large-model inference and could shift where local LLM deployments land.
What To Do Next
Benchmark LFM2-24B-A2B under ROCm on your Strix Halo and compare decode speeds.
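A minimal way to run that benchmark, sketched with the llama-cpp-python bindings on top of a ROCm/HIP build of llama.cpp; the GGUF filename, quantization, and prompt are placeholder assumptions, not confirmed release artifacts:

```python
# Coarse decode-throughput probe via llama-cpp-python
# (pip install llama-cpp-python, compiled against a ROCm/HIP-enabled llama.cpp).
import time
from llama_cpp import Llama

MODEL_PATH = "LFM2-24B-A2B-Q4_K_M.gguf"  # hypothetical quant filename; point at your download

llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=4096, verbose=False)

start = time.perf_counter()
out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=256)
elapsed = time.perf_counter() - start  # includes prefill, fine for a coarse comparison

n_gen = out["usage"]["completion_tokens"]
print(f"{n_gen} tokens in {elapsed:.2f}s -> {n_gen / elapsed:.1f} tok/s")
```

Running the same script against gpt-oss-20b at a comparable quantization gives the apples-to-apples ratio the thread is claiming.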
Who should care: Developers & AI Engineers
🧠 Deep Insight
Web-grounded analysis with 4 cited sources.
📊 Enhanced Key Takeaways
- LFM2 models use a hybrid layout that replaces most quadratic-scaling softmax-attention blocks with short convolutions, at roughly a 1:3 attention-to-convolution ratio, sharply reducing KV cache memory overhead[1]
- The A2B suffix marks the sparse Mixture of Experts (MoE) design: only 2.3 billion of the 24 billion total parameters activate per token, delivering reasoning depth comparable to much larger models at 2B-parameter-class latency (see the routing sketch after this list)[1]
- Liquid AI reports a 3x improvement in training efficiency for LFM2 over the previous LFM generation, making it a cost-effective foundation for building general-purpose AI systems[2]
- AMD's Ryzen AI Halo mini-PC pairs a unified memory architecture (256-bit interface, up to 96GB addressable by the GPU) with full ROCm support, positioning it as a compact alternative to NVIDIA's DGX Spark at significantly lower pricing[4]
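To make the sparse-activation takeaway concrete, here is a toy top-k MoE forward pass in plain NumPy; the expert count, hidden size, and k are illustrative assumptions, not LFM2's published configuration:

```python
# Toy sparse-MoE forward pass: only the top-k experts run for each token,
# so per-token compute scales with active parameters, not total parameters.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2  # illustrative sizes only

W_gate = rng.standard_normal((d, n_experts))                       # router weights
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # one matrix per expert

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ W_gate                            # router scores, shape (n_experts,)
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                   # softmax over the selected experts only
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))
print(f"output shape {y.shape}; ran {k} of {n_experts} experts for this token")
```

Because only k experts execute, weight traffic per token tracks the ~2.3B active parameters rather than the 24B total, which is where the 2B-class latency comes from.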
📈 Competitor Analysis
| Feature | LFM2-24B-A2B | Qwen3-30B-A3B | gpt-oss-20b |
|---|---|---|---|
| Throughput (single H100) | 26.8K tokens/sec | Lower (outperformed) | Lower (outperformed) |
| Active Parameters | 2.3B / 24B total | ~3.3B / 30B total | ~3.6B / 21B total |
| Architecture | Hybrid attention-convolution MoE | MoE Transformer | MoE Transformer |
| CPU Performance | 2x faster decode/prefill vs. Qwen3 | Baseline | Baseline |
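The throughput gap is easiest to see with a back-of-envelope model: single-token decode on memory-bound hardware is roughly capped at bandwidth divided by the bytes of weights read per token. The ~256 GB/s Strix Halo figure, the Q4-class bytes-per-parameter, and the rivals' active-parameter counts below are assumptions for illustration:

```python
# Rough decode ceiling on bandwidth-bound hardware:
# tokens/s ~= memory_bandwidth / (active_params * bytes_per_param).
BANDWIDTH = 256e9        # bytes/s: approx. 256-bit LPDDR5X on Strix Halo (assumed)
BYTES_PER_PARAM = 0.57   # approx. 4.5 bits/weight, Q4_K_M-like quantization (assumed)

active_params = {        # active parameters per token (approximate public figures)
    "LFM2-24B-A2B":  2.3e9,
    "Qwen3-30B-A3B": 3.3e9,
    "gpt-oss-20b":   3.6e9,
}

for name, active in active_params.items():
    ceiling = BANDWIDTH / (active * BYTES_PER_PARAM)
    print(f"{name:14s} ~{ceiling:5.0f} tok/s decode ceiling")
```

The ratios between the ceilings, not the absolute numbers, are the meaningful output, and they point in the same direction as the ~2x advantage the benchmarks report.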
🛠️ Technical Deep Dive
- Hybrid Architecture: LFM2 combines 10 double-gated short-range LIV (linear input-varying) convolution blocks with 6 attention blocks (16 blocks total), replacing the all-attention stack of traditional Transformers[2]
- Attention-to-Convolution Ratio: at 1:3, the quadratic O(N²) cost of softmax attention applies to only a quarter of the blocks, dramatically reducing KV cache memory requirements (sized in the sketch after this list)[1]
- Sparse MoE Design: only 2.3B of the 24B parameters activate per token, giving inference latency and energy efficiency comparable to 2B-class models while retaining the reasoning capability of larger ones[1]
- Context Window: 32k tokens, with near-linear scaling on long-context tasks[1]
- Training Infrastructure: 3x improvement in training efficiency over LFM v1, reducing computational cost for model development[2]
- CPU Optimization: dominates the Pareto frontier for both prefill and decode speed on CPU via ExecuTorch and llama.cpp; LFM2-700M outperforms Qwen3-0.6B despite being ~16% larger[2]
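The KV-cache saving can be sized with simple arithmetic. The sketch below compares a hypothetical all-attention 24-block stack against an LFM2-style hybrid that caches K/V in only 6 blocks; the head count and head dimension are illustrative assumptions, since the sources don't publish LFM2-24B-A2B's exact attention shape:

```python
# KV cache bytes = 2 (K and V) * attn_layers * kv_heads * head_dim * seq_len * bytes.
# Convolution blocks keep only a short fixed-size state, so they drop out of this term.
def kv_cache_gib(attn_layers: int, kv_heads: int = 8, head_dim: int = 128,
                 seq_len: int = 32_768, bytes_per: int = 2) -> float:
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per / 2**30

full = kv_cache_gib(attn_layers=24)   # hypothetical all-attention baseline
hybrid = kv_cache_gib(attn_layers=6)  # hybrid: only attention blocks cache K/V
print(f"all-attention: {full:.2f} GiB, hybrid: {hybrid:.2f} GiB "
      f"({hybrid / full:.0%} of the baseline at 32k context)")
```

In this toy configuration the hybrid holds a quarter of the baseline's cache at 32k context, which is the memory mechanism behind the near-linear long-context behavior cited above.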
🔮 Future Implications
AI analysis grounded in cited sources.
Sparse MoE + hybrid architectures will become standard for edge AI deployment
LFM2's 2.3B active parameters delivering near-24B-class reasoning suggests the industry will shift from dense models to conditional computation for on-device inference.
AMD Ryzen AI Halo will compete directly with NVIDIA's GPU-centric edge AI strategy
Full ROCm support and significantly lower pricing than the DGX Spark position AMD to capture enterprise and developer markets that prioritize cost-efficiency over raw throughput.
Convolution-attention hybrids will reduce memory bottlenecks in long-context applications
LFM2's sharp reduction of quadratic KV-cache scaling makes 32k+ token windows practical on consumer hardware, unlocking document-processing and multi-turn agent use cases.
⏳ Timeline
2026-02
Liquid AI releases the LFM2 family with a hybrid attention-convolution architecture and sparse MoE design
2026-02
LFM2-24B-A2B achieves 26.8K tokens/sec throughput on a single H100, outperforming Qwen3-30B-A3B and gpt-oss-20b
2026-02
AMD Ryzen AI Halo mini-PC announced with 256-bit unified memory interface, full ROCm support, and compact form factor
📚 Sources (4)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
- [1] marktechpost.com – "Liquid AI's New LFM2-24B-A2B: Hybrid Architecture Blends Attention with Convolutions to Solve the Scaling Bottlenecks of Modern LLMs"
- [2] liquid.ai – "Liquid Foundation Models v2: Our Second Series of Generative AI Models"
- [3] GitHub – "Strix Halo LLM Perf"
- [4] hothardware.com – "AMD Ryzen AI Halo Mini PC"
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA →
