
LFM2-24B-A2B Hits 2x Speed on Strix Halo


💡 24B model flies 2x faster than peers on AMD Strix Halo; benchmark your setup now.

⚡ 30-Second TL;DR

What Changed

Almost 2x faster than gpt-oss-20b on Strix Halo

Why It Matters

Highlights AMD hardware potential for efficient large model inference, potentially shifting local LLM deployments.

What To Do Next

Test LFM2-24B-A2B with ROCm on your Strix Halo for speed benchmarks.
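On a ROCm build of llama.cpp, a command like `llama-bench -m <model.gguf> -p 512 -n 128` reports prefill and decode timings; turning raw timings into comparable tokens/sec figures is simple arithmetic. A minimal sketch, where every number is an illustrative placeholder rather than a measurement:

```python
# Sketch: convert raw benchmark timings (e.g. from llama.cpp's
# llama-bench) into prefill/decode throughput and a relative speedup.
# All numbers below are hypothetical placeholders, not measurements.

def throughput(n_tokens: int, seconds: float) -> float:
    """Tokens processed per second."""
    return n_tokens / seconds

# Hypothetical run: 512-token prompt prefilled in 1.6 s,
# 128 tokens decoded in 4.0 s.
prefill_tps = throughput(512, 1.6)   # prefill speed
decode_tps = throughput(128, 4.0)    # decode speed

# Relative speedup vs a baseline model benchmarked on the same machine.
baseline_decode_tps = 16.0           # hypothetical baseline decode speed
speedup = decode_tps / baseline_decode_tps

print(f"prefill: {prefill_tps:.0f} t/s, decode: {decode_tps:.0f} t/s, "
      f"speedup vs baseline: {speedup:.1f}x")
```

Comparing ratios rather than absolute numbers keeps results meaningful across different quantizations and machines.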

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 4 cited sources.

🔑 Enhanced Key Takeaways

  • LFM2 models use a hybrid Attention-to-Base (A2B) architecture with a 1:3 ratio, replacing traditional quadratic-scaling softmax attention with convolutions to reduce KV-cache memory overhead[1]
  • The LFM2-24B-A2B variant employs a sparse Mixture of Experts (MoE) design, activating only 2.3 billion of its 24 billion total parameters per token, delivering reasoning depth comparable to much larger models while maintaining 2B-parameter-class latency[1]
  • Liquid AI reports a 3x improvement in training efficiency for LFM2 over the previous LFM generation, establishing it as a cost-effective foundation for building general-purpose AI systems[2]
  • AMD's Ryzen AI Halo mini-PC features a unified memory architecture with a 256-bit interface supporting up to 96GB of GPU-accessible memory and full ROCm support, positioning it as a compact alternative to NVIDIA's DGX Spark at significantly lower pricing[4]
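The sparse MoE idea in the takeaways above can be sketched in a few lines: a router scores all experts per token, but only the top-k experts actually run, so only a small fraction of total parameters is active per token. Expert count, k, and dimensions below are illustrative toy sizes, not LFM2's published configuration:

```python
# Toy sketch of sparse MoE routing: only k of n_experts expert
# matrices are applied per token. Sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

n_experts, k, d = 8, 2, 16          # illustrative sizes
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_forward(x: np.ndarray):
    """Route one token through the top-k experts only."""
    logits = x @ router
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected k
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
    return out, sorted(top.tolist())

x = rng.standard_normal(d)
y, active = moe_forward(x)
print(f"active experts: {active} ({k}/{n_experts} = "
      f"{k / n_experts:.0%} of expert parameters)")
```

The compute and memory-bandwidth cost per token tracks the active parameter count (here 2 of 8 experts), which is why a 24B-total model can decode at roughly 2B-class speed.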
📊 Competitor Analysis

| Feature | LFM2-24B-A2B | Qwen3-30B-A3B | gpt-oss-20b (OpenAI) |
| --- | --- | --- | --- |
| Throughput (H100) | 26.8K tokens/sec | Lower (outperformed) | Lower (outperformed) |
| Active parameters | 2.3B / 24B total | ~3B / 30B total | ~3.6B / 21B total |
| Architecture | Hybrid attention-convolution, sparse MoE | Sparse MoE Transformer | Sparse MoE Transformer |
| CPU performance | 2x faster decode/prefill vs Qwen3 | Baseline | Baseline |

๐Ÿ› ๏ธ Technical Deep Dive

  • Hybrid Architecture: LFM2 combines 10 double-gated short-range linear input-varying (LIV) convolution blocks with 6 attention blocks (16 total blocks), replacing the all-attention stack of traditional Transformers[2]
  • Attention-to-Base (A2B) Ratio: the 1:3 attention-to-convolution ratio sharply reduces the quadratic O(N²) cost of softmax attention, and with it the KV-cache memory requirement[1]
  • Sparse MoE Design: Only 2.3B of 24B parameters activate per token, enabling inference latency and energy efficiency comparable to 2B models while maintaining reasoning capability of larger models[1]
  • Context Window: Supports 32k token window with near-linear scaling in long-context tasks[1]
  • Training Infrastructure: 3x improvement in training efficiency over LFM v1, reducing computational cost for model development[2]
  • CPU Optimization: Dominates the Pareto frontier for both prefill and decode speed on CPU via ExecuTorch and llama.cpp; LFM2-700M outperforms Qwen3-0.6B despite being 16% larger[2]
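The KV-cache saving claimed above follows from simple accounting: only attention blocks keep a KV cache, so a hybrid with 6 attention blocks out of 16 stores 6/16 of what an all-attention stack of the same depth would. A back-of-envelope sketch, with head counts and dimensions as illustrative assumptions rather than LFM2's published configuration:

```python
# Back-of-envelope KV-cache sizing. Only attention layers cache K/V;
# convolution layers cache nothing. Head/dim values are illustrative.

def kv_cache_bytes(n_attn_layers: int, seq_len: int,
                   n_kv_heads: int = 8, head_dim: int = 64,
                   bytes_per_elem: int = 2) -> int:
    """K and V tensors (factor 2) for every attention layer, fp16."""
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

seq = 32_768  # the 32k context window mentioned above
full = kv_cache_bytes(16, seq)   # hypothetical all-attention 16-block model
hybrid = kv_cache_bytes(6, seq)  # 6 attention blocks out of 16

print(f"all-attention: {full / 2**20:.0f} MiB, "
      f"hybrid: {hybrid / 2**20:.0f} MiB "
      f"({hybrid / full:.2%} of full)")
```

Since the cache grows linearly with sequence length in both cases, the 6/16 ratio holds at every context size; the hybrid simply starts from a much smaller constant.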

🔮 Future Implications
AI analysis grounded in cited sources

  • Sparse MoE + hybrid architectures will become standard for edge AI deployment: LFM2's 2.3B active parameters achieving 24B-class reasoning suggests the industry will shift from dense models to conditional computation for on-device inference.
  • AMD Ryzen AI Halo will compete directly with NVIDIA's GPU-centric edge AI strategy: full ROCm support and significantly lower pricing than DGX Spark position AMD to capture enterprise and developer markets that prioritize cost-efficiency over raw throughput.
  • Convolution-attention hybrids will reduce memory bottlenecks in long-context applications: eliminating quadratic KV-cache scaling makes 32k+ token windows practical on consumer hardware, unlocking document-processing and multi-turn agent use cases.
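The gated short-convolution blocks these hybrids rely on can be sketched compactly: a causal depthwise convolution wrapped in input and output gates, needing no per-token cache beyond a tiny kernel window. Kernel size and dimensions below are toy assumptions, not LFM2's actual operator:

```python
# Toy double-gated causal short convolution over a token sequence.
# Unlike attention, each position mixes only a fixed-size window of
# past positions, so memory does not grow with context length.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_short_conv(x, w, w_in, w_out):
    """x: (seq_len, d); w: (kernel, d) depthwise filter;
    w_in / w_out: (d, d) projections for the input and output gates."""
    seq_len, d = x.shape
    kernel = w.shape[0]
    gated = x * sigmoid(x @ w_in)            # input gate
    padded = np.vstack([np.zeros((kernel - 1, d)), gated])
    # causal depthwise conv: position t sees positions t-kernel+1 .. t
    conv = np.stack([
        (padded[t:t + kernel] * w).sum(axis=0) for t in range(seq_len)
    ])
    return conv * sigmoid(x @ w_out)         # output gate

seq_len, d, kernel = 12, 8, 3
x = rng.standard_normal((seq_len, d))
w = rng.standard_normal((kernel, d))
w_in = rng.standard_normal((d, d))
w_out = rng.standard_normal((d, d))
y = gated_short_conv(x, w, w_in, w_out)
print(y.shape)  # same shape as input, no KV cache needed
```

Because the receptive field is fixed, perturbing a late token cannot change earlier outputs, which is the causality property autoregressive decoding depends on.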

โณ Timeline

2026-02
Liquid AI releases LFM2 family with hybrid Attention-to-Base architecture and Sparse MoE design
2026-02
LFM2-24B-A2B achieves 26.8K tokens/sec throughput on a single H100, outperforming Qwen3-30B-A3B and gpt-oss-20b
2026-02
AMD Ryzen AI Halo mini-PC announced with 256-bit unified memory interface, full ROCm support, and compact form factor
📰 Weekly AI Recap

Read this week's curated digest of top AI events →

👉 Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗