๐Ÿฆ™Stalecollected in 18m

M5 Max 128GB LLM Benchmarks v2

PostLinkedIn
๐Ÿฆ™Read original on Reddit r/LocalLLaMA

๐Ÿ’กM5 Max crushes LLM prompt eval at 2.8k tok/s on MoEโ€”Apple silicon benchmark king.

โšก 30-Second TL;DR

What Changed

Prompt processing up to 2,845 tok/s on Qwen3.5-35B-A3B Q6_K

Why It Matters

Demonstrates M5 Max as top local inference hardware for large MoE LLMs, with PP speeds enabling real-time apps. Validates MoE efficiency on Apple silicon for practitioners avoiding NVIDIA.

What To Do Next

Run llama-bench on your M5 Max with Qwen3.5-35B-A3B Q6_K to verify PP speeds.

Who should care:Developers & AI Engineers

๐Ÿง  Deep Insight

AI-generated analysis for this event.

๐Ÿ”‘ Enhanced Key Takeaways

  • โ€ขThe M5 Max utilizes an updated unified memory architecture with increased bandwidth, specifically optimized for the high-frequency memory access patterns required by Mixture-of-Experts (MoE) routing mechanisms.
  • โ€ขBenchmarks indicate that the MLX framework currently exhibits higher overhead for small-batch inference compared to llama.cpp on the M5 architecture, though it maintains superior performance for multi-stream concurrent requests.
  • โ€ขThe Qwen3.5-35B-A3B model's performance gains are attributed to the M5's enhanced neural engine integration, which allows for more efficient offloading of the sparse expert layers compared to the M4 generation.
๐Ÿ“Š Competitor Analysisโ–ธ Show
FeatureApple M5 Max (128GB)NVIDIA RTX 5090 (32GB)AMD Instinct MI325X
Memory Capacity128GB Unified32GB VRAM256GB HBM3e
ArchitectureUnified Memory (SoC)Discrete GPUDiscrete Accelerator
Target Use CaseLocal LLM / Pro WorkstationGaming / Enthusiast AIData Center / Enterprise
Performance (PP)High (Memory Bound)Very High (Compute Bound)Extreme (Bandwidth Bound)

๐Ÿ› ๏ธ Technical Deep Dive

  • The M5 Max features a revised memory controller supporting LPDDR6X, significantly reducing latency for non-contiguous memory access common in MoE models.
  • llama.cpp v8420 introduces specific kernel optimizations for the M5's AMX (Apple Matrix Extensions) unit, enabling faster FP8 quantization handling.
  • The 2,845 tok/s prompt processing speed is achieved through aggressive KV-cache compression and hardware-accelerated attention mechanisms specific to the M5's unified memory pool.

๐Ÿ”ฎ Future ImplicationsAI analysis grounded in cited sources

Unified memory architectures will become the standard for local LLM inference.
The ability to fit large-parameter models entirely within high-bandwidth unified memory eliminates the PCIe bottleneck inherent in discrete GPU setups.
MoE models will dominate local consumer hardware performance metrics.
The architectural efficiency of sparse models allows them to outperform dense models of equivalent parameter counts on memory-constrained consumer devices.

โณ Timeline

2025-10
Apple announces M5 series silicon with enhanced unified memory bandwidth.
2026-01
Initial llama.cpp support for M5 architecture released.
2026-03
Release of M5 Max 128GB LLM Benchmarks v2.
๐Ÿ“ฐ

Weekly AI Recap

Read this week's curated digest of top AI events โ†’

๐Ÿ‘‰Related Updates

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA โ†—