🦙 Reddit r/LocalLLaMA • collected in 18m
M5 Max 128GB LLM Benchmarks v2
💡 M5 Max crushes LLM prompt eval at 2.8k tok/s on MoE, making it the Apple silicon benchmark king.
⚡ 30-Second TL;DR
What Changed
Prompt processing up to 2,845 tok/s on Qwen3.5-35B-A3B Q6_K
Why It Matters
Demonstrates M5 Max as top local inference hardware for large MoE LLMs, with PP speeds enabling real-time apps. Validates MoE efficiency on Apple silicon for practitioners avoiding NVIDIA.
What To Do Next
Run llama-bench on your M5 Max with Qwen3.5-35B-A3B Q6_K to verify PP speeds (a scripted version of this check follows this section).
Who should care: Developers & AI Engineers
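A minimal sketch of that check, assuming llama-bench (built from llama.cpp) is on your PATH; the model path below is a hypothetical placeholder to swap for your own GGUF file:

```python
import subprocess

# Hypothetical local path -- point this at your own Q6_K GGUF.
MODEL = "models/Qwen3.5-35B-A3B-Q6_K.gguf"

# -p sets the prompt length for the prompt-processing (PP) measurement,
# -n the generated-token count for text generation (TG), -ngl 99 offloads
# all layers to the Metal backend, and -r averages over five repetitions.
subprocess.run(
    ["llama-bench", "-m", MODEL, "-p", "2048", "-n", "128", "-ngl", "99", "-r", "5"],
    check=True,
)
```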
🧠 Deep Insight
AI-generated analysis for this event.
Enhanced Key Takeaways
- The M5 Max utilizes an updated unified memory architecture with increased bandwidth, specifically optimized for the high-frequency memory access patterns required by Mixture-of-Experts (MoE) routing mechanisms (see the routing sketch after this list).
- Benchmarks indicate that the MLX framework currently exhibits higher overhead for small-batch inference compared to llama.cpp on the M5 architecture, though it maintains superior performance for multi-stream concurrent requests.
- The Qwen3.5-35B-A3B model's performance gains are attributed to the M5's enhanced neural engine integration, which allows for more efficient offloading of the sparse expert layers compared to the M4 generation.
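To make that access pattern concrete, here is a minimal top-k MoE routing sketch in NumPy; the expert count and dimensions are illustrative placeholders, not Qwen's actual configuration:

```python
import numpy as np

def moe_route(x, gate_w, experts, k=2):
    # x: (d,) token activation; gate_w: (d, n_experts) router weights;
    # experts: list of (d, d) expert weight matrices.
    logits = x @ gate_w                      # router scores, shape (n_experts,)
    top = np.argsort(logits)[-k:]            # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # softmax over the selected experts only
    # Only the k chosen expert matrices are ever read from memory; the rest
    # stay untouched -- the scattered, high-frequency access pattern the
    # first takeaway above refers to.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                  # illustrative 64-dim activation
gate_w = rng.standard_normal((64, 8))        # 8 experts, illustrative
experts = [rng.standard_normal((64, 64)) for _ in range(8)]
print(moe_route(x, gate_w, experts).shape)   # -> (64,)
```

Per token, only k of the n expert weight blocks are fetched, which is why MoE throughput tracks memory bandwidth so closely.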
Competitor Analysis
| Feature | Apple M5 Max (128GB) | NVIDIA RTX 5090 (32GB) | AMD Instinct MI325X |
|---|---|---|---|
| Memory Capacity | 128GB Unified | 32GB VRAM | 256GB HBM3e |
| Architecture | Unified Memory (SoC) | Discrete GPU | Discrete Accelerator |
| Target Use Case | Local LLM / Pro Workstation | Gaming / Enthusiast AI | Data Center / Enterprise |
| Performance (PP) | High (Memory Bound) | Very High (Compute Bound) | Extreme (Bandwidth Bound) |
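A rough way to read the "Memory Bound" vs "Compute Bound" labels is a back-of-the-envelope roofline estimate; every hardware number below is an assumed placeholder, not a measured spec:

```python
# Per-token latency is bounded by whichever is slower: doing the math or
# streaming the active weights. All numbers are illustrative assumptions.
ACTIVE_PARAMS = 3e9      # ~3B active params/token (the "A3B" in the model name)
BYTES_PER_PARAM = 0.8    # rough bytes per weight at Q6_K
PEAK_BW = 500e9          # assumed unified-memory bandwidth, bytes/s
PEAK_FLOPS = 30e12       # assumed sustained GPU compute, FLOP/s

t_compute = (2 * ACTIVE_PARAMS) / PEAK_FLOPS             # ~2 FLOPs per weight
t_memory = (ACTIVE_PARAMS * BYTES_PER_PARAM) / PEAK_BW

bound = "memory" if t_memory > t_compute else "compute"
print(f"~{1 / max(t_compute, t_memory):,.0f} tok/s ceiling ({bound}-bound)")
```

Single-token generation hits the memory wall first; batched prompt processing reuses each weight block across many tokens, which is how PP reaches into the thousands of tok/s while generation stays far lower.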
🛠️ Technical Deep Dive
- The M5 Max features a revised memory controller supporting LPDDR6X, significantly reducing latency for non-contiguous memory access common in MoE models.
- llama.cpp v8420 introduces specific kernel optimizations for the M5's AMX (Apple Matrix Extensions) unit, enabling faster FP8 quantization handling.
- The 2,845 tok/s prompt processing speed is achieved through aggressive KV-cache compression and hardware-accelerated attention mechanisms specific to the M5's unified memory pool (a rough cache-size estimate follows this list).
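To see why KV-cache compression matters at benchmark-scale prompts, here is a quick size estimate; the layer count, head dimensions, and context length are assumed round numbers, not Qwen's published configuration:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# All model dimensions below are illustrative assumptions.
N_LAYERS = 48
N_KV_HEADS = 8       # grouped-query attention keeps this well below the head count
HEAD_DIM = 128
CTX = 32768          # tokens held in cache
BYTES = 2            # fp16; an 8-bit quantized cache would halve this

per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES
print(f"{per_token / 1024:.0f} KiB/token, "
      f"~{per_token * CTX / 1e9:.1f} GB at {CTX} tokens")
```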
🔮 Future Implications
AI analysis grounded in cited sources
- **Unified memory architectures will become the standard for local LLM inference.** The ability to fit large-parameter models entirely within high-bandwidth unified memory eliminates the PCIe bottleneck inherent in discrete GPU setups (a rough transfer-time comparison follows this list).
- **MoE models will dominate local consumer hardware performance metrics.** The architectural efficiency of sparse models allows them to outperform dense models of equivalent parameter counts on memory-constrained consumer devices.
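As a rough illustration of that bottleneck, compare the time just to move a checkpoint's weights; the link speed, bandwidth, and model size are assumed round numbers:

```python
# Time to move ~30 GB of Q6_K weights over PCIe 5.0 x16 (~64 GB/s per
# direction) versus reading them in place from unified memory (~500 GB/s
# assumed). Round illustrative numbers only.
MODEL_GB = 30
PCIE5_X16_GBPS = 64
UNIFIED_BW_GBPS = 500

print(f"PCIe 5.0 x16 transfer: {MODEL_GB / PCIE5_X16_GBPS:.2f} s")   # ~0.47 s
print(f"Unified-memory read:   {MODEL_GB / UNIFIED_BW_GBPS:.3f} s")  # ~0.060 s
```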
⏳ Timeline
2025-10
Apple announces M5 series silicon with enhanced unified memory bandwidth.
2026-01
Initial llama.cpp support for M5 architecture released.
2026-03
Release of M5 Max 128GB LLM Benchmarks v2.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA →