M5 Max 128GB LLM Benchmarks v2

Post LinkedIn

🦙Read original on Reddit r/LocalLLaMA

#benchmarks #moe #apple-siliconapple-m5-max

💡M5 Max crushes LLM prompt eval at 2.8k tok/s on MoE—Apple silicon benchmark king.

⚡ 30-Second TL;DR

What Changed

Prompt processing up to 2,845 tok/s on Qwen3.5-35B-A3B Q6_K

Why It Matters

Demonstrates M5 Max as top local inference hardware for large MoE LLMs, with PP speeds enabling real-time apps. Validates MoE efficiency on Apple silicon for practitioners avoiding NVIDIA.

What To Do Next

Run llama-bench on your M5 Max with Qwen3.5-35B-A3B Q6_K to verify PP speeds.

Who should care:Developers & AI Engineers

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

•The M5 Max utilizes an updated unified memory architecture with increased bandwidth, specifically optimized for the high-frequency memory access patterns required by Mixture-of-Experts (MoE) routing mechanisms.
•Benchmarks indicate that the MLX framework currently exhibits higher overhead for small-batch inference compared to llama.cpp on the M5 architecture, though it maintains superior performance for multi-stream concurrent requests.
•The Qwen3.5-35B-A3B model's performance gains are attributed to the M5's enhanced neural engine integration, which allows for more efficient offloading of the sparse expert layers compared to the M4 generation.

📊 Competitor Analysis▸ Show

Feature	Apple M5 Max (128GB)	NVIDIA RTX 5090 (32GB)	AMD Instinct MI325X
Memory Capacity	128GB Unified	32GB VRAM	256GB HBM3e
Architecture	Unified Memory (SoC)	Discrete GPU	Discrete Accelerator
Target Use Case	Local LLM / Pro Workstation	Gaming / Enthusiast AI	Data Center / Enterprise
Performance (PP)	High (Memory Bound)	Very High (Compute Bound)	Extreme (Bandwidth Bound)

🛠️ Technical Deep Dive

The M5 Max features a revised memory controller supporting LPDDR6X, significantly reducing latency for non-contiguous memory access common in MoE models.
llama.cpp v8420 introduces specific kernel optimizations for the M5's AMX (Apple Matrix Extensions) unit, enabling faster FP8 quantization handling.
The 2,845 tok/s prompt processing speed is achieved through aggressive KV-cache compression and hardware-accelerated attention mechanisms specific to the M5's unified memory pool.

🔮 Future ImplicationsAI analysis grounded in cited sources

Unified memory architectures will become the standard for local LLM inference.

The ability to fit large-parameter models entirely within high-bandwidth unified memory eliminates the PCIe bottleneck inherent in discrete GPU setups.

MoE models will dominate local consumer hardware performance metrics.

The architectural efficiency of sparse models allows them to outperform dense models of equivalent parameter counts on memory-constrained consumer devices.

⏳ Timeline

2025-10

Apple announces M5 series silicon with enhanced unified memory bandwidth.

2026-01

Initial llama.cpp support for M5 architecture released.

2026-03

Release of M5 Max 128GB LLM Benchmarks v2.

🦙Read original article on Reddit r/LocalLLaMA

📰

Weekly AI Recap

Read this week's curated digest of top AI events →

👉Related Updates

Same topic

Explore #benchmarks

Same product