Apple M4 Neural Engine Unlocked via Reverse Engineering
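... mechanism "restarts" the process when needed so that training can continue.

The report further states that nothing was written to NAND flash during this process; data and state were kept entirely in RAM, which significantly improved speed. With the software restrictions bypassed, the M4 in an iPad or Mac can reach about 15.8 TFLOPS of AI processing performance, enough for model training without relying on an expensive standalone computer or a high-end NVIDIA GPU.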
💡Unlock 15.8 TFLOPS of AI compute on M4 chips by bypassing Apple's software restrictions.
⚡ 30-Second TL;DR
What Changed
A reverse-engineering effort bypassed the software restrictions on the M4's Neural Engine, giving direct hardware access and enabling on-device training.
Why It Matters
This breakthrough allows developers to run high-performance AI models directly on M4 hardware without Apple's proprietary software stack. It opens new possibilities for local LLM inference and edge computing on Mac devices.
What To Do Next
Explore the custom MIL implementation to test local model inference performance on your M4-based Mac hardware.
🧠 Deep Insight
🔑 Enhanced Key Takeaways
- The reverse engineering effort, led by developer maderix, involved mapping Apple's internal software stack, discovering private Objective-C APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`), and cracking the proprietary Model Intermediate Language (MIL) compilation path and E5 binary format to achieve direct hardware access.
- The 15.8 TFLOPS figure is specifically FP16 compute. Apple's advertised 38 TOPS (trillion operations per second) is typically quoted for INT8 operations, which can be misleading because the ANE dequantizes INT8 weights to FP16 before computation.
- The bypass enabled full backpropagation and transformer training directly on the M4's Neural Engine, a capability previously restricted by Apple's software to inference-only workloads via frameworks like CoreML and MLX.
- The custom implementation operated entirely in RAM, avoiding slower NAND flash writes, and demonstrated that the ANE is a dedicated graph execution engine optimized for neural network graphs rather than a general-purpose processor.
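The relationship between the throughput figures quoted in this article can be checked with a few lines of arithmetic. The sketch below is only illustrative: it takes the 38 TOPS, ~19 TFLOPS and 15.8 TFLOPS numbers at face value, and the constant names and the `utilization` helper are made up for this example.

```python
# Figures quoted in this article; treating them as directly comparable
# is an assumption made for illustration.
INT8_TOPS_ADVERTISED = 38.0   # Apple's marketing figure (INT8)
FP16_TFLOPS_PEAK = 19.0       # approximate FP16 peak cited in the article
FP16_TFLOPS_MEASURED = 15.8   # throughput reported after the bypass


def utilization(measured: float, peak: float) -> float:
    """Return the fraction of peak throughput that was actually reached."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return measured / peak


# How the FP16 peak compares with the INT8 marketing number.
fp16_to_int8_ratio = FP16_TFLOPS_PEAK / INT8_TOPS_ADVERTISED

# How close the reported result comes to the FP16 peak.
ane_utilization = utilization(FP16_TFLOPS_MEASURED, FP16_TFLOPS_PEAK)

print(f"FP16 peak / INT8 marketing figure: {fp16_to_int8_ratio:.2f}")
print(f"Measured / FP16 peak: {ane_utilization:.1%}")
```

With these inputs the script prints a ratio of 0.50 and a utilization of 83.2%, so the reported throughput sits at about 83% of the FP16 peak cited here.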
📊 Competitor Analysis
| Feature | Apple M4 Neural Engine | Qualcomm Snapdragon X Elite (NPU) | Intel Core Ultra (NPU) | AMD Ryzen AI (NPU) | NVIDIA Laptop GPUs (e.g., RTX 5070) |
|---|---|---|---|---|---|
| AI Performance (Peak) | 38 TOPS (INT8), ~19 TFLOPS (FP16) | 45 TOPS (INT4/INT8) | Up to 74 TOPS (INT8, Nova Lake-S) | Up to 50 TOPS (INT8), 25 TFLOPS (BF16, XDNA2) | Up to 798 AI TOPS (INT8), 23.22 TFLOPS (FP16/FP32, RTX 5070) |
| Primary Use Case | On-device inference (officially), now training (via reverse engineering) | On-device AI acceleration, LLMs | On-device AI acceleration, energy efficiency | On-device inference and training, real-time generative AI | High-performance AI training and inference, graphics |
| Software Stack | CoreML, MLX (official), custom MIL (reverse engineered) | Hexagon NPU SDK | OpenVINO, Windows ML | AMD IRON, MLIR-AIR | CUDA, TensorRT |
| Memory Architecture | Unified Memory (shared with CPU/GPU) | LPDDR5x-8448 (part of SoC) | Integrated NPU within SoC | Unified memory (shared with CPU/GPU) | Dedicated VRAM (e.g., 8GB GDDR7) |
🛠️ Technical Deep Dive
- The M4 Neural Engine (codename H16G) features 16 cores, a queue depth of 127 evaluation requests, independent Dynamic Voltage and Frequency Scaling (DVFS), and hard power gating for efficiency.
- It functions as a graph execution engine rather than a general-purpose CPU or GPU: a fixed-function accelerator that executes each compiled neural network graph as an atomic operation.
- The reverse engineering process involved identifying and utilizing private Objective-C APIs, including `_ANEClient`, `_ANECompiler`, and `_ANEInMemoryModelDescriptor`, and deciphering the proprietary E5 binary format.
- The custom Model Intermediate Language (MIL) implementation enabled direct compilation and execution of compute graphs on the ANE, bypassing Apple's official CoreML framework.
- The M4 chip is manufactured using TSMC's second-generation 3-nanometer process and integrates 28 billion transistors.
- It is reportedly Apple's first SoC to adopt the ARMv9 CPU architecture, supporting the Scalable Matrix Extension (SME) but notably lacking Scalable Vector Extension (SVE) support.
- The Neural Engine dequantizes INT8 weights to FP16 prior to computation, indicating that its true peak performance for AI workloads is approximately 19 TFLOPS (FP16), despite higher INT8 TOPS marketing figures.
- Apple's unified memory architecture lets the CPU, GPU, and Neural Engine share a single memory pool, which helps mitigate data transfer bottlenecks; the M4 Max variant offers up to 546 GB/s of memory bandwidth.
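To make the dequantization point concrete, here is a minimal pure-Python sketch of INT8 weights being expanded to FP16 before a multiply-accumulate. It is not the ANE's actual data path: the per-tensor scale, the FP16 accumulator, and the function names `to_fp16`, `dequantize` and `dot_fp16` are all assumptions chosen for illustration. Half-precision rounding is emulated with the standard library's `struct` format code `e`.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dequantize(weights_int8: list[int], scale: float) -> list[float]:
    """Expand INT8 weights to FP16 using one per-tensor scale (assumed scheme)."""
    for w in weights_int8:
        if not -128 <= w <= 127:
            raise ValueError(f"{w} is outside the INT8 range")
    return [to_fp16(w * scale) for w in weights_int8]


def dot_fp16(weights: list[float], activations: list[float]) -> float:
    """Multiply-accumulate with every intermediate value rounded to FP16."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc = to_fp16(acc + to_fp16(w * to_fp16(a)))
    return acc


weights_int8 = [64, -32, 127, -128]   # storage format: one byte per weight
scale = 1.0 / 64                      # hypothetical quantization scale
activations = [0.5, 1.0, -0.25, 2.0]

weights_fp16 = dequantize(weights_int8, scale)
result = dot_fp16(weights_fp16, activations)

print(weights_fp16)
print(result)
```

All arithmetic in this sketch happens in FP16, so storing the weights as INT8 shrinks their memory footprint but does not change the number of floating-point operations. Under that assumption, the FP16 figure is the more meaningful measure of compute.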
Original source: cnBeta


