Apple M4 Neural Engine Unlocked via Reverse Engineering

🔑 Enhanced Key Takeaways

• The reverse engineering effort, led by developer maderix, mapped Apple's internal software stack, discovered private Objective-C APIs (_ANEClient, _ANECompiler, _ANEInMemoryModelDescriptor), and cracked the proprietary Model Intermediate Language (MIL) compilation path and E5 binary format to achieve direct hardware access.
• The 15.8 TFLOPS figure is for FP16 compute. Apple's advertised 38 TOPS (trillion operations per second) is typically quoted for INT8 operations, which can mislead because the ANE dequantizes INT8 weights to FP16 before computation.
• The bypass enabled full backpropagation and transformer training directly on the M4's Neural Engine. Apple's software had previously limited it to inference-only workloads via frameworks such as Core ML and MLX.
• The custom implementation ran entirely in RAM, avoiding slower NAND flash writes, and showed that the ANE is a dedicated execution engine optimized for neural network graphs rather than a general-purpose processor.
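
The gap between the marketed and measured numbers can be checked with a little arithmetic. The sketch below is illustrative only: the 2:1 ratio between the INT8 headline and the FP16 peak is an assumption inferred from this article's own figures (38 TOPS and roughly 19 TFLOPS), not a published Apple specification, and the function names are hypothetical.

```python
# Figures quoted in this article.
MARKETED_INT8_TOPS = 38.0     # Apple's advertised Neural Engine figure
MEASURED_FP16_TFLOPS = 15.8   # FP16 throughput reported after the bypass

# Assumption: a 2:1 ratio between the INT8 marketing number and the FP16
# peak, inferred from the article's 38 TOPS and ~19 TFLOPS figures.
INT8_TO_FP16_RATIO = 2.0


def fp16_peak_tflops(int8_tops: float, ratio: float = INT8_TO_FP16_RATIO) -> float:
    """Estimate the FP16 peak implied by an INT8 TOPS figure."""
    return int8_tops / ratio


def utilization(measured: float, peak: float) -> float:
    """Fraction of a peak figure that a measured throughput reaches."""
    return measured / peak


peak = fp16_peak_tflops(MARKETED_INT8_TOPS)
print(f"Estimated FP16 peak: {peak:.1f} TFLOPS")
print(f"Measured vs FP16 peak: {utilization(MEASURED_FP16_TFLOPS, peak):.1%}")
print(f"Measured vs INT8 headline: {utilization(MEASURED_FP16_TFLOPS, MARKETED_INT8_TOPS):.1%}")
```

Under that assumption, the measured 15.8 TFLOPS is roughly 83% of the estimated FP16 peak but only about 42% of the INT8 headline number, which is why comparing it directly against 38 TOPS understates how close the result is to the hardware limit.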

📊 Competitor Analysis

Feature / Competitor	Apple M4 Neural Engine	Qualcomm Snapdragon X Elite (NPU)	Intel Core Ultra (NPU)	AMD Ryzen AI (NPU)	NVIDIA Laptop GPUs (e.g., RTX 5070)
AI Performance (Peak)	38 TOPS (INT8), ~19 TFLOPS (FP16)	45 TOPS (INT4/INT8)	Up to 74 TOPS (INT8, Nova Lake-S)	Up to 50 TOPS (INT8), 25 TFLOPS (BF16, XDNA2)	Up to 798 AI TOPS (INT8), 23.22 TFLOPS (FP16/FP32, RTX 5070)
Primary Use Case	On-device inference (officially), now training (via reverse engineering)	On-device AI acceleration, LLMs	On-device AI acceleration, energy efficiency	On-device inference and training, real-time generative AI	High-performance AI training and inference, graphics
Software Stack	CoreML, MLX (official), custom MIL (reverse engineered)	Hexagon NPU SDK	OpenVINO, Windows ML	AMD IRON, MLIR-AIR	CUDA, TensorRT
Memory Architecture	Unified Memory (shared with CPU/GPU)	LPDDR5x-8448 (part of SoC)	Integrated NPU within SoC	Unified memory (shared with CPU/GPU)	Dedicated VRAM (e.g., 8GB GDDR7)

🛠️ Technical Deep Dive

The M4 Neural Engine (codename H16G) features 16 cores, a queue depth of 127 evaluation requests, independent Dynamic Voltage and Frequency Scaling (DVFS), and hard power gating for efficiency.
It functions as a graph execution engine rather than a general-purpose CPU or GPU: a fixed-function accelerator that runs each compiled neural network graph as an atomic operation.
The reverse engineering process involved identifying and utilizing private Objective-C APIs, including _ANEClient, _ANECompiler, and _ANEInMemoryModelDescriptor, and deciphering the proprietary E5 binary format.
The custom Model Intermediate Language (MIL) implementation enabled direct compilation and execution of compute graphs on the ANE, bypassing Apple's official Core ML framework.
The M4 chip is manufactured using TSMC's second-generation 3-nanometer process and integrates 28 billion transistors.
The M4 is reportedly Apple's first SoC to adopt the ARMv9 CPU architecture, supporting the Scalable Matrix Extension (SME) but notably lacking Scalable Vector Extension (SVE) support.
The Neural Engine dequantizes INT8 weights to FP16 prior to computation, indicating that its true peak performance for AI workloads is approximately 19 TFLOPS (FP16), despite higher INT8 TOPS marketing figures.
Apple's unified memory architecture lets the CPU, GPU, and Neural Engine share a single memory pool, which helps mitigate data transfer bottlenecks; the M4 Max variant offers up to 546 GB/s of memory bandwidth.
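
The dequantize-then-compute behavior described above can be illustrated with a small sketch. This is not the ANE's actual scheme, which the source does not document: the affine formula (scale times quantized value minus zero point), the FP16 accumulation, and all function names are assumptions chosen for illustration. It uses the `e` (half-precision) format of Python's standard `struct` module to emulate FP16 rounding.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dequantize_int8(q_weights, scale, zero_point=0):
    """Expand INT8 weights to FP16 values (hypothetical affine scheme)."""
    out = []
    for q in q_weights:
        if not -128 <= q <= 127:
            raise ValueError(f"{q} is outside the INT8 range")
        out.append(to_fp16(scale * (q - zero_point)))
    return out


def fp16_dot(weights, activations):
    """Dot product where every multiply and add is rounded to FP16.

    Accumulating in FP16 is a simplification made for this sketch.
    """
    acc = 0.0
    for w, a in zip(weights, activations):
        acc = to_fp16(acc + to_fp16(w * a))
    return acc


# INT8 storage saves memory, but the arithmetic below runs on FP16 values.
weights = dequantize_int8([-128, 0, 127], scale=0.5)
print(weights)
print(fp16_dot(weights, [1.0, 2.0, 2.0]))
```

The point of the sketch is that quantized weights reduce storage and bandwidth, while every multiply-accumulate still costs an FP16 operation, so INT8 models would not run faster than the FP16 rate on hardware that behaves this way.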

🔮 Future Implications (AI analysis grounded in cited sources)

Increased accessibility for on-device AI model training.

Bypassing software restrictions could lead to community-driven tools and frameworks that enable developers to leverage the Neural Engine for training, expanding local AI capabilities beyond inference.

Potential for enhanced privacy and efficiency in AI applications.

Performing AI training and inference entirely on-device, without cloud reliance, inherently improves data privacy and can lead to more energy-efficient AI workloads.

Apple may respond with stricter hardware lockdowns or new developer APIs.

The reverse engineering could prompt Apple either to further secure its hardware against such bypasses or, conversely, to release more flexible official APIs that support advanced AI development on its platform.

⏳ Timeline

2017-09

Apple introduces the first Neural Engine (0.6 TOPS) in the A11 Bionic chip, along with the Core ML framework for developers.

2020-11

Apple releases the M1 chip, integrating a 16-core Neural Engine capable of 11 TOPS, bringing dedicated AI acceleration to Macs.

2022-06

The M2 chip is introduced, featuring an improved Neural Engine with performance around 15.8 TOPS.

2023-10

Apple launches the M3 chip, with its Neural Engine offering approximately 18 TOPS (INT16 operations), or 35 TOPS (INT8).

2024-05

The Apple M4 chip is unveiled, featuring a Neural Engine capable of 38 TOPS (INT8), marketed as Apple's most powerful to date.

2026-06-16

Developers successfully bypass Apple's software restrictions on the M4 Neural Engine, unlocking 15.8 TFLOPS of FP16 AI compute for training.
