
MoE Abliteration: Refusals Are Routed Through Specific Experts


💡 MoE abliteration: expert routing hides safety refusals from weight baking

⚡ 30-Second TL;DR

What Changed

Separate subspaces: Chinese-political vs. Western-safety refusals live in distinct activation directions

Why It Matters

Enables precise uncensoring of local MoE models without losing safety. Reveals architecture-specific refusal mechanisms for future ablations.

What To Do Next

Run https://github.com/trevorgordon981/alfred-abliterate to ablate refusals in your Qwen MoE model.

Who should care: Researchers & Academics

🧠 Deep Insight

AI-generated analysis for this event.

🔑 Enhanced Key Takeaways

  • The 'Abliteration' technique leverages the Gram-Schmidt process to identify and neutralize specific activation vectors, effectively decoupling refusal behavior from core reasoning capabilities.
  • The Qwen3.5-397B-A17B architecture uses a specialized gating mechanism that routes experts by input semantic cluster, which explains why simple weight-baking is insufficient to remove the safety layer completely.
  • The research highlights a critical vulnerability in large-scale MoE models: safety-aligned experts are baked into the routing logic, so 'inference hooks' are needed to intercept and redirect activations at the router level.
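The projection step the takeaways describe can be sketched as follows: build a refusal direction per subspace as a difference of means, orthonormalize the directions with Gram-Schmidt, and subtract their components from the hidden states. This is a minimal illustration, not the post's actual code; all function names, shapes, and thresholds are assumptions.

```python
# Hedged sketch of activation "abliteration": orthonormal refusal
# directions (Gram-Schmidt) are projected out of hidden states, leaving
# orthogonal (non-refusal) features untouched. Illustrative only.
import numpy as np

def refusal_direction(h_refused: np.ndarray, h_complied: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between activations on prompts the
    model refused vs. complied with, unit-normalized."""
    d = h_refused.mean(axis=0) - h_complied.mean(axis=0)
    return d / np.linalg.norm(d)

def gram_schmidt(directions: list) -> np.ndarray:
    """Orthonormalize several refusal directions (e.g. a political and a
    safety subspace) so each can be removed independently."""
    basis = []
    for d in directions:
        for b in basis:
            d = d - (d @ b) * b          # subtract already-captured components
        n = np.linalg.norm(d)
        if n > 1e-8:                     # skip near-duplicate directions
            basis.append(d / n)
    return np.stack(basis)

def ablate(h: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """h' = h - sum_i (h @ b_i) b_i: remove the refusal subspace."""
    return h - (h @ basis.T) @ basis
```

Baking this projection into the weights of a dense model works because every token passes through the same matrices; in an MoE, tokens can be routed around the edited experts, which is precisely why the post argues baking alone is insufficient.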

๐Ÿ› ๏ธ Technical Deep Dive

  • Model Architecture: Qwen3.5-397B-A17B (Mixture-of-Experts) with 17B active parameters per token.
  • Abliteration Method: orthogonal projection that removes the component of activation vectors lying in a subspace defined by refusal-inducing prompts, using Gram-Schmidt orthonormalization to minimize impact on non-refusal tokens.
  • Inference Hook Mechanism: a custom runtime layer intercepts the router's top-k expert selection, bypassing 'safety-aligned' experts when refusal-triggering semantic patterns are detected.
  • Hardware Constraints: a Mac Studio M3 Ultra has unified memory rather than discrete VRAM, so fitting the 397B-parameter footprint requires aggressive quantization (likely 4-bit or EXL2); at 4 bits per weight the weights alone are roughly 397e9 × 0.5 bytes ≈ 200 GB.
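A minimal sketch of such a router-level hook, assuming a PyTorch-style MoE where a linear gate emits per-expert logits before top-k selection. The module path and the blocked-expert indices below are hypothetical, not taken from the alfred-abliterate repo:

```python
# Hedged sketch of an inference-time router hook: mask the gate logits of
# hypothetical "safety-aligned" experts to -inf so top-k selection can
# never route a token to them. Not the actual alfred-abliterate code.
import torch

def make_router_hook(blocked_experts):
    blocked = torch.tensor(blocked_experts)

    def hook(module, inputs, output):
        logits = output.clone()              # [tokens, num_experts]
        logits[:, blocked] = float("-inf")   # blocked experts can never win top-k
        return logits                        # returned tensor replaces the output

    return hook

# Hypothetical wiring for a Qwen-style MoE stack:
# for layer in model.model.layers:
#     layer.mlp.gate.register_forward_hook(make_router_hook([3, 17]))
```

Unlike weight baking, this intervention exists only at inference time: remove the hook and the original routing (including the safety experts) comes back, which matches the post's framing of hooks as a workaround for routing the baking step cannot reach.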

🔮 Future Implications
AI analysis grounded in cited sources

Model providers will shift toward 'Router-Obfuscation' to prevent expert-level intervention.
As researchers successfully target specific expert routing, developers will likely implement non-linear or encrypted routing tables to make identifying safety-aligned experts computationally prohibitive.
Standardized 'Abliteration-Resistance' benchmarks will emerge by Q4 2026.
The ease of removing safety filters via subspace manipulation forces the industry to develop models that are inherently resistant to vector-based activation steering.

โณ Timeline

2025-09
Release of Qwen3.5 base models, introducing the 397B MoE architecture.
2026-01
Initial community research on 'Abliteration' techniques applied to dense LLMs.
2026-03
Development of GatedDeltaNet integration for MoE routing control.


AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA ↗