SOMs Enable Multi-Directional Refusal Ablation


🦙Read original on Reddit r/LocalLLaMA

#refusal-ablation #self-organizing-maps #uncensoring #latent-manifold #heretic

💡Uncensor open-source models down to 3% refusals at 0.12 KL divergence: a new ablation breakthrough

⚡ 30-Second TL;DR

What Changed

Self-Organizing Maps identify refusal manifolds

Why It Matters

Removes refusals from open models while keeping divergence from the original model minimal, so local LLMs retain their capabilities. Community testing suggests the method applies broadly across open-source models.

What To Do Next

Apply the SOM method from the Heretic PR to a GPT-OSS-20B model and test its refusal rate.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

•SOMs mathematically generalize the prior single-direction difference-in-means technique by training on harmful prompt representations to identify multiple neurons in the refusal manifold[1][3].
•Paper accepted at AAAI 2026, authored by researchers from University of Cagliari and University of Genova[1].
•Ablating multiple SOM-derived directions outperforms single-direction baselines and specialized jailbreak algorithms in suppressing refusals[1][3].
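For context, the single-direction baseline mentioned in the takeaways can be sketched in a few lines. This is a minimal illustration, assuming activations are already extracted as plain lists of floats; the function names and the toy numbers are hypothetical and do not come from the paper or the Heretic code.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_in_means(harmful, harmless):
    """Unit-norm direction pointing from the harmless mean to the harmful mean."""
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

# Toy 3-dimensional "activations" (made-up numbers, not real model data).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = difference_in_means(harmful, harmless)
```

The single-direction method yields exactly one vector per layer; the SOM approach replaces the single harmful mean with several neuron centroids, producing one candidate direction per neuron.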

🛠️ Technical Deep Dive

•A SOM is trained without supervision on the internal representations of harmful prompts; its neurons cluster those representations, and each neuron yields a candidate refusal direction.
•The harmless-prompt centroid is subtracted from each SOM neuron's weight vector to derive the multi-directional refusal vectors.
•Ablation removes these directions from the model's internals; the edit is baked into the weights, so it adds no runtime overhead.
•Validated in an extensive experimental setup, where it outperforms single-direction methods and jailbreak algorithms[1][3].
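The steps in the bullets above can be sketched roughly as follows. This is a toy illustration under several assumptions that are not confirmed by the source: a 1-D SOM grid, linear decay schedules, Gram-Schmidt orthonormalisation of the directions before ablation, and a weight matrix whose rows index the model dimension. All function names and data are hypothetical; a real pipeline would operate on tensors extracted from the model.

```python
import math
import random

def train_som(data, n_neurons=4, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D SOM; returns one weight vector ("neuron") per grid node."""
    rng = random.Random(seed)
    # Initialise neurons from randomly chosen samples.
    neurons = [list(rng.choice(data)) for _ in range(n_neurons)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 1e-3)  # shrinking neighbourhood width
        for x in rng.sample(data, len(data)):     # shuffled pass over the data
            # Best-matching unit: neuron closest to x in Euclidean distance.
            bmu = min(range(n_neurons),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(x, neurons[j])))
            for j, w in enumerate(neurons):
                # Gaussian neighbourhood over grid distance |j - bmu|.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                neurons[j] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return neurons

def refusal_directions(neurons, harmless):
    """Subtract the harmless centroid from each neuron, then orthonormalise."""
    n = len(harmless)
    mu = [sum(col) / n for col in zip(*harmless)]
    basis = []
    for w in neurons:
        v = [wi - mi for wi, mi in zip(w, mu)]
        for b in basis:                           # Gram-Schmidt step
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-8:                           # skip near-duplicate directions
            basis.append([vi / norm for vi in v])
    return basis

def ablate_weights(W, basis):
    """Return W' = (I - sum_k r_k r_k^T) W for W with d_model rows."""
    new_cols = []
    for col in zip(*W):                           # each column lives in model space
        c = list(col)
        for r in basis:
            dot = sum(ci * ri for ci, ri in zip(c, r))
            c = [ci - dot * ri for ci, ri in zip(c, r)]
        new_cols.append(c)
    return [list(row) for row in zip(*new_cols)]

# Toy data (made up): harmful activations form two clusters away from the origin.
harmful = [[3.0, 0.1, 0.0], [3.1, -0.1, 0.0], [0.1, 3.0, 0.0], [-0.1, 3.1, 0.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
neurons = train_som(harmful, n_neurons=2)
basis = refusal_directions(neurons, harmless)
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # stand-in for an output projection
W_ablated = ablate_weights(W, basis)
```

Because the projection is applied to the weights once, the ablated matrix can simply replace the original at load time, which is what "no runtime overhead" refers to. Orthonormalising first is one way to make sure every listed direction is fully removed; whether the paper or the Heretic PR does this is not stated here.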

🔮 Future Implications

AI analysis grounded in cited sources

SOM ablation will become standard in open-weight model uncensoring pipelines

Because its edits are baked into the weights and keep KL divergence low, the method can be stacked with other ablation techniques, suppressing refusals without a notable loss in performance[1].
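The KL divergence figure measures how far the edited model's next-token distribution drifts from the original's. A minimal sketch of that measurement, assuming raw logits from both models on the same prompts and the direction KL(original || ablated); the helper names and the logits are hypothetical, and the exact evaluation protocol behind the reported numbers is not given here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def kl_divergence(logits_p, logits_q):
    """KL(p || q) in nats, where p and q are the softmax of the given logits."""
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_kl(original_logits, ablated_logits):
    """Average KL over a batch of prompts (one logit vector per prompt)."""
    kls = [kl_divergence(p, q) for p, q in zip(original_logits, ablated_logits)]
    return sum(kls) / len(kls)

# Made-up logits over a 3-token vocabulary for two prompts.
original = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ablated = [[1.5, 0.2, 0.0], [0.0, 1.0, 0.0]]
drift = mean_kl(original, ablated)
```

A value near zero means the edit barely changed the model's behaviour on those prompts; larger values indicate more collateral change.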

Multi-directional techniques will extend to other safety concepts beyond refusal

SOM-based manifold clustering should apply to any concept that occupies a low-dimensional region of an LLM's latent space, as has been demonstrated for refusal[1][6].

⏳ Timeline

2024-12

NeurIPS paper establishes single-direction refusal mediation across 13 chat models

2025-07

ICML poster reveals multiple independent refusal directions and concept cones via gradient methods

2025-11

arXiv release of SOMs for Multi-Directional Refusal Suppression paper

2026-02

Heretic repo PR implements SOM refusal ablation with strong benchmarks on GPT-OSS-20B and Qwen3.5-27B

📎 Sources (8)

Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
