SOMs Enable Multi-Dir Refusal Ablation

🦙 Read original on Reddit r/LocalLLaMA

💡 Uncensor OSS models to 3% refusals w/ 0.12 KL: new ablation breakthrough

⚡ 30-Second TL;DR

What Changed

Self-Organizing Maps identify refusal manifolds

Why It Matters

Removes refusals from open models with minimal divergence from the original model's outputs (a reported 0.12 KL at about 3% refusals), so local LLMs stay capable after uncensoring. Community testing suggests broad applicability across OSS models.
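
To make "minimal divergence" concrete: KL divergence compares the next-token distribution of the original model with that of the edited model. The snippet below is a minimal sketch with made-up probabilities. The two distributions, the 4-token vocabulary, and the direction KL(original || ablated) are illustrative assumptions, not measurements or conventions taken from the post.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token probabilities over a 4-token vocabulary.
original = [0.70, 0.20, 0.05, 0.05]   # model before ablation
ablated = [0.60, 0.25, 0.10, 0.05]    # model after ablation

kl = kl_divergence(original, ablated)
print(f"KL(original || ablated) = {kl:.4f} nats")
```

A value near zero means the edited model predicts almost the same tokens as the original on this prompt. In practice the figure would be averaged over many harmless prompts.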

What To Do Next

Apply the SOM method from the Heretic PR to your GPT-OSS-20B model and measure its refusal rate.

Who should care: Researchers & Academics

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • SOMs mathematically generalize the prior single-direction difference-in-means technique: trained on harmful prompt representations, their neurons identify multiple directions in the refusal manifold[1][3].
  • The paper was accepted at AAAI 2026 and is authored by researchers from the University of Cagliari and the University of Genova[1].
  • Ablating multiple SOM-derived directions outperforms single-direction baselines and specialized jailbreak algorithms in suppressing refusals[1][3].
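
For context, the single-direction difference-in-means baseline mentioned above can be sketched as follows. This is a toy illustration, not code from the paper or from Heretic: the 3-dimensional "activations" are invented, and the helper names `mean_vector` and `normalize` are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Invented activations: harmful prompts are shifted along the first axis.
harmful = [[2.0, 1.0, 0.0], [3.0, 1.0, 1.0], [2.5, 0.0, 0.5]]
harmless = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.0, 0.5]]

mu_harmful = mean_vector(harmful)
mu_harmless = mean_vector(harmless)

# Single refusal direction: difference of the two class means, normalized.
refusal_dir = normalize([a - b for a, b in zip(mu_harmful, mu_harmless)])
print(refusal_dir)
```

This yields exactly one direction per layer. The SOM approach replaces the single harmful mean with several neuron centroids, which gives one direction per neuron.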

🛠️ Technical Deep Dive

  • SOMs are trained unsupervised on harmful prompt representations, so that their neurons cluster the data and each neuron represents one candidate refusal direction.
  • The harmless prompt centroid is subtracted from each SOM neuron's centroid to derive the multi-directional refusal vectors.
  • Ablation edits the model internals along these directions; the edit is baked into the weights, so there is no runtime overhead.
  • The method was validated in an extensive experimental setup, where it outperformed single-direction methods and jailbreaks[1][3].
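
The steps above can be sketched end to end on toy data. This is a minimal sketch under stated assumptions, not the paper's or Heretic's implementation: the 1-D SOM with linear decay schedules, the Gram-Schmidt orthonormalization before projection, the 3-dimensional synthetic clusters, and all function names (`train_som`, `refusal_directions`, `ablate`) are choices made for illustration.

```python
import math
import random

def train_som(data, n_neurons, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a tiny 1-D SOM; returns one weight vector (centroid) per neuron."""
    rng = random.Random(seed)
    weights = [list(rng.choice(data)) for _ in range(n_neurons)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 1e-3)   # shrinking neighbourhood
        for x in rng.sample(data, len(data)):
            # Best-matching unit: the neuron closest to this sample.
            bmu = min(range(n_neurons), key=lambda j: math.dist(weights[j], x))
            for j in range(n_neurons):
                h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
                weights[j] = [w + lr * h * (xi - w) for w, xi in zip(weights[j], x)]
    return weights

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def refusal_directions(som_weights, harmless_centroid):
    """One unit direction per neuron: neuron centroid minus harmless centroid."""
    return [unit([w - c for w, c in zip(neuron, harmless_centroid)])
            for neuron in som_weights]

def gram_schmidt(vectors, eps=1e-8):
    """Orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > eps:
            basis.append([x / norm for x in v])
    return basis

def ablate(W, directions):
    """Return (I - D D^T) W, where D is an orthonormal basis of the directions.

    W has shape (d_model, d_in); its outputs lose every listed direction.
    """
    rows, cols = len(W), len(W[0])
    W = [row[:] for row in W]
    for d in gram_schmidt(directions):
        proj = [sum(d[i] * W[i][j] for i in range(rows)) for j in range(cols)]
        W = [[W[i][j] - d[i] * proj[j] for j in range(cols)] for i in range(rows)]
    return W

# Synthetic "activations": two harmful clusters, one harmless cluster.
rng = random.Random(1)
def cluster(center, n=20, noise=0.1):
    return [[c + rng.gauss(0, noise) for c in center] for _ in range(n)]

harmful = cluster([2.0, 0.0, 0.0]) + cluster([0.0, 2.0, 0.0])
harmless = cluster([0.0, 0.0, 0.0])
harmless_centroid = [sum(col) / len(harmless) for col in zip(*harmless)]

som = train_som(harmful, n_neurons=2)
dirs = refusal_directions(som, harmless_centroid)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy weight matrix, d_model=3, d_in=2
W_ablated = ablate(W, dirs)
```

Because the projection is applied once to the weight matrix, inference afterwards runs on the edited weights with no extra computation, which matches the "no runtime overhead" point. Orthonormalizing first ensures that removing one direction does not reintroduce a component along another when the directions are not mutually orthogonal.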

🔮 Future Implications
AI analysis grounded in cited sources

SOM ablation will become standard in open-weight model uncensoring pipelines
Its edits are baked into the weights and keep KL divergence low, so they can be stacked with other model-editing methods to reduce refusals without a significant performance loss[1].
Multi-directional techniques will extend to other safety concepts beyond refusal
The SOMs' manifold clustering could in principle apply to any low-dimensional concept in LLM latent space, as it has been demonstrated for refusal[1][6].

⏳ Timeline

2024-12
NeurIPS paper establishes single-direction refusal mediation across 13 chat models
2025-07
ICML poster reveals multiple independent refusal directions and concept cones via gradient methods
2025-11
arXiv release of SOMs for Multi-Directional Refusal Suppression paper
2026-02
Heretic repo PR implements SOM refusal ablation with strong benchmarks on GPT-OSS-20B and Qwen3.5-27B