Uncensored Qwen3.5 122B INT4 Quant Released

💡 Fast uncensored 122B quant for local clusters, ideal for coding agents
⚡ 30-Second TL;DR
What Changed
Heretic: an INT4 AutoRound quant with uncensored (refusal-ablated) weights
Why It Matters
Enables high-performance uncensored local inference on consumer-grade clusters, lowering barriers for advanced agent and coding workflows.
What To Do Next
Download happypatrick/Qwen3.5-122B-A10B-heretic-int4-AutoRound from Hugging Face and test on your cluster.
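Before downloading, a quick back-of-the-envelope check helps confirm the quant will fit on your hardware. A minimal sketch, assuming 122B total parameters and an illustrative ~4.5 effective bits per parameter (4-bit weights plus scales and 16-bit embeddings/LM head); the exact footprint depends on the quantization config:

```python
# Rough memory-fit check for an INT4 quant of a 122B-parameter model.
# EFFECTIVE_BITS is an assumption, not a measured value.
TOTAL_PARAMS = 122e9
EFFECTIVE_BITS = 4.5

def checkpoint_size_gb(params: float, bits: float) -> float:
    """Approximate weight footprint in GB (decimal)."""
    return params * bits / 8 / 1e9

def fits(num_gpus: int, gpu_mem_gb: float, kv_headroom_gb: float = 10.0) -> bool:
    """Do the weights plus some KV-cache headroom fit across the cluster?"""
    needed = checkpoint_size_gb(TOTAL_PARAMS, EFFECTIVE_BITS) + kv_headroom_gb
    return num_gpus * gpu_mem_gb >= needed

print(checkpoint_size_gb(TOTAL_PARAMS, EFFECTIVE_BITS))  # roughly 68-69 GB of weights
print(fits(num_gpus=4, gpu_mem_gb=24.0))  # e.g., four 24 GB consumer GPUs
```

Under these assumptions the weights alone land just under 70 GB, which is why a small multi-GPU consumer cluster (or a single 128 GB node) is the target deployment.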
🧠 Deep Insight
Web-grounded analysis with 8 cited sources.
📌 Enhanced Key Takeaways
- Qwen3.5-122B-A10B is a multimodal Mixture-of-Experts (MoE) model with 122B total parameters but only ~10B active per token, featuring 256 experts per layer (8 active) across 48 layers and hybrid DeltaNet + standard attention.
- AutoRound INT4 quantization for Qwen3.5 uses a W4A16 scheme (4-bit weights, 16-bit activations), keeping the vision tower, LM head, normalization, and embeddings at 16-bit, with options like auto-round-best for optimal accuracy.
- Larger Qwen3 models such as 14B show greater quantization stability, with only ~1% MMLU drop under 4-bit GPTQ, compared with ~10% for the smaller 0.6B models.
- NVFP4 quantization of Qwen3.5-122B-A10B reduces size from 234GB (BF16) to 75.6GB (3.1x compression), fitting on a single DGX Spark with 128GB memory, using per-group scales and full MoE expert calibration.
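The quoted NVFP4 figures are easy to sanity-check. A quick sketch of the arithmetic, using only the sizes from the takeaway above (the effective bits/param figure is derived, not quoted):

```python
# Sanity-check the quoted NVFP4 compression figures.
bf16_gb = 234.0   # quoted BF16 checkpoint size
nvfp4_gb = 75.6   # quoted NVFP4 checkpoint size

ratio = bf16_gb / nvfp4_gb
effective_bits = 16 * nvfp4_gb / bf16_gb  # implied bits per parameter

print(round(ratio, 1))           # 3.1, matching the quoted 3.1x
print(round(effective_bits, 2))  # ~5.17 bits/param
```

The ~5.17 effective bits exceed a flat 4 because every group of 16 values carries an FP8 scale (~0.5 extra bits/param) and some layers stay unquantized.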
🛠️ Technical Deep Dive
- Model architecture: hidden dimension 3072, 48 layers, layout 16 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)), 64 linear-attention heads for V and 16 for Q/K, expert intermediate dimension 1024.
- Context length: 262,144 tokens natively, extensible to 1,010,000 tokens; supports text, image, and video understanding, plus think/no-think modes for reasoning.
- INT4 AutoRound: employs sign gradient descent to learn optimal weight rounding, with W4A16 by default (4-bit INT weights, 16-bit activations); compatible with vLLM serving; torch_compile speeds tuning by ~25%.
- NVFP4 details: 4-bit floating-point weights with FP8 per-group scales (group size 16), packed as uint8; calibrated on 512 ultrachat_200k samples at 2048 sequence length; ~1-3% benchmark degradation expected.
- Quantization performance: 4-bit methods show MMLU drops (e.g., Qwen-8B from 74.7 to 69.3), but larger models are more robust; INT4 AutoRound often outperforms NVFP4 in accuracy retention.
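The W4A16 idea in the bullets above can be sketched in a few lines: weights are rounded to 4-bit integers with one scale per small group, while activations stay 16-bit and the weights are dequantized back to 16-bit for the matmul. This is a minimal round-to-nearest sketch only; AutoRound's sign-gradient-descent rounding and the real packed formats are more involved:

```python
import numpy as np

def quantize_w4_groupwise(w: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit group quantization: int values in [-8, 7], one scale per group."""
    groups = w.reshape(-1, group_size).astype(np.float32)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Recover 16-bit weights for the matmul (the 'A16' side stays fp16)."""
    return (q.astype(np.float32) * scales).reshape(shape).astype(np.float16)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float16)  # toy weight matrix
x = rng.normal(size=(4, 64)).astype(np.float16)   # 16-bit activations

q, s = quantize_w4_groupwise(w)
w_hat = dequantize(q, s, w.shape)
y = x @ w_hat.T                  # matmul runs on dequantized 16-bit weights
err = float(np.abs(w.astype(np.float32) - w_hat.astype(np.float32)).max())
print(err < 0.5)                 # error is bounded by half a quantization step
```

Per-group scales keep the rounding error proportional to each group's local range, which is why group sizes like 16 or 32 retain far more accuracy than one scale per tensor.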
🔮 Future Implications
AI analysis grounded in cited sources.
⏳ Timeline
📚 Sources (8)
Factual claims are grounded in the sources below. Forward-looking analysis is AI-generated interpretation.
AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA →