
Uncensored Qwen3.5 122B INT4 Quant Released

🦙 Read original on Reddit r/LocalLLaMA
#quantization #uncensored #hardware-cluster #coding-agent #qwen3.5-122b-a10b-heretic-int4-autoround

💡 Fast uncensored 122B quant for local clusters, ideal for coding agents

⚡ 30-Second TL;DR

What Changed

Heretic: an INT4 AutoRound quant of Qwen3.5-122B-A10B with modified, uncensored weights

Why It Matters

Enables high-performance uncensored local inference on consumer-grade clusters, lowering barriers for advanced agent and coding workflows.

What To Do Next

Download happypatrick/Qwen3.5-122B-A10B-heretic-int4-AutoRound from Hugging Face and test on your cluster.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • Qwen3.5-122B-A10B is a multimodal Mixture-of-Experts (MoE) model with 122B total parameters but only ~10B active per token, featuring 256 experts per layer (8 active) across 48 layers and hybrid DeltaNet + standard attention.
  • AutoRound INT4 quantization for Qwen3.5 uses a W4A16 scheme (4-bit weights, 16-bit activations), keeping the vision tower, LM head, normalization, and embeddings at 16-bit, with options like auto-round-best for optimal accuracy.
  • Larger Qwen3 models such as 14B show greater quantization stability, with only a ~1% MMLU drop under 4-bit GPTQ, compared to ~10% for the smaller 0.6B model.
  • NVFP4 quantization of Qwen3.5-122B-A10B reduces size from 234GB (BF16) to 75.6GB (3.1x compression), fitting on a single DGX Spark with 128GB memory, using per-group scales and full MoE expert calibration.
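The headline parameter counts can be roughly sanity-checked from the architecture figures above (hidden dimension 3072, expert intermediate dimension 1024, 256 experts per MoE layer with 8 active, 48 layers). The sketch below counts only the expert feed-forward weights, assuming the common gated-MLP layout of three projection matrices per expert; attention, DeltaNet, and embedding parameters are ignored, so both totals land somewhat below the headline numbers.

```python
# Counting only MoE expert feed-forward weights (assumed gate/up/down
# projection layout), using the architecture figures quoted above.
# Attention, DeltaNet, and embedding parameters are ignored, so both
# totals land below the headline 122B-total / ~10B-active numbers.
hidden, expert_dim, n_layers = 3072, 1024, 48
experts_total, experts_active = 256, 8

per_expert = 3 * hidden * expert_dim                           # gate/up/down
total_expert_params = per_expert * experts_total * n_layers
active_expert_params = per_expert * experts_active * n_layers

print(f"expert params, total:  {total_expert_params / 1e9:.1f}B")   # ~116B
print(f"expert params, active: {active_expert_params / 1e9:.1f}B")  # ~3.6B
```

Expert weights alone account for roughly 116B of the quoted 122B total, which is why 4-bit weight-only compression captures most of the memory savings; the gap between ~3.6B active expert parameters and the quoted ~10B active is the attention/DeltaNet layers and shared components not counted here.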

๐Ÿ› ๏ธ Technical Deep Dive

  • Model architecture: hidden dimension 3072, 48 layers, layout 16 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)), 64 linear-attention heads for V and 16 for QK, expert intermediate dimension 1024.
  • Context length: 262,144 tokens natively, extensible to 1,010,000 tokens; supports text, image, and video understanding, plus think/no-think modes for reasoning.
  • INT4 AutoRound: employs sign gradient descent for optimal weight rounding, W4A16 by default (4-bit INT weights, 16-bit activations); compatible with vLLM serving; torch_compile speeds tuning by ~25%.
  • NVFP4 details: 4-bit floating-point weights with FP8 per-group scales (group size 16), uint8-packed; calibrated on 512 ultrachat_200k samples at 2048 sequence length; ~1-3% benchmark degradation expected.
  • Quantization performance: 4-bit methods show MMLU drops (e.g., Qwen-8B from 74.7 to 69.3), but larger models are more robust; INT4 AutoRound often outperforms NVFP4 in accuracy retention.
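To make the per-group NVFP4 scheme above concrete, here is a toy fake-quantizer: every 16 weights share one scale chosen so the group maximum maps to ±6 (the largest E2M1 FP4 magnitude, which is an assumption about the value set, since the post only says "4-bit floating point"), and each weight snaps to the nearest representable value. Real kernels pack two 4-bit codes per uint8 and store FP8 scales; this float-only sketch just illustrates the rounding behavior.

```python
import numpy as np

# Signed FP4 (E2M1) representable magnitudes -- an assumption about the
# NVFP4 value set for illustration purposes.
FP4_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4_MAGS[::-1], FP4_MAGS])

def quantize_fp4_groups(w, group_size=16):
    """Fake-quantize w with one shared scale per group of `group_size` weights."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 6.0  # group max -> +/-6
    scales = np.where(scales == 0.0, 1.0, scales)             # guard all-zero groups
    idx = np.abs((groups / scales)[..., None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scales).reshape(w.shape)              # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
w_hat = quantize_fp4_groups(w)
err = float(np.abs(w - w_hat).mean())
print(f"mean abs quantization error: {err:.4f}")
```

Because the scale is chosen per group of 16, one outlier weight only coarsens its own group's grid instead of the whole tensor, which is the main reason small-group FP4 schemes degrade benchmarks by only a few percent.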

🔮 Future Implications

AI analysis grounded in cited sources.

INT4-quantized Qwen3.5-122B will enable single-node deployments on 128GB hardware such as DGX Spark.
NVFP4 and AutoRound INT4 variants reduce model size from 234GB (BF16) to ~75GB, fitting unified-memory constraints while staying near-lossless at this model scale.
AutoRound will become standard for production INT4 quantization of reasoning MoEs.
It delivers high accuracy via optimized rounding and the W4A16 scheme, offers vLLM compatibility with minimal speed overhead, and outperforms alternatives like NVFP4 in accuracy benchmarks.
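The single-node claim can be checked with quick arithmetic on the sizes quoted in the sources; the runtime headroom figure below is an illustrative assumption, not a number from the post.

```python
# Figures quoted in the sources; HEADROOM_GB is an illustrative assumption
# covering KV cache, activations, and runtime overhead, not a measured value.
BF16_GB, QUANT_GB, NODE_GB = 234.0, 75.6, 128.0
HEADROOM_GB = 20.0

ratio = BF16_GB / QUANT_GB
fits = QUANT_GB + HEADROOM_GB <= NODE_GB
print(f"compression: {ratio:.1f}x; fits on {NODE_GB:.0f}GB node: {fits}")
```

Even with ~20GB reserved for runtime state, the ~75.6GB quantized checkpoint leaves comfortable margin inside a 128GB unified-memory budget, whereas the 234GB BF16 weights cannot fit on a single such node at all.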

โณ Timeline

2025-05
arXiv publishes empirical study on Qwen3 quantization robustness across bit-widths and scales.
2026-03
Qwen3.5-122B-A10B released by Alibaba as multimodal MoE with 122B params.
2026-03
Community quantizes Qwen3.5-122B-A10B to NVFP4 for DGX Spark single-node fit.
2026-03
Heretic/Uncensored INT4 AutoRound quant of Qwen3.5-122B released on Reddit r/LocalLLaMA.

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA
