
Uncensored Qwen3.5 122B INT4 Quant Released

🦙 Read original on Reddit r/LocalLLaMA
#quantization #uncensored #hardware-cluster #coding-agent #qwen3.5-122b-a10b-heretic-int4-autoround

💡 Fast uncensored 122B quant for local clusters, ideal for coding agents

⚡ 30-Second TL;DR

What Changed

Heretic: an INT4 AutoRound quant of Qwen3.5-122B-A10B with modified, uncensored weights

Why It Matters

Enables high-performance uncensored local inference on consumer-grade clusters, lowering barriers for advanced agent and coding workflows.

What To Do Next

Download happypatrick/Qwen3.5-122B-A10B-heretic-int4-AutoRound from Hugging Face and test on your cluster.

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 8 cited sources.

🔑 Enhanced Key Takeaways

  • Qwen3.5-122B-A10B is a multimodal Mixture-of-Experts (MoE) model with 122B total parameters but only ~10B active per token, featuring 256 experts per layer (8 active) across 48 layers and hybrid DeltaNet + standard attention.
  • AutoRound INT4 quantization for Qwen3.5 uses a W4A16 scheme (4-bit weights, 16-bit activations), keeping the vision tower, LM head, normalization, and embeddings at 16-bit, with options like auto-round-best for optimal accuracy.
  • Larger Qwen3 models such as 14B show greater quantization stability, with only a ~1% MMLU drop under 4-bit GPTQ, compared to ~10% for the smaller 0.6B model.
  • NVFP4 quantization of Qwen3.5-122B-A10B reduces size from 234GB (BF16) to 75.6GB (3.1x compression), fitting on a single DGX Spark with 128GB memory, using per-group scales and full MoE expert calibration.
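The headline parameter counts can be roughly sanity-checked from the architecture figures above (hidden dimension 3072, expert intermediate dimension 1024, 256 experts per MoE layer with 8 active, 48 layers). The sketch below counts only the expert feed-forward weights, assuming the common gated-MLP layout of three projection matrices per expert; attention, DeltaNet, and embedding parameters are ignored, so both totals land somewhat below the headline numbers.

```python
# Counting only MoE expert feed-forward weights (assumed gate/up/down
# projection layout), using the architecture figures quoted above.
# Attention, DeltaNet, and embedding parameters are ignored, so both
# totals land below the headline 122B-total / ~10B-active numbers.
hidden, expert_dim, n_layers = 3072, 1024, 48
experts_total, experts_active = 256, 8

per_expert = 3 * hidden * expert_dim                           # gate/up/down
total_expert_params = per_expert * experts_total * n_layers
active_expert_params = per_expert * experts_active * n_layers

print(f"expert params, total:  {total_expert_params / 1e9:.1f}B")   # ~116B
print(f"expert params, active: {active_expert_params / 1e9:.1f}B")  # ~3.6B
```

Expert weights alone account for roughly 116B of the quoted 122B total, which is why 4-bit weight-only compression captures most of the memory savings; the gap between ~3.6B active expert parameters and the quoted ~10B active is the attention/DeltaNet layers and shared components not counted here.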

๐Ÿ› ๏ธ Technical Deep Dive

  • Model architecture: hidden dimension 3072, 48 layers, layout 16 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE)), 64 linear-attention heads for V and 16 for QK, expert intermediate dimension 1024.
  • Context length: 262,144 tokens natively, extensible to 1,010,000 tokens; supports text, image, and video understanding, plus think/no-think modes for reasoning.
  • INT4 AutoRound: employs sign gradient descent for optimal weight rounding, W4A16 by default (4-bit INT weights, 16-bit activations); compatible with vLLM serving; torch_compile speeds tuning by ~25%.
  • NVFP4 details: 4-bit floating-point weights with FP8 per-group scales (group size 16), uint8-packed; calibrated on 512 ultrachat_200k samples at 2048 sequence length; ~1-3% benchmark degradation expected.
  • Quantization performance: 4-bit methods show MMLU drops (e.g., Qwen-8B from 74.7 to 69.3), but larger models are more robust; INT4 AutoRound often outperforms NVFP4 in accuracy retention.
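To make the per-group NVFP4 scheme above concrete, here is a toy fake-quantizer: every 16 weights share one scale chosen so the group maximum maps to ±6 (the largest E2M1 FP4 magnitude, which is an assumption about the value set, since the post only says "4-bit floating point"), and each weight snaps to the nearest representable value. Real kernels pack two 4-bit codes per uint8 and store FP8 scales; this float-only sketch just illustrates the rounding behavior.

```python
import numpy as np

# Signed FP4 (E2M1) representable magnitudes -- an assumption about the
# NVFP4 value set for illustration purposes.
FP4_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4_MAGS[::-1], FP4_MAGS])

def quantize_fp4_groups(w, group_size=16):
    """Fake-quantize w with one shared scale per group of `group_size` weights."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 6.0  # group max -> +/-6
    scales = np.where(scales == 0.0, 1.0, scales)             # guard all-zero groups
    idx = np.abs((groups / scales)[..., None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scales).reshape(w.shape)              # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
w_hat = quantize_fp4_groups(w)
err = float(np.abs(w - w_hat).mean())
print(f"mean abs quantization error: {err:.4f}")
```

Because the scale is chosen per group of 16, one outlier weight only coarsens its own group's grid instead of the whole tensor, which is the main reason small-group FP4 schemes degrade benchmarks by only a few percent.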

🔮 Future Implications

AI analysis grounded in cited sources.

INT4-quantized Qwen3.5-122B will enable single-node deployments on 128GB hardware such as DGX Spark.
NVFP4 and AutoRound INT4 variants reduce model size from 234GB (BF16) to ~75GB, fitting unified-memory constraints while staying near-lossless at this model scale.
AutoRound will become standard for production INT4 quantization of reasoning MoEs.
It delivers high accuracy via optimized rounding and the W4A16 scheme, offers vLLM compatibility with minimal speed overhead, and outperforms alternatives like NVFP4 in accuracy benchmarks.
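The single-node claim can be checked with quick arithmetic on the sizes quoted in the sources; the runtime headroom figure below is an illustrative assumption, not a number from the post.

```python
# Figures quoted in the sources; HEADROOM_GB is an illustrative assumption
# covering KV cache, activations, and runtime overhead, not a measured value.
BF16_GB, QUANT_GB, NODE_GB = 234.0, 75.6, 128.0
HEADROOM_GB = 20.0

ratio = BF16_GB / QUANT_GB
fits = QUANT_GB + HEADROOM_GB <= NODE_GB
print(f"compression: {ratio:.1f}x; fits on {NODE_GB:.0f}GB node: {fits}")
```

Even with ~20GB reserved for runtime state, the ~75.6GB quantized checkpoint leaves comfortable margin inside a 128GB unified-memory budget, whereas the 234GB BF16 weights cannot fit on a single such node at all.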

โณ Timeline

2025-05
arXiv publishes empirical study on Qwen3 quantization robustness across bit-widths and scales.
2026-03
Qwen3.5-122B-A10B released by Alibaba as multimodal MoE with 122B params.
2026-03
Community quantizes Qwen3.5-122B-A10B to NVFP4 for DGX Spark single-node fit.
2026-03
Heretic/Uncensored INT4 AutoRound quant of Qwen3.5-122B released on Reddit r/LocalLLaMA.

AI-curated news aggregator. All content rights belong to original publishers.
Original source: Reddit r/LocalLLaMA
