Can Chinese Silicon Replace Nvidia for AI Training?

Original source: SCMP Technology

Understand the hardware bottleneck facing Chinese AI and its implications for the global supply chain and model development.

30-Second TL;DR

What Changed

Chinese AI models are highly competitive in performance, but their developers still lack domestic hardware capable of pre-training them.

Why It Matters

The reliance on foreign silicon for pre-training poses a strategic risk for Chinese AI firms. Future breakthroughs in domestic hardware are essential for achieving true technological sovereignty.

What To Do Next

Benchmark Chinese AI accelerators against your own inference workloads to see where they can lower hardware costs.

Who should care: Researchers & Academics

Deep Insight

Web-grounded analysis with 32 cited sources.

Enhanced Key Takeaways

  • Chinese companies like Baidu (Kunlun chips) and Huawei (Ascend chips) are actively developing their own AI accelerators, with Baidu's Kunlun P800 chips powering a 30,000-chip cluster capable of training foundation models with hundreds of billions of parameters.
  • Despite significant domestic advancements, Chinese AI data center chips are still estimated by industry executives to lag behind leading international competitors by 5 to 10 years in areas such as efficiency, yields, and memory subsystems.
  • US export controls, initially implemented in October 2022 and subsequently expanded, have severely restricted China's access to high-end AI chips like Nvidia's A100 and H100, accelerating China's push for technological self-sufficiency.
  • Chinese firms are employing various strategies to work around hardware limitations, including optimizing software and algorithms to function effectively with less advanced domestic chips, as demonstrated by DeepSeek's ability to train high-performing models with lower-tier hardware.
  • SMIC, China's largest foundry, is making progress in advanced process technology (e.g., 7nm and an N+3 process aiming for 5nm-class performance) using older Deep Ultraviolet (DUV) lithography, though this approach is challenged by poor yields and high production costs.
Competitor Analysis

| Feature/Chip | Huawei Ascend 910/910B/910C | Baidu Kunlun P800 | Biren BR100/BR104 | Moore Threads Huashan/S5000 | Nvidia A100/H100 (Reference) |
| --- | --- | --- | --- | --- | --- |
| Peak FP16 TFLOPS | 910: 256-320, 910B: 336-400 (est.), 910C: ~60% of H100 inference | P800: ~345 | BR100: ~2000, BR104: Outperformed A100 on some benchmarks | S5000: Comparable to foreign GPUs, Huashan: 50% compute density increase over prior designs | A100: 312, H100: 1979 (FP16 Tensor) |
| Memory | 910: 32GB HBM2, 1200GB/s | Proprietary, optimized for large models | BR100: High-bandwidth memory | Huashan: 8 stacks HBM, bandwidth rivaling/exceeding Blackwell B200 | A100: 40/80GB HBM2, H100: 80GB HBM3 |
| Process Node | 910B/C: SMIC 7nm (N+2) | Kunlun II: 7nm | BR100: 7nm | S5000: Pinghu architecture (4th gen), Huashan: Huagang architecture (5th gen) | A100: TSMC 7nm, H100: TSMC 4N |
| Scaling | Atlas 950 SuperCluster: 520,000+ Ascend 950DT chips, 524 EFLOPS (FP8) | >90% efficiency in >5000 unit clusters | Increased training capacity with software optimization | Huashan: Scales beyond 100,000 GPUs | DGX SuperPOD, NVLink, NVSwitch |
| Software Ecosystem | MindSpore, CANN | PaddlePaddle | Integrated with Infini AI cloud platform | MUSA (China's answer to CUDA) | CUDA |
| Power (TDP) | 910: <310W | Optimized for energy efficiency | BR104: 300W | Huashan: 10x energy efficiency improvement | A100: 400W, H100: 700W |
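The peak figures in the comparison above can be turned into rough ratios with a few lines of arithmetic. The sketch below is only illustrative: it uses the numbers as quoted, takes the lower bound where a range is given (an assumption of this sketch), and the names `relative_peak` and `per_chip_pflops` are hypothetical helpers. Peak datasheet numbers say nothing about delivered training throughput, so the ratios are at best a first-pass sanity check.

```python
# Peak FP16 figures (TFLOPS) as quoted in the comparison above.
# Assumption: where a range is quoted (Ascend 910: 256-320), the lower bound is used.
PEAK_FP16_TFLOPS = {
    "Ascend 910": 256.0,
    "Kunlun P800": 345.0,
    "BR100": 2000.0,
    "A100": 312.0,
    "H100": 1979.0,
}


def relative_peak(chip: str, reference: str) -> float:
    """Ratio of one chip's quoted peak FP16 figure to a reference chip's."""
    return PEAK_FP16_TFLOPS[chip] / PEAK_FP16_TFLOPS[reference]


# Cluster-level figures quoted for the Atlas 950 SuperCluster (FP8).
ATLAS_950_FP8_EFLOPS = 524.0
ATLAS_950_CHIPS = 520_000  # quoted as "520,000+", so the result is an upper bound


def per_chip_pflops(cluster_eflops: float, chips: int) -> float:
    """Average per-chip throughput in PFLOPS implied by a cluster total."""
    return cluster_eflops * 1000.0 / chips  # 1 EFLOPS = 1000 PFLOPS


for chip in ("Ascend 910", "Kunlun P800", "BR100"):
    print(f"{chip}: {relative_peak(chip, 'A100'):.2f}x A100, "
          f"{relative_peak(chip, 'H100'):.2f}x H100 (peak FP16 only)")

print(f"Atlas 950 implied FP8 per chip: "
      f"{per_chip_pflops(ATLAS_950_FP8_EFLOPS, ATLAS_950_CHIPS):.2f} PFLOPS")
```

The per-chip value is simply the cluster total divided by the chip count, so it inherits whatever precision and utilization assumptions sit behind the quoted 524 EFLOPS figure.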

Technical Deep Dive

  • Huawei Ascend 910/910B/910C: The Ascend 910 contains 32 DaVinci cores, each with 4,096 units capable of FP16 MAC or INT8 MAC operations at 1.0 GHz, yielding a peak performance of 256 Tflop/s (FP16) or 512 TOPS (INT8). It features 84MB of on-chip SRAM and connects to four HBM2 channels delivering 1,228GB/s bandwidth to 32GB of memory. The architecture uses task-specific processing units primarily for neural networks and leverages lower precision for faster training iterations.
  • Baidu Kunlun P800: These chips feature a proprietary architecture with distinct communication and computation units designed for efficient parallel processing. They support advanced strategies like data, tensor, and pipeline parallelism, and incorporate communication-computation fusion and other optimizations to reduce latency by up to 40%. The Kunlun P800 chips are tightly coupled with Baidu's PaddlePaddle framework.
  • Biren BR100: This GPGPU features 77 billion transistors and is designed to be competitive with international benchmarks for AI training and inference. The BR104, a variant, demonstrated lower power consumption (300W TDP) compared to Nvidia's A100 and H100.
  • Moore Threads Huashan (based on Huagang architecture): This AI accelerator utilizes a chiplet-based design with two compute dies and eight stacks of high-bandwidth memory. It incorporates a new generation instruction set and a redesigned asynchronous programming model. The accompanying MUSA software stack is positioned as a domestic alternative to CUDA.
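The Ascend 910 peak figure quoted above can be checked from its core count, MAC units per core, and clock speed. The sketch below is a back-of-the-envelope calculation, not a vendor formula: it assumes one multiply-accumulate counts as two operations, and it treats the quoted 256 Tflop/s as a power-of-two rounding of the result, which is only one possible reading. The ratio of peak compute to memory bandwidth is included because it indicates how many operations a kernel needs per byte fetched before compute, rather than memory, becomes the limit.

```python
# Figures for the Ascend 910 as quoted in the text above.
CORES = 32                          # DaVinci cores
FP16_MACS_PER_CORE = 4096           # FP16 MAC units per core
CLOCK_HZ = 1.0e9                    # 1.0 GHz
HBM_BANDWIDTH_BYTES_PER_S = 1228e9  # 1,228 GB/s

# Assumption (not stated in the source): one MAC = 2 operations (multiply + add).
OPS_PER_MAC = 2


def peak_ops_per_second(cores: int, macs_per_core: int, clock_hz: float) -> float:
    """Theoretical peak: every MAC unit issues one MAC per clock cycle."""
    return cores * macs_per_core * clock_hz * OPS_PER_MAC


fp16_peak = peak_ops_per_second(CORES, FP16_MACS_PER_CORE, CLOCK_HZ)

# Decimal convention: 1 TFLOPS = 1e12 operations per second.
fp16_tflops_decimal = fp16_peak / 1e12

# Power-of-two convention: GFLOPS divided by 1024. Under this reading the
# result matches the 256 Tflop/s quoted above.
fp16_tflops_binary = fp16_peak / 1e9 / 1024

# Operations available per byte of HBM traffic at peak.
ops_per_byte = fp16_peak / HBM_BANDWIDTH_BYTES_PER_S

print(f"Peak FP16 (decimal): {fp16_tflops_decimal:.1f} TFLOPS")
print(f"Peak FP16 (binary):  {fp16_tflops_binary:.1f} TFLOPS")
print(f"Compute-to-bandwidth ratio: {ops_per_byte:.0f} ops/byte")
```

Workloads with far fewer operations per byte than this ratio would be limited by the 1,228GB/s memory system rather than by the DaVinci cores, which is one reason memory subsystems feature in the comparison with Nvidia parts.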

Future Implications

AI analysis grounded in cited sources.

China will achieve greater self-sufficiency in AI model pre-training within the next 3-5 years.
Ongoing significant investments and advancements in domestic AI chip design and manufacturing, particularly for the computationally intensive pre-training phase, are driven by national strategic goals and persistent export controls.
The global AI hardware market will experience increased fragmentation and the emergence of distinct regional ecosystems.
US export controls are compelling China to develop a complete domestic AI supply chain and software ecosystem (e.g., MUSA, PaddlePaddle), leading to vertically integrated solutions that diverge from global standards.
AI model development strategies will increasingly emphasize algorithmic and software optimization for less advanced hardware.
Chinese firms like DeepSeek are demonstrating the capability to train high-performing AI models using lower-tier chips through optimized algorithms and system architectures, potentially reducing the absolute reliance on cutting-edge hardware.

Timeline

2019
Huawei releases first-generation Ascend 910 AI chip.
2021
Baidu releases second-generation Kunlun II AI chip using a 7nm process.
2022-08
Biren Technology releases its BR100 GPGPU.
2022-10
US implements sweeping export controls on advanced computing and semiconductor manufacturing to China.
2025-04
Baidu launches a 30,000-chip training cluster powered by its third-generation P800 Kunlun chips.
2026-06
A Huawei-led team successfully completes full-parameter post-training of DeepSeek's 1.6-trillion-parameter model using 1,000 Ascend 910C chips.