24-Person Team Launches 17K Tokens/Sec Chip

⚛️ Read original on 量子位 (QbitAI)

#inference-speed #cost-reduction #chip-startup #ex-amd-ai-inference-chip

💡 17k tokens/sec per user at a reported 1/10th the power of Nvidia hardware and ~20x lower build cost: a potential inference shake-up for AI devs.

⚡ 30-Second TL;DR

What Changed

A 24-person startup led by former AMD executives launched a chip that claims 17,000 tokens/sec per user.

Why It Matters

If the claims hold, this low-cost, high-speed chip could challenge Nvidia's dominance in AI hardware and make large-scale LLM deployments cheaper for startups and enterprises.

What To Do Next

Benchmark this chip against Nvidia H100 for your LLM inference workloads to assess cost savings.
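
A minimal sketch of what such a benchmark could look like. The `generate` callable is a hypothetical stand-in for whatever client you use to call the model on each piece of hardware, and the hourly cost is a placeholder you would supply yourself; neither comes from the sources cited here.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_tokens_per_sec(
    generate: Callable[[str], Sequence[object]],
    prompt: str,
    runs: int = 3,
) -> float:
    """Median tokens/sec over several calls to a user-supplied generate()."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)  # hypothetical: returns the generated tokens
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
        rates.append(len(tokens) / elapsed)
    return statistics.median(rates)


def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_sec: float) -> float:
    """Dollar cost of one million generated tokens at a given hourly price."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    def fake_generate(prompt: str) -> list:
        # Stand-in for a real inference call; replace with your own client.
        time.sleep(0.01)
        return list(range(100))

    rate = measure_tokens_per_sec(fake_generate, "Hello")
    print(f"{rate:.0f} tokens/sec")
    # The hourly cost below is a placeholder, not a quoted price.
    print(f"${cost_per_million_tokens(2.0, rate):.4f} per million tokens")
```

Running the same prompt set through each backend and comparing the two outputs gives a like-for-like view of speed and cost for your own workload.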

Who should care: Developers & AI Engineers

🧠 Deep Insight

Web-grounded analysis with 6 cited sources.

🔑 Enhanced Key Takeaways

•Taalas, a 24-person startup led by former AMD executives, unveiled the HC1 chip, which achieves 17,000 tokens per second per user on Llama 3.1 8B inference[1][2][3]
•HC1 delivers ~73x higher throughput than Nvidia H200 and several times that of Cerebras (~2,000 tokens/sec) and Groq (~600 tokens/sec) on the same model[1][2][3]
•The chip draws about 1/10th the power of Nvidia equivalents and costs ~20x less to build, in an air-cooled PCIe form factor[1][2][6]
•Taalas raised $169 million in funding to develop model-specific AI chips challenging Nvidia dominance[1][5]
•HC1 hardwires the entire model including weights onto the chip using mask ROM recall fabric, eliminating HBM and memory-compute bottlenecks[2][5]

📊 Competitor Analysis

Feature	Taalas HC1	Nvidia H200	Cerebras	Groq
Tokens/sec (Llama 3.1 8B, per user)	17,000 [1][2][3]	~230 (17,000 ÷ 73, derived) [1]	~2,000 [2]	~600 [2]
Power Consumption	1/10th of Nvidia [1][5]	Baseline [1]	Not specified [2]	Not specified [2]
Cost to Build	20x less than SOTA [6]	Baseline [6]	Not specified	Not specified
Form Factor	PCIe card, ~250W air-cooled [2]	GPU with HBM [5]	Not specified	Not specified
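
The H200 entry in the table is derived from the quoted ~73x multiple rather than measured directly. A short sketch of that arithmetic, along with the implied speedups over Cerebras and Groq, using only the figures quoted above (the function names are illustrative):

```python
# Per-user throughput on Llama 3.1 8B, as quoted in the table above.
TOKENS_PER_SEC = {
    "Taalas HC1": 17_000,
    "Cerebras": 2_000,
    "Groq": 600,
}
HC1_VS_H200 = 73  # quoted "~73x" multiple; the H200 number is derived from it


def implied_baseline(fast_tps: float, multiple: float) -> float:
    """Throughput of the slower system implied by a quoted speedup multiple."""
    return fast_tps / multiple


def speedup(fast_tps: float, slow_tps: float) -> float:
    """How many times faster the first system is than the second."""
    return fast_tps / slow_tps


if __name__ == "__main__":
    hc1 = TOKENS_PER_SEC["Taalas HC1"]
    print(f"Implied H200: {implied_baseline(hc1, HC1_VS_H200):.0f} tokens/sec")
    for name in ("Cerebras", "Groq"):
        print(f"HC1 vs {name}: {speedup(hc1, TOKENS_PER_SEC[name]):.1f}x")
```

On these figures the implied H200 rate is about 233 tokens/sec, and the HC1 lead is roughly 8.5x over Cerebras and 28x over Groq.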

🛠️ Technical Deep Dive

Process/Fab: TSMC N6 (6nm)[2]
Die size: 815 mm²[2]
Power: ~250W per card; 10-card server ~2.5kW, air-cooled[2]
Architecture: hardwires the entire model on-chip, with weights held in a mask ROM recall fabric and SRAM for the KV cache and fine-tuned weights; uses a single transistor per 4-bit module for matrix multiplications[2][5]
Memory: Eliminates HBM by merging storage and computation, no high-speed I/O or advanced packaging needed[2][5]
Form factor: PCIe card optimized for Llama 3.1 8B[2][5]
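
The power and storage figures above can be sanity-checked with a little arithmetic. This is a rough sketch under stated assumptions: "8B" is taken as exactly 8 billion parameters, every weight is assumed to be stored at 4 bits, and the energy-per-token estimate assumes a card draws its full ~250 W while serving a single 17,000 tokens/sec stream. None of these assumptions is confirmed by the cited sources.

```python
CARD_POWER_W = 250       # "~250W per card"
CARDS_PER_SERVER = 10    # "10-card server"
TOKENS_PER_SEC = 17_000  # per-user figure quoted for Llama 3.1 8B
PARAMS = 8e9             # assumption: "8B" read as 8 billion parameters
BITS_PER_WEIGHT = 4      # assumption: all weights stored at 4 bits


def server_power_kw(cards: int, watts_per_card: float) -> float:
    """Total server power in kilowatts, counting only the cards."""
    return cards * watts_per_card / 1000


def joules_per_token(watts: float, tokens_per_sec: float) -> float:
    """Energy per generated token if one stream uses the full card power."""
    return watts / tokens_per_sec


def weight_storage_gb(params: float, bits_per_weight: int) -> float:
    """Storage needed for the weights, in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    print(f"Server power: {server_power_kw(CARDS_PER_SERVER, CARD_POWER_W)} kW")
    print(f"Energy: {joules_per_token(CARD_POWER_W, TOKENS_PER_SEC) * 1000:.1f} mJ/token")
    print(f"Weights: {weight_storage_gb(PARAMS, BITS_PER_WEIGHT)} GB")
```

The server figure reproduces the quoted ~2.5 kW; under the assumptions above, the weights would occupy about 4 GB and each token would cost on the order of 15 mJ.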

🔮 Future Implications

AI analysis grounded in cited sources

The HC1 approach could make frontier models with agentic behavior interactive, cutting task times from hours to minutes at lower cost. The same headroom opens up new use cases such as real-time reasoning with larger token budgets, where the extra throughput buys higher accuracy through multiple sampling or longer reasoning traces[2]. Model-specific chips challenge Nvidia by gaining efficiency through specialization, which could accelerate ubiquitous AI deployment[5][6].
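
To make the latency claim concrete, here is a back-of-the-envelope sketch. The throughput figures are the ones quoted earlier (17,000 tokens/sec for HC1, and an H200 baseline implied by the ~73x multiple); the trace length, time budget, and sample length are hypothetical values chosen only for illustration.

```python
HC1_TPS = 17_000        # quoted per-user throughput
H200_TPS = 17_000 / 73  # baseline implied by the quoted ~73x multiple

TRACE_TOKENS = 1_000_000  # hypothetical: total tokens in a long agentic task
BUDGET_SECONDS = 60       # hypothetical: interactive time budget
SAMPLE_TOKENS = 2_000     # hypothetical: tokens per sampled answer


def generation_seconds(num_tokens: float, tokens_per_sec: float) -> float:
    """Time to generate num_tokens sequentially at a fixed throughput."""
    return num_tokens / tokens_per_sec


def samples_in_budget(budget_s: float, tokens_per_sec: float, sample_tokens: int) -> int:
    """Whole samples of a given length that fit in a time budget."""
    return int(budget_s * tokens_per_sec / sample_tokens)


if __name__ == "__main__":
    for name, tps in (("H200 (implied)", H200_TPS), ("HC1", HC1_TPS)):
        minutes = generation_seconds(TRACE_TOKENS, tps) / 60
        n = samples_in_budget(BUDGET_SECONDS, tps, SAMPLE_TOKENS)
        print(f"{name}: {minutes:.1f} min per trace, {n} samples per minute")
```

With these illustrative numbers, a million-token trace drops from over an hour to about a minute, and a one-minute budget goes from a handful of samples to several hundred.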