Taalas HC1: Absurdly Fast, Per-User Inference at 17,000 tokens/second

Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation, Cloud Computing & IT Infrastructure · Depth: Advanced, quick

Summary

Taalas has introduced HC1, a "model-on-silicon" chip that embeds a large language model directly in hardware for very fast per-user inference. HC1 runs Meta's Llama 3.1 8B at roughly 17,000 tokens/second per user, compared with about 2,000 tokens/second on Cerebras and about 600 tokens/second on Groq for the same model. The chip is built on TSMC's N6 (6 nm) process with an 815 mm² die and draws around 250 W, so a 10-card server (~2.5 kW) can be air-cooled. The design merges storage and computation, eliminating the traditional bottleneck between memory and compute. The first version uses aggressive 3- to 6-bit quantization, which reduces output quality; future iterations aim for better fidelity. Although the base weights are hardwired, the chip supports fine-tuning through LoRA adapters. Taalas reports a cost of ~$0.0075 per 1M tokens for Llama 3.1 8B, making it roughly 13x cheaper and 8x faster than Cerebras's offering.
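The ratios in the summary follow from simple arithmetic on the quoted figures. A minimal sketch of that arithmetic is below; the constants are the approximate numbers reported above, and the function names are illustrative only, not part of any Taalas tooling.

```python
# Approximate figures quoted in the summary (tokens/second per user, Llama 3.1 8B).
TOKENS_PER_SEC = {"Taalas HC1": 17_000, "Cerebras": 2_000, "Groq": 600}
CARD_POWER_W = 250             # reported power draw per HC1 card
CARDS_PER_SERVER = 10          # reported air-cooled server configuration
COST_PER_M_TOKENS_USD = 0.0075 # cost reported by Taalas


def speedup(baseline: str, target: str = "Taalas HC1") -> float:
    """Per-user throughput of `target` relative to `baseline`."""
    return TOKENS_PER_SEC[target] / TOKENS_PER_SEC[baseline]


def generation_time_s(num_tokens: int, provider: str) -> float:
    """Seconds to stream `num_tokens` to one user at the quoted rate."""
    return num_tokens / TOKENS_PER_SEC[provider]


def server_power_kw() -> float:
    """Total card power for one server, in kilowatts."""
    return CARD_POWER_W * CARDS_PER_SERVER / 1000


def cost_usd(num_tokens: int) -> float:
    """Cost of generating `num_tokens` at the reported price."""
    return num_tokens / 1_000_000 * COST_PER_M_TOKENS_USD


if __name__ == "__main__":
    for name in ("Cerebras", "Groq"):
        print(f"HC1 vs {name}: {speedup(name):.1f}x")
    print(f"Server power: {server_power_kw()} kW")
    print(f"1,000-token reply on HC1: {generation_time_s(1000, 'Taalas HC1') * 1000:.0f} ms")
```

The Cerebras ratio comes out at 8.5x, consistent with the "8x faster" claim, and ten 250 W cards give the quoted 2.5 kW. These are headline rates only; real latency also depends on prompt processing and network overhead, which the summary does not quantify.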

Key takeaway

For CTOs and VPs of Engineering evaluating LLM inference options, Taalas's HC1 is worth attention for applications that need very high per-user throughput at low cost. Delivering ~17,000 tokens/second per user for Llama 3.1 8B at ~$0.0075 per 1M tokens could change both the economics and the user experience of interactive AI. Consider evaluating its API for use cases that need fast, inexpensive reasoning, keeping in mind that the first version trades some quality for speed through aggressive quantization, while future iterations promise better model quality and support for larger models.

Key insights

Hardwiring LLMs into silicon delivers extreme per-user inference speed and cost efficiency by merging storage and computation.

Method

Taalas's HC1 chip hardwires an entire LLM, weights included, into silicon on TSMC's N6 process. The design gives up most programmability but retains SRAM for the KV cache and for fine-tuned (LoRA adapter) weights.
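To illustrate how fine-tuning can coexist with fixed base weights, here is a minimal sketch of the standard LoRA formulation, y = Wx + (alpha / r) · B(Ax), where W is frozen and only the small matrices A and B change. This is the generic technique, not Taalas's implementation; `FROZEN_W`, `lora_forward`, and the toy values are hypothetical.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]
Vector = Sequence[float]

# Stand-in for hardwired base weights: an immutable 2x2 matrix (toy values).
FROZEN_W: Matrix = ((1.0, 2.0), (3.0, 4.0))


def matvec(m: Matrix, v: Vector) -> list[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: Vector, lora_a: Matrix, lora_b: Matrix, alpha: float) -> list[float]:
    """Frozen base projection plus a scaled low-rank update.

    lora_a has shape (r, d_in) and lora_b has shape (d_out, r); only these
    two small matrices would need to live in writable memory.
    """
    rank = len(lora_a)
    base = matvec(FROZEN_W, x)                  # fixed path, never modified
    delta = matvec(lora_b, matvec(lora_a, x))   # low-rank correction B(Ax)
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


if __name__ == "__main__":
    x = (1.0, 1.0)
    a = ((1.0, 0.0),)        # rank-1 adapter, shape (1, 2)
    b = ((0.5,), (-0.5,))    # shape (2, 1)
    print(lora_forward(x, a, b, alpha=1.0))
```

Swapping in a different `lora_a` and `lora_b` changes the model's behavior without touching `FROZEN_W`, which is why an adapter held in SRAM can customize a model whose base weights are fixed in silicon.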

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.