Taalas Specializes to Extremes for Extraordinary Token Speed

2026-02-19 · Source: Big Data & AI News - EE Times · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation, AI Hardware Engineering · Depth: Advanced, medium

Summary

AI chip startup Taalas, co-founded by Ljubisa Bajic, has developed the HC1 chip, which achieves over 16,000 tokens per second per user on Llama3.1-8B, significantly outperforming competitors such as Nvidia, Cerebras, and Groq. It reaches this speed by hardwiring the entire model, including its weights, onto the chip, sacrificing almost all programmability. The HC1 is built on TSMC N6 with an 815 mm² die, consumes around 250 W, and fits the 8B model on a single chip, so it can be deployed in standard air-cooled racks. Taalas uses a technique similar to the structured ASICs of the early 2000s: only two masks change to customize a chip for a specific model, which allows rapid, cost-effective tape-outs. The company aims for a two-month turnaround on custom model-specific chips and projects a favorable total cost of ownership even if chips are replaced annually.
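As a rough sanity check on the figures above, the sketch below derives the average time per token and an upper bound on energy per million tokens for a single user stream. It assumes exactly 16,000 tokens per second and 250 W (the summary says "over" and "around"), and it attributes the whole chip's power to one stream; if the chip serves several users at once, energy per token would be lower. The function names are illustrative, not from the article.

```python
# Back-of-envelope figures; both inputs are rounded assumptions from the summary.
TOKENS_PER_SECOND_PER_USER = 16_000   # "over 16,000" in the summary
CHIP_POWER_WATTS = 250                # "around 250 W" in the summary


def per_token_latency_us(tokens_per_second: float) -> float:
    """Average time per generated token, in microseconds."""
    return 1e6 / tokens_per_second


def energy_per_million_tokens_kwh(power_watts: float, tokens_per_second: float) -> float:
    """Energy for one million tokens if one stream uses the whole chip (upper bound)."""
    seconds = 1_000_000 / tokens_per_second
    joules = power_watts * seconds
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


if __name__ == "__main__":
    latency = per_token_latency_us(TOKENS_PER_SECOND_PER_USER)
    energy = energy_per_million_tokens_kwh(CHIP_POWER_WATTS, TOKENS_PER_SECOND_PER_USER)
    print(f"{latency:.1f} microseconds per token")
    print(f"{energy:.5f} kWh per million tokens (single-stream upper bound)")
```

Under these assumptions a token takes about 62.5 microseconds, and a million tokens take 62.5 seconds of chip time.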

Key takeaway

For MLOps engineers optimizing large language model inference, Taalas' hardwired chip approach presents a compelling alternative to general-purpose GPUs. If your application relies on a stable, high-volume LLM such as Llama3.1-8B, model-specific silicon could cut inference costs to 0.75 cents per million tokens and raise throughput to over 16,000 tokens per second per user, at the price of annual chip refreshes. Weigh the long-term stability of your chosen model against the benefits of extreme hardware optimization.

Key insights

Extreme specialization in AI chips by hardwiring models can yield superior performance and cost efficiency.

Principles

Trade flexibility for performance and economics.
Simplify software by hardwiring models into hardware.

Method

Taalas customizes chips by altering only two masks, which define both the model weights and the dataflow, enabling rapid adaptation to specific LLMs. The process is highly automated, and the company is targeting a turnaround time of approximately two months.

In practice

Consider model-specific silicon for stable, high-volume LLM inference.
Evaluate TCO for specialized chips against GPU refresh cycles.
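One way to structure the TCO comparison suggested above is to amortize hardware cost over the refresh cycle, add electricity, and divide by tokens served. The sketch below is a minimal model under stated assumptions: it ignores cooling, networking, hosting, and engineering costs, it assumes the device draws its rated power all year regardless of utilization, and every name and number in it is a hypothetical placeholder rather than a figure from Taalas or any other vendor.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class InferenceOption:
    name: str
    hardware_cost_usd: float   # purchase price per device (placeholder input)
    refresh_years: float       # how often the device is replaced
    power_watts: float         # average draw, assumed constant all year
    tokens_per_second: float   # sustained aggregate throughput per device
    utilization: float = 1.0   # fraction of the year spent serving tokens


def cost_per_million_tokens(opt: InferenceOption, usd_per_kwh: float) -> float:
    """Amortized hardware plus electricity cost, in USD per million tokens."""
    tokens_per_year = opt.tokens_per_second * 3600 * HOURS_PER_YEAR * opt.utilization
    hardware_per_year = opt.hardware_cost_usd / opt.refresh_years
    energy_per_year = opt.power_watts / 1000 * HOURS_PER_YEAR * usd_per_kwh
    return (hardware_per_year + energy_per_year) / tokens_per_year * 1e6


if __name__ == "__main__":
    # Placeholder inputs for illustration only; substitute your own quotes.
    options = [
        InferenceOption("model-specific chip", 10_000, 1.0, 250, 16_000),
        InferenceOption("general-purpose accelerator", 30_000, 4.0, 700, 2_000),
    ]
    for opt in options:
        cost = cost_per_million_tokens(opt, usd_per_kwh=0.10)
        print(f"{opt.name}: ${cost:.4f} per million tokens")
```

The point of the model is the shape of the trade-off: a shorter refresh cycle raises the yearly hardware charge, and higher sustained throughput spreads that charge over more tokens. Real decisions would need measured aggregate throughput and actual pricing.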

Topics

AI Chip Architecture
Large Language Models
ASIC Design
Model Inference
Hardware Acceleration

Best for: MLOps Engineer, Machine Learning Engineer, NLP Engineer, AI Engineer, AI Architect, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Big Data & AI News - EE Times.