A startup claims it broke through a bottleneck that’s holding back LLMs
Summary
AI startup Subquadratic has emerged from stealth, claiming to have resolved a decade-long mathematical bottleneck in large language models with SubQ, a new LLM architecture. SubQ reportedly uses sparse attention with dynamic selection, which the company says makes it faster, cheaper, and more energy-efficient than current models. The company asserts that SubQ can process up to 12 times more text, with a 12-million-token context window, and that it matches top models from Google DeepMind, OpenAI, and Anthropic on tasks like coding. An independent evaluation by Appen supports these claims, showing SubQ to be 56 times faster than FlashAttention and scoring 89.7% on LiveCodeBench. It also achieved 98% on needle-in-a-haystack tests with 6-million- and 12-million-token contexts. Some skepticism remains because the model has limited availability and reuses Qwen weights, but Subquadratic aims to redefine how LLMs are built.
Key takeaway
For AI Scientists and Machine Learning Engineers evaluating LLM architectures for long-context or cost-sensitive applications, Subquadratic's SubQ is a potentially significant development. Its dynamic sparse attention mechanism is claimed to make processing very long inputs significantly faster and cheaper, which could move the field beyond dense-attention transformer models. You should closely monitor its wider availability and further independent validation, particularly for use cases demanding extensive context windows or substantial operational cost reductions.
Key insights
Subquadratic's SubQ LLM uses dynamic sparse attention to overcome the quadratic scaling bottleneck of dense attention, offering significant efficiency gains.
Principles
- Dense attention compares every token with every other token, so its compute cost grows quadratically with text length.
- Sparse attention can drastically reduce computation by scoring only a selected subset of token pairs.
- Dynamic selection of token relationships is key for effective sparse attention.
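To make the scaling argument concrete, the following Python sketch counts query-key score computations under dense attention and under a fixed per-token budget. The budget of 4,096 keys per query is a hypothetical value chosen for illustration; the source does not state how many token relationships SubQ selects, and the function names are invented for this example.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Score computations when every token attends to every token: n * n."""
    return n_tokens * n_tokens


def sparse_attention_pairs(n_tokens: int, keys_per_query: int) -> int:
    """Score computations when each token attends to at most a fixed budget of keys."""
    # A query cannot attend to more keys than there are tokens.
    return n_tokens * min(keys_per_query, n_tokens)


# keys_per_query=4_096 is an assumed budget, not a figure from the article.
for n in (1_000, 1_000_000, 12_000_000):
    dense = dense_attention_pairs(n)
    sparse = sparse_attention_pairs(n, keys_per_query=4_096)
    print(f"{n:>12,} tokens: dense={dense:.2e} sparse={sparse:.2e} ratio={dense / sparse:,.0f}x")
```

Under this assumption the dense count grows with the square of the input length while the sparse count grows linearly, so the gap widens as the context gets longer. The sketch counts only attention scores and ignores the cost of deciding which keys to keep.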
Method
SubQ replaces the transformer's dense attention with a dynamically selected sparse attention mechanism. Instead of relating every token to every other token, it chooses the relevant token relationships on the fly, avoiding the quadratic growth in computation that dense-attention LLMs incur.
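The idea of content-dependent selection can be illustrated with a toy top-k attention in plain Python. This is a generic sketch, not Subquadratic's method: the source does not describe how SubQ selects token relationships, and `topk_sparse_attention` and its parameters are hypothetical names. Note also that this toy version still scores every key before choosing the top k, so it shows the selection pattern only; a genuinely subquadratic system would need a cheaper way to pick candidates.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(queries, keys, values, k):
    """For each query, attend only to the k keys with the highest scaled dot-product score."""
    outputs = []
    for q in queries:
        # Toy selector: scores all keys (still O(n) per query, O(n^2) overall).
        scores = [dot(q, key) / math.sqrt(len(q)) for key in keys]
        # Dynamic selection: the chosen keys depend on the query's content.
        selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax and weighted sum run over the selected keys only.
        weights = softmax([scores[i] for i in selected])
        dim = len(values[0])
        outputs.append(
            [sum(w * values[i][d] for w, i in zip(weights, selected)) for d in range(dim)]
        )
    return outputs


# Hypothetical toy inputs: one query, three keys, one-dimensional values.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
print(topk_sparse_attention(queries, keys, values, k=1))
print(topk_sparse_attention(queries, keys, values, k=3))
```

With `k` equal to the number of keys the function reduces to ordinary dense attention, which makes it easy to compare the two on small inputs.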
In practice
- Process hundreds of documents or entire code bases efficiently.
- Achieve frontier-level performance on competitive coding problems.
- Retrieve specific information from 12-million-token contexts.
Topics
- Large Language Models
- Sparse Attention
- Transformer Architecture
- Computational Efficiency
- Context Window
- AI Benchmarking
- SubQ
Best for: AI Engineer, NLP Engineer, CTO, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by MIT Technology Review.