[AINews] Cognition raises $1B in $26B Series D
Summary
Cognition, an independent AI agent lab, recently raised over $1B in a Series D round at a $26B valuation. Its current run-rate revenue is $492M, and it projects over $1B ARR by end of year. The valuation is a 2.5x increase in eight months. Concurrently, the wider AI landscape is advancing quickly. Inference optimization is shifting toward architectural solutions such as speculative decoding and KV-cache management, which make API price reductions sustainable. Agent development emphasizes "model-harness-memory fit", with tools like LangChain's Deep Agents and the emergence of continual learning platforms. New benchmarks, including DeepSWE and ITBench-AA, focus on complex, real-world workflows. Training research introduces innovations like Sakana AI's DiffusionBlocks. Key model releases include ESMFold2 for protein prediction, Gemini Embedding 2 for multimodal embeddings, and Surya OCR 2. Developer platforms are integrating coding agents into full product stacks with advanced enterprise controls, as seen in the OpenAI and Claude Code updates. Local AI is also advancing, with low-bit models, optimized inference engines, and new Qwen 3.5/3.6 local model releases.
Key takeaway
AI scientists and machine learning engineers optimizing model deployment should prioritize architectural inference improvements, such as KV-cache management and attention design, to achieve sustainable cost reductions. Evaluate agentic systems on "model-harness-memory fit" rather than model quality alone, and integrate continual learning platforms for post-deployment model adaptation. When deploying local LLMs, consider low-bit quantization and optimized inference engines for consumer hardware. Together, these shifts call for a holistic approach to system design that looks beyond isolated model performance.
Key insights
AI development is rapidly maturing, driven by architectural inference optimizations, agentic system integration, and specialized model releases.
Principles
- Inference cost cuts stem from attention design and cache hierarchy.
- Task-harness fit is crucial for agent performance.
- Post-deployment learning is becoming standard infra.
Method
DeepSeek V4-Pro uses hybrid attention (Compressed Sparse, Heavily Compressed) to reduce KV cache size to ~10% and single-token inference FLOPs to 27% for 1M-token contexts.
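To give a sense of scale for the ~10% KV cache figure, here is a minimal back-of-the-envelope sketch. It uses the standard dense-attention estimate (one key and one value vector per layer, per KV head, per token). The model shape below is hypothetical and is not DeepSeek V4-Pro's actual configuration, and applying the ~10% figure as a flat scale factor is an assumption, since the summary does not name the baseline it is measured against.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Dense-attention KV cache estimate.

    The factor of 2 covers keys and values; bytes_per_elem=2 assumes
    16-bit storage (an assumption, not a figure from the article).
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, NOT DeepSeek V4-Pro's real configuration.
baseline = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128,
                          seq_len=1_000_000)

# Assumption: treat the summary's ~10% figure as a flat scale factor.
compressed = baseline * 10 // 100

print(f"baseline KV cache:   {baseline / 1e9:.1f} GB")
print(f"at ~10% of baseline: {compressed / 1e9:.1f} GB")
```

Under these assumed dimensions, a 1M-token context needs roughly 246 GB of KV cache per sequence, and about 25 GB at 10% of that. The point is the order of magnitude: at long context, cache size rather than weight size can decide how many sequences fit on an accelerator.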
In practice
- Use MTP/speculative decoding for faster local LLM inference.
- Periodically summarize and reset agent sessions for long tasks.
- Explore NVFP4 GGUF builds for NVIDIA low-precision inference.
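The draft-and-verify idea behind speculative decoding can be sketched with toy models. This is a minimal greedy variant, not any particular engine's implementation: `target_next` and `draft_next` are hypothetical stand-ins for a large and a small model, and production systems typically verify all drafted positions in one batched forward pass and use probabilistic acceptance rather than exact token matching. MTP-style decoding is not shown.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       k: int = 3) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            nxt = draft_next(ctx)
            proposal.append(nxt)
            ctx.append(nxt)

        # 2. The target model checks the proposal. A real engine would score
        #    all positions in one batched pass; this toy calls it per position.
        rounds += 1
        ctx = list(tokens)
        accepted = []
        for tok in proposal:
            expected = target_next(ctx)
            if expected != tok:
                accepted.append(expected)  # keep the target's correction
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target_next(ctx))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], rounds


# Toy models (hypothetical): the target counts upward modulo 10; the draft
# agrees with it except right after a 4, where it guesses wrong.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, rounds = speculative_decode(target_next, draft_next, [0],
                                    max_new_tokens=12, k=3)
print(tokens, rounds)
```

Because every accepted token is either confirmed or supplied by the target model, the output matches plain greedy decoding with the target alone. The speedup comes from needing fewer verification rounds than generated tokens whenever the draft guesses well.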
Topics
- AI Agent Labs
- Inference Optimization
- Continual Learning
- LLM Benchmarks
- Multimodal Models
- Low-Bit Quantization
- Coding Agents
Best for: Investor, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space (www.latent.space).