Cohere cracks lossless quantization and native citations with first full Apache 2.0 licensed open model Command A+
Summary
Cohere has released Command A+, a 218-billion-parameter language model optimized for complex reasoning, multimodal document processing, and agentic workflows, under the permissive Apache 2.0 open-source license. This decoder-only Sparse Mixture-of-Experts (MoE) Transformer activates only 25 billion parameters per step, significantly reducing compute requirements. A key innovation is its nearly lossless W4A4 quantization (4-bit weights and 4-bit activations), which allows the model to run on a single NVIDIA Blackwell B200 GPU or two NVIDIA H100 GPUs, achieving 375 tokens per second with a Time-to-First-Token latency of 113 milliseconds. Command A+ also features an optimized tokenizer that supports 48 languages and improves efficiency for non-European languages by up to 20%. It demonstrates strong benchmark performance, including 85% on τ²-Bench Telecom and 90% on AIME 25, and offers native citation generation for traceability in regulated industries, alongside a 128K input context window for text and image processing.
Key takeaway
For AI engineers and MLOps teams prioritizing secure, cost-effective enterprise deployments, Command A+ is a compelling option. Its Apache 2.0 license and hardware-efficient W4A4 quantization let you run a frontier-grade model on your own infrastructure, reducing vendor lock-in and operational costs. Evaluate its native citation generation and multimodal capabilities for regulated-industry applications where data privacy and verifiable outputs are required.
Key insights
Cohere's Command A+ combines a sparse MoE architecture with nearly lossless 4-bit quantization for efficient, powerful enterprise AI.
Principles
- Sparse MoE models reduce inference compute.
- Quantization-Aware Distillation mitigates the reasoning "tax" (the accuracy lost on reasoning tasks when a model is quantized to low precision).
- Apache 2.0 licensing lets organizations self-host and modify the model, enabling sovereign AI.
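To make the first principle concrete, here is a minimal sketch of top-k expert routing, the general mechanism by which sparse MoE layers skip most of their parameters for each token. It is not Cohere's implementation: the dimensions, the expert count, the value of TOP_K, and all function names are assumptions chosen for illustration.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def rand_matrix(rows, cols):
    """Toy random weights; a real model would load trained parameters."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def moe_forward(x, gate, experts, top_k):
    """Run one token through only the top_k highest-scoring experts."""
    scores = matvec(gate, x)  # one router score per expert
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected experts only.
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    out = [0.0] * len(experts[0])
    for weight, idx in zip(exps, chosen):
        expert_out = matvec(experts[idx], x)  # the only expert matmuls that run
        out = [o + (weight / total) * e for o, e in zip(out, expert_out)]
    return out, chosen

# Hypothetical toy sizes, unrelated to Command A+'s real configuration.
random.seed(0)
D_MODEL, NUM_EXPERTS, TOP_K = 8, 8, 2
x = [random.gauss(0, 1) for _ in range(D_MODEL)]
gate = rand_matrix(NUM_EXPERTS, D_MODEL)
experts = [rand_matrix(D_MODEL, D_MODEL) for _ in range(NUM_EXPERTS)]

out, chosen = moe_forward(x, gate, experts, TOP_K)
print(f"experts used: {sorted(chosen)} ({TOP_K / NUM_EXPERTS:.0%} of expert compute)")
```

With these toy numbers only 2 of 8 expert matrices are multiplied per token. The same idea, at scale, is how a 218-billion-parameter model can activate roughly 25 billion parameters per step.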
Method
Command A+ achieves nearly lossless 4-bit quantization by quantizing MoE experts while keeping critical attention pathways at full precision, supplemented by Quantization-Aware Distillation.
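The selective approach described above can be illustrated with a small sketch, assuming plain symmetric per-tensor rounding to 4 bits. This is a generic textbook scheme, not Cohere's quantizer: it covers weights only (W4A4 also quantizes activations), it omits Quantization-Aware Distillation entirely, and the layer names and toy weights are hypothetical.

```python
import random

def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

random.seed(0)
# Hypothetical layer names and toy weights, for illustration only.
layers = {
    "expert_0.ffn": [random.gauss(0, 0.02) for _ in range(1024)],
    "attention.qkv": [random.gauss(0, 0.02) for _ in range(1024)],
}

deployed, scales = {}, {}
for name, weights in layers.items():
    if name.startswith("expert"):
        # MoE expert weights are stored as 4-bit codes plus one scale.
        codes, scales[name] = quantize_int4(weights)
        deployed[name] = dequantize(codes, scales[name])
    else:
        # Attention pathway is left untouched at full precision.
        deployed[name] = list(weights)

def max_error(name):
    """Largest absolute difference between original and deployed weights."""
    return max(abs(a - b) for a, b in zip(layers[name], deployed[name]))

for name in layers:
    print(f"{name}: max abs error = {max_error(name):.6f}")
```

The expert layer picks up a rounding error of at most half a quantization step, while the attention layer is reproduced exactly. Keeping the error out of the attention pathway is the intuition behind the mixed-precision split; recovering the remaining accuracy is the job the summary attributes to Quantization-Aware Distillation.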
In practice
- Deploy large models on fewer GPUs (e.g., 2x H100).
- Integrate native citations for regulatory compliance.
- Optimize multilingual inference costs.
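A back-of-the-envelope calculation shows why 4-bit weights matter for the GPU counts mentioned above. The parameter counts come from the summary; the GPU memory capacities (80 GB per H100, 180 GB per B200) are assumptions that vary by SKU, and the estimate deliberately ignores the full-precision attention weights, the KV cache, and activations, so treat it as a lower bound rather than a sizing guide.

```python
TOTAL_PARAMS = 218e9   # total parameters, from the summary
ACTIVE_PARAMS = 25e9   # parameters activated per step, from the summary

# Assumed nominal memory capacities in GB; check the spec of your exact hardware.
GPU_SETUPS_GB = {"1x B200": 180, "2x H100": 2 * 80}

def weight_gb(params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

def fits(bits_per_weight):
    """For each assumed GPU setup, whether the weights alone would fit."""
    need = weight_gb(TOTAL_PARAMS, bits_per_weight)
    return {name: need <= capacity for name, capacity in GPU_SETUPS_GB.items()}

for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_gb(TOTAL_PARAMS, bits):.0f} GB -> {fits(bits)}")

print(f"active fraction per step: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions, 16-bit weights need about 436 GB and fit neither setup, while 4-bit weights need about 109 GB and fit both, leaving headroom for the components the estimate leaves out.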
Topics
- Command A+
- Sparse Mixture-of-Experts
- Quantization
- Apache 2.0 License
- Agentic AI
- Multimodal AI
- Enterprise AI
Best for: CTO, AI Architect, Machine Learning Engineer, AI Engineer, MLOps Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by VentureBeat.