Cohere cracks lossless quantization and native citations with first full Apache 2.0 licensed open model Command A+

Source: VentureBeat · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure) · Depth: Advanced, medium

Summary

Cohere has released Command A+, a 218-billion-parameter language model optimized for complex reasoning, multimodal document processing, and agentic workflows, under the permissive Apache 2.0 open-source license. The model is a decoder-only Sparse Mixture-of-Experts (MoE) Transformer that activates only 25 billion parameters per step, which significantly reduces compute requirements. A key innovation is its nearly lossless W4A4 quantization, which lets the model run on a single NVIDIA Blackwell B200 GPU or two NVIDIA H100 GPUs and reach 375 tokens per second with a time-to-first-token latency of 113 milliseconds. Command A+ also ships an optimized tokenizer that supports 48 languages and improves efficiency for non-European languages by up to 20%. It posts strong benchmark results, including 85% on τ²-Bench Telecom and 90% on AIME 25, and offers native citation generation for traceability in regulated industries, alongside a 128K input context window for text and image processing.
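The hardware claim above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage only, using the parameter counts from the summary. The GPU memory capacities are nominal figures assumed for illustration and do not come from the article, and the helper name `weight_footprint_gb` is hypothetical. Because the model reportedly keeps some attention pathways at higher precision, and because the KV cache and activations also need memory, the real footprint will be somewhat larger than the 4-bit estimate.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


TOTAL_PARAMS_B = 218   # total parameters, from the summary
ACTIVE_PARAMS_B = 25   # parameters activated per step, from the summary

fp16_gb = weight_footprint_gb(TOTAL_PARAMS_B, 16)  # 436.0 GB at 16 bits
w4_gb = weight_footprint_gb(TOTAL_PARAMS_B, 4)     # 109.0 GB at 4 bits

# ASSUMPTION: nominal memory capacities, not stated in the article.
# Check the vendor datasheet for the exact configuration you deploy.
assumed_capacity_gb = {
    "2x H100 (80 GB each)": 160,
    "1x B200": 192,
}

fits_w4 = {name: w4_gb <= cap for name, cap in assumed_capacity_gb.items()}
fits_fp16 = {name: fp16_gb <= cap for name, cap in assumed_capacity_gb.items()}

# Share of the model that does work on any single step (about 11.5%).
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"16-bit weights: {fp16_gb:.0f} GB, 4-bit weights: {w4_gb:.0f} GB")
print(f"4-bit fits: {fits_w4}")
print(f"16-bit fits: {fits_fp16}")
print(f"active fraction per step: {active_fraction:.1%}")
```

Under these assumptions the 4-bit weights fit in both configurations with headroom, while 16-bit weights fit in neither, which is consistent with the deployment targets the summary describes.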

Key takeaway

For AI engineers and MLOps teams that prioritize secure, cost-effective enterprise deployments, Command A+ is a strong candidate. Its Apache 2.0 license and hardware-efficient W4A4 quantization let you run a frontier-grade model on your own infrastructure, reducing vendor lock-in and operational costs. Evaluate its native citation generation and multimodal capabilities for regulated-industry applications where data privacy and verifiable outputs are required.

Key insights

Cohere's Command A+ pairs a sparse MoE architecture with nearly lossless 4-bit (W4A4) quantization, so a 218-billion-parameter model can run on a single B200 or two H100 GPUs.

Method

Command A+ achieves nearly lossless 4-bit quantization by quantizing MoE experts while keeping critical attention pathways at full precision, supplemented by Quantization-Aware Distillation.
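As a rough illustration of the selective idea described above, the sketch below applies simple symmetric round-to-nearest int4 quantization to expert weights and leaves attention weights untouched. This is not Cohere's method: the parameter names, the `"experts"` name marker, and all function names are hypothetical, the sketch covers weights only (W4A4 also quantizes activations), and Quantization-Aware Distillation, which would train the quantized model against a full-precision teacher, is not shown.

```python
from typing import Dict, List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit integer range


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight vector to int4."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [min(INT4_MAX, max(INT4_MIN, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map int4 codes back to floats to inspect the quantization error."""
    return [c * scale for c in codes]


def quantize_selectively(
    params: Dict[str, List[float]], marker: str = "experts"
) -> Dict[str, List[float]]:
    """Quantize only tensors whose name contains `marker`; copy the rest as-is.

    ASSUMPTION: the name-based rule is a stand-in for whatever criterion the
    real model uses to decide which pathways stay at full precision.
    """
    out: Dict[str, List[float]] = {}
    for name, weights in params.items():
        if marker in name:
            codes, scale = quantize_int4(weights)
            out[name] = dequantize(codes, scale)
        else:
            out[name] = list(weights)  # attention and other tensors unchanged
    return out


# Hypothetical toy parameters: one attention tensor, one expert tensor.
params = {
    "layer0.attention.q_proj": [0.12, -0.57, 0.33],
    "layer0.experts.0.up_proj": [3.5, -1.0, 0.6],
}
result = quantize_selectively(params)

for name in params:
    print(name, params[name], "->", result[name])
```

In this toy run the expert tensor snaps to a grid with step 0.5 (so 0.6 becomes 0.5), while the attention tensor passes through unchanged. The appeal of this kind of split is that expert weights make up most of a sparse MoE model's parameters, so quantizing them captures most of the memory saving while the attention path keeps full precision.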

Best for: CTO, AI Architect, Machine Learning Engineer, AI Engineer, MLOps Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by VentureBeat.