Zamba2-VL Technical Report
Summary
Zamba2-VL is a suite of vision-language models (VLMs) built on the Zamba2 hybrid architecture, which combines Mamba2 state-space layers with a small number of shared Transformer blocks. The suite is competitive with leading Transformer-based open-weight VLMs such as Molmo2, Qwen3-VL, and InternVL3.5 on image understanding, reasoning, OCR, grounding, and counting benchmarks, and it significantly outperforms earlier SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Because it inherits Zamba2's near-linear prefill compute and small, constant recurrent state, Zamba2-VL achieves roughly an order of magnitude lower time-to-first-token (TTFT) than Transformer baselines at equivalent parameter scales. The advantage is most pronounced at the 1.2B and 2.7B scales, which makes those models well suited to on-device and edge deployments. The release includes three models (1.2B, 2.7B, and 7B) and inference code.
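To make the scaling argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the report: the function names, the cost formulas (one layer, projections and MLPs ignored), and the sizes `d_model` and `d_state` are illustrative assumptions. It only contrasts how a quadratic attention prefill and a per-token KV cache grow with sequence length against a linear-time scan with a fixed-size recurrent state. The source describes the hybrid as near-linear rather than linear, which this pure-scan estimate does not capture.

```python
def attention_prefill_cost(num_tokens: int, d_model: int) -> int:
    # Attention scores plus the weighted sum over values: both scale with n * n * d.
    return 2 * num_tokens * num_tokens * d_model


def scan_prefill_cost(num_tokens: int, d_model: int, d_state: int) -> int:
    # One state update and one readout per token: scales with n * d * s.
    return 2 * num_tokens * d_model * d_state


def kv_cache_entries(num_tokens: int, d_model: int) -> int:
    # Keys and values are kept for every token, so the cache grows with n.
    return 2 * num_tokens * d_model


def recurrent_state_entries(d_model: int, d_state: int) -> int:
    # The recurrent state has a fixed size, independent of sequence length.
    return d_model * d_state


if __name__ == "__main__":
    # Illustrative sizes only (assumption), not Zamba2-VL's configuration.
    d_model, d_state = 2048, 64
    for n in (1_024, 4_096, 16_384):
        compute_ratio = attention_prefill_cost(n, d_model) / scan_prefill_cost(n, d_model, d_state)
        memory_ratio = kv_cache_entries(n, d_model) / recurrent_state_entries(d_model, d_state)
        print(f"n={n}: compute ratio {compute_ratio:.0f}x, memory ratio {memory_ratio:.0f}x")
```

Under these assumptions both ratios grow in proportion to the number of tokens, which matters for VLMs because each image can add many visual tokens to the prompt.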
Key takeaway
For machine learning engineers deploying VLMs on resource-constrained edge devices, Zamba2-VL is an alternative to Transformer-based models worth evaluating. Start with the 1.2B or 2.7B models: they deliver roughly an order of magnitude lower time-to-first-token than Transformer baselines at equivalent parameter scales, which makes real-time applications more responsive. Lower latency may also reduce operating costs and improve the user experience of on-device VLM applications.
Key insights
Zamba2-VL combines Mamba2 state-space layers with shared Transformer blocks to reach competitive VLM performance at much lower time-to-first-token, an advantage that matters most on-device.
Principles
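- A hybrid of Mamba2 layers and a few shared Transformer blocks can outperform earlier SSM-based and hybrid VLMs.
- Near-linear prefill compute and a small, constant recurrent state lower time-to-first-token.
- The TTFT advantage is most pronounced at the smaller scales (1.2B and 2.7B).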
In practice
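- Consider the 1.2B or 2.7B models for on-device and edge VLM deployments.
- Reduce time-to-first-token in latency-sensitive VLM applications.
- Evaluate hybrid SSM-Transformer VLMs alongside Transformer baselines.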
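When comparing candidate models, time-to-first-token can be measured with a small harness like the sketch below. It assumes only that a model exposes its output as a Python iterable of tokens; `time_to_first_token` and `fake_model_stream` are hypothetical names introduced here for illustration, and the stand-in stream simulates prefill with a sleep rather than calling any real model or the released inference code.

```python
import time
from typing import Callable, Iterable, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_token(make_stream: Callable[[], Iterable[T]]) -> Tuple[T, float]:
    """Return the first streamed token and the seconds elapsed until it arrived."""
    start = time.perf_counter()
    stream = iter(make_stream())
    first = next(stream)  # raises StopIteration if the stream yields nothing
    elapsed = time.perf_counter() - start
    return first, elapsed


def fake_model_stream(prefill_seconds: float, tokens: Sequence[str]) -> Iterator[str]:
    """Hypothetical stand-in for a model: sleeps to mimic prefill, then yields tokens."""
    time.sleep(prefill_seconds)
    yield from tokens


if __name__ == "__main__":
    first, ttft = time_to_first_token(lambda: fake_model_stream(0.05, ["A", "cat"]))
    print(f"first token: {first!r}, TTFT: {ttft * 1000:.1f} ms")
```

For a real comparison, replace the stand-in with each model's own streaming call, use the same image and prompt for every model, and average over several runs after a warm-up pass. On an accelerator, make sure the first token has actually been computed before the timer stops.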
Topics
- Vision-Language Models
- Hybrid AI Architectures
- Mamba2
- Edge AI
- Inference Optimization
- Time-to-First-Token
Best for: MLOps Engineer, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.