MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

2026-06-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

MambaCount is an efficient framework for Text-guided Open-vocabulary Object Counting (TOOC), a task that is especially challenging in dense scenes with large scale variations. Existing TOOC approaches rely primarily on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. MambaCount instead builds on Mamba's linear complexity while addressing two of Mamba's weaknesses on spatial data: its inherently causal formulation and the high entropy of its spatial token responses. It introduces the Spatial Sparse State Space Duality (S^4D) block, which reconstructs Mamba's decay dynamics to relax the causal dependency constraint. Within this design, a Spatial Token Selection (STS) sub-block reduces the otherwise unconstrained entropy of token responses, and Multi-Granularity Prototypes (MGP) identify object-like regions to improve cross-modal alignment. MambaCount achieves leading performance among methods without secondary querying, with a test MAE of 12.23 on FSC-147, while maintaining linear complexity.
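To make the reported metric concrete, here is a minimal sketch of mean absolute error (MAE) for counting, assuming the usual definition: the average absolute difference between predicted and ground-truth counts over a set of images. The function name and the sample counts are hypothetical illustrations, not values from the paper.

```python
def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and ground-truth counts."""
    if len(predicted_counts) != len(true_counts):
        raise ValueError("both sequences must have the same length")
    if not predicted_counts:
        raise ValueError("need at least one image")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(predicted_counts)


# Hypothetical counts for three images (not results from the paper).
predicted = [48.0, 103.5, 12.0]
actual = [50, 100, 12]
mae = mean_absolute_error(predicted, actual)
print(f"MAE = {mae:.2f}")
```

Under this definition, a test MAE of 12.23 means predictions are off by about 12 objects per image on average. Lower is better.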

Key takeaway

For computer vision engineers implementing text-guided open-vocabulary object counting in dense scenes, MambaCount replaces the quadratic scaling of Transformers with linear complexity, which makes it better suited to high-resolution images. Consider it for projects where scalability and accurate object enumeration are critical: it reports the leading result among methods without secondary querying (test MAE 12.23 on FSC-147), and avoiding secondary querying also improves efficiency and interpretability.
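The scaling argument can be illustrated with a rough back-of-the-envelope sketch. It assumes a hypothetical patch size of 16 pixels (the summary does not state the tokenization) and counts token interactions rather than measured FLOPs: self-attention compares every token with every other token, while a linear-time scan visits each token once.

```python
def num_tokens(height, width, patch_size=16):
    """Number of non-overlapping patches (tokens); patch_size=16 is an assumption."""
    return (height // patch_size) * (width // patch_size)


def attention_pair_count(n_tokens):
    """Self-attention scores every token against every token: N * N pairs."""
    return n_tokens * n_tokens


def scan_step_count(n_tokens):
    """A linear-time scan processes each token once: N steps."""
    return n_tokens


for side in (384, 768, 1536):
    n = num_tokens(side, side)
    print(
        f"{side}x{side}: tokens={n}, "
        f"attention pairs={attention_pair_count(n)}, "
        f"scan steps={scan_step_count(n)}"
    )
```

In this simplified count, doubling the image side length quadruples the number of tokens and scan steps but multiplies the attention pair count by 16.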

Key insights

MambaCount uses a novel S^4D block to enable efficient, accurate text-guided object counting by overcoming Mamba's spatial limitations.

Principles

Mamba's linear complexity offers scalability over Transformers.
Causal modeling constrains bidirectional spatial dependency.
High entropy in spatial tokens weakens local details.
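The entropy principle can be illustrated with a small sketch. It uses top-k masking as a stand-in for token selection, which is an assumption made here for illustration and not the paper's STS design. The helper names and scores are hypothetical. The point is only that restricting the response to a few tokens lowers the entropy of the normalized weights.

```python
import math


def softmax(scores):
    """Normalize scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def keep_top_k(scores, k):
    """Mask all scores below the k-th largest to -inf (ties may keep more than k)."""
    threshold = sorted(scores, reverse=True)[k - 1]
    return [s if s >= threshold else float("-inf") for s in scores]


# Hypothetical token response scores: two strong tokens and four weak ones.
scores = [2.0, 1.9, 0.1, 0.0, -0.1, 0.05]
dense = softmax(scores)
sparse = softmax(keep_top_k(scores, k=2))
print(f"dense entropy  = {entropy(dense):.3f} nats")
print(f"sparse entropy = {entropy(sparse):.3f} nats")
```

With only two tokens kept, the entropy cannot exceed log 2, so the weight concentrates on the strongest responses instead of spreading across the whole image.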

Method

MambaCount reconstructs Mamba's hidden state decay dynamics, introduces a Spatial Token Selection (STS) sub-block, and designs Multi-Granularity Prototypes (MGP) for object region identification and cross-modal alignment.
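To show why a causal decay is limiting for images, the sketch below contrasts a one-directional scalar recurrence with a symmetric variant. This is not the paper's S^4D formulation: the scalar state, the fixed decay of 0.5, and the forward-plus-backward construction are all assumptions chosen for illustration. In the causal version each token only receives information from earlier tokens. The symmetric version weights every token by decay ** |i - j| and still runs in linear time.

```python
def causal_scan(x, decay):
    """h_t = decay * h_{t-1} + x_t: token t sees only tokens at or before t."""
    out, h = [], 0.0
    for value in x:
        h = decay * h + value
        out.append(h)
    return out


def bidirectional_scan(x, decay):
    """y_i = sum_j decay ** |i - j| * x_j, built from two linear-time scans."""
    forward = causal_scan(x, decay)
    backward = causal_scan(x[::-1], decay)[::-1]
    # x_i is counted once in each direction, so subtract one copy.
    return [f + b - v for f, b, v in zip(forward, backward, x)]


# A single active token in the middle of a 1-D sequence.
signal = [0.0, 0.0, 1.0, 0.0, 0.0]
causal = causal_scan(signal, decay=0.5)
symmetric = bidirectional_scan(signal, decay=0.5)
print("causal:   ", causal)
print("symmetric:", symmetric)
```

In the causal output, the tokens before the active one stay at zero. In the symmetric output, the response spreads equally in both directions, which is the kind of bidirectional spatial dependency that a causal formulation cannot express.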

In practice

Estimate object counts in dense scenes.
Improve scalability for high-resolution images.
Enhance cross-modal alignment in vision tasks.

Topics

Text-guided Object Counting
Open-vocabulary Vision
Mamba Models
State Space Models
Spatial Sparse State Space Duality (S^4D)
Cross-modal Alignment

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.