Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks
Summary
The Base-Station-Helped UAV (BHU) framework is proposed to enhance multi-UAV cooperative perception in low-altitude wireless networks by addressing challenges of massive visual data and communication latency. The framework employs a Top-K selection mechanism to identify and transmit only the most informative pixels from UAV-captured RGB images, significantly reducing data volume and latency. These sparsified images are sent to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts Bird's-Eye-View (BEV) features for cooperative fusion and ground vehicle perception. A diffusion model-based deep reinforcement learning (DRL) algorithm jointly optimizes cooperative UAV selection, sparsification ratios, and precoding matrices. Simulations on the Air-Co-Pred dataset demonstrate that BHU improves perception performance by over 5% and reduces communication overhead by 85% compared to traditional CNN-based BEV fusion baselines, offering an effective solution for resource-constrained environments.
Key takeaway
For AI scientists and computer vision engineers building multi-UAV perception systems, the BHU framework offers a way around communication bottlenecks. In the paper's simulations on the Air-Co-Pred dataset, Top-K sparsification combined with large vision model (LVM)-based BEV fusion improved perception performance by over 5% and cut communication overhead by 85% relative to traditional CNN-based BEV fusion baselines. Consider a DRL policy built on a denoising diffusion implicit model (DDIM) to dynamically optimize UAV selection, sparsification ratios, and precoding in resource-constrained low-altitude networks.
Key insights
BHU improves multi-UAV perception by sparsifying visual data before transmission, extracting BEV features with an LVM at the ground server, and using DRL to tune the communication parameters.
Principles
- Sparsified visual data reduces communication overhead.
- LVMs enhance perception accuracy from reduced data.
- DRL can jointly optimize communication and perception.
Method
The BHU framework applies Top-K pixel selection on each UAV, transmits the sparsified images via MU-MIMO to a ground server, extracts BEV features with a Swin-large MaskDINO encoder, and fuses the per-UAV features for ground vehicle perception. A DDIM-based DRL algorithm jointly optimizes cooperative UAV selection, sparsification ratios, and precoding matrices.
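The Top-K step can be illustrated with a minimal sketch. This summary does not say how the paper scores pixel informativeness, so the sketch assumes a per-pixel importance map is already available. The function name `top_k_sparsify`, the `scores` input, and the (row, col, rgb) output layout are all hypothetical choices for illustration, not the authors' implementation.

```python
import heapq
import math


def top_k_sparsify(image, scores, ratio):
    """Return the K highest-scoring pixels as (row, col, rgb) triples.

    image  : H x W nested list of (r, g, b) tuples
    scores : H x W nested list of floats (ASSUMED importance map; the
             scoring rule used in the paper is not given in this summary)
    ratio  : sparsification ratio, the fraction of pixels kept (0 < ratio <= 1)
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    h, w = len(image), len(image[0])
    k = max(1, math.ceil(ratio * h * w))  # number of pixels to keep
    coords = ((r, c) for r in range(h) for c in range(w))
    # Select the K coordinates with the largest importance score.
    top = heapq.nlargest(k, coords, key=lambda rc: scores[rc[0]][rc[1]])
    # Sort by position so the receiver can rebuild the sparse image in order.
    return [(r, c, image[r][c]) for r, c in sorted(top)]


image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (9, 9, 9)]]
scores = [[0.1, 0.9],
          [0.5, 0.2]]
print(top_k_sparsify(image, scores, ratio=0.5))
# [(0, 1, (0, 255, 0)), (1, 0, (0, 0, 255))]
```

Only the selected triples would need to be transmitted. The receiver can place them on an otherwise empty canvas before feature extraction.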
In practice
- Implement Top-K selection for UAV image transmission.
- Utilize Swin-large MaskDINO for BEV feature extraction.
- Apply DRL to balance perception utility and latency.
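The utility-versus-latency balance in the last point can be made concrete with a rough back-of-the-envelope sketch. It replaces the MU-MIMO link with a single Shannon-rate link (rate = bandwidth × log2(1 + SINR)), which ignores precoding and inter-user interference. The linear utility form, the `weight` parameter, and the `index_bits` overhead for pixel positions are assumptions made here for illustration; the paper's actual objective is not given in this summary.

```python
import math


def payload_bits(height, width, ratio, bits_per_pixel=24, index_bits=0):
    """Bits needed to send the kept pixels of one sparsified RGB image.

    index_bits is an ASSUMED per-pixel cost for encoding positions.
    """
    kept = math.ceil(ratio * height * width)
    return kept * (bits_per_pixel + index_bits)


def tx_latency_s(bits, bandwidth_hz, sinr_linear):
    """Transmission time over a single link at the Shannon rate.

    Simplification: a real MU-MIMO rate depends on the precoding matrices.
    """
    rate_bps = bandwidth_hz * math.log2(1 + sinr_linear)
    return bits / rate_bps


def utility(perception_score, latency_s, weight):
    """HYPOTHETICAL objective: perception score minus weighted latency."""
    return perception_score - weight * latency_s


bits = payload_bits(100, 100, ratio=0.25)                 # 2500 pixels * 24 bits
latency = tx_latency_s(bits, bandwidth_hz=1e6, sinr_linear=3.0)
print(bits, latency)  # 60000 0.03
```

In a sketch like this, lowering the ratio shrinks latency linearly, while its effect on the perception score has to be measured. That coupling is what the DRL agent is meant to learn.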
Topics
- Multi-UAV Perception
- Large Vision Models
- Deep Reinforcement Learning
- Communication Efficiency
- Bird's-Eye-View
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.