Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks

2026-03-19 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Internet of Things (IoT) & Connected Devices · Depth: Advanced, extended

Summary

The Base-Station-Helped UAV (BHU) framework is proposed to enhance multi-UAV cooperative perception in low-altitude wireless networks by addressing challenges of massive visual data and communication latency. The framework employs a Top-K selection mechanism to identify and transmit only the most informative pixels from UAV-captured RGB images, significantly reducing data volume and latency. These sparsified images are sent to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts Bird's-Eye-View (BEV) features for cooperative fusion and ground vehicle perception. A diffusion model-based deep reinforcement learning (DRL) algorithm jointly optimizes cooperative UAV selection, sparsification ratios, and precoding matrices. Simulations on the Air-Co-Pred dataset demonstrate that BHU improves perception performance by over 5% and reduces communication overhead by 85% compared to traditional CNN-based BEV fusion baselines, offering an effective solution for resource-constrained environments.

Key takeaway

For AI scientists and computer vision engineers building multi-UAV perception systems, the BHU framework targets the communication bottleneck directly. In simulations on the Air-Co-Pred dataset, Top-K sparsification combined with BEV fusion based on a large vision model (LVM) improved perception performance by over 5% and cut communication overhead by 85% relative to traditional CNN-based BEV fusion baselines. Consider DDIM-based (denoising diffusion implicit model) DRL to jointly tune UAV selection, sparsification ratios, and precoding in resource-constrained low-altitude networks.

Key insights

BHU improves multi-UAV perception by sparsifying visual data on the UAV, extracting BEV features with a large vision model (LVM) at the ground server, and using DRL to tune the communication settings.

Principles

Sparsified visual data reduces communication overhead.
LVMs can recover strong perception accuracy from sparsified input.
DRL can jointly optimize communication and perception.
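
To make the first principle concrete, here is a rough payload estimate for a sparsified image. It is a sketch under stated assumptions, not the paper's encoding: the function names are hypothetical, kept pixel values are assumed to be sent uncompressed, and kept positions are assumed to be signalled with a one-bit-per-pixel mask. The source does not say how positions are transmitted.

```python
def dense_payload_bits(height, width, channels=3, bits_per_channel=8):
    """Bits for the full uncompressed image, used as the comparison point."""
    return height * width * channels * bits_per_channel


def sparse_payload_bits(height, width, keep_ratio, channels=3, bits_per_channel=8):
    """Estimate the bits needed to send a Top-K sparsified image.

    Assumption (not from the source): positions of kept pixels are signalled
    with a 1-bit-per-pixel mask and kept values are sent uncompressed.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    total_pixels = height * width
    kept_pixels = int(round(keep_ratio * total_pixels))
    value_bits = kept_pixels * channels * bits_per_channel
    mask_bits = total_pixels  # one bit per pixel marks kept vs dropped
    return value_bits + mask_bits
```

Under these assumptions, a 100 x 100 image with keep_ratio=0.1 needs 34,000 bits instead of 240,000. The position mask is a fixed cost, so the saving is somewhat smaller than the keep ratio alone suggests.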

Method

The BHU framework selects the Top-K most informative pixels from each UAV's RGB image, transmits the sparsified images via MU-MIMO to a ground server, extracts BEV features with a Swin-large-based MaskDINO encoder, and fuses them across UAVs for ground vehicle perception. A DDIM-based DRL algorithm jointly optimizes cooperative UAV selection, sparsification ratios, and precoding matrices.
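
The Top-K step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the source does not state how pixel informativeness is scored, so the gradient-magnitude score below is an assumption, and `top_k_sparsify` is a hypothetical name.

```python
import numpy as np


def top_k_sparsify(image, keep_ratio):
    """Keep the K highest-scoring pixels of an RGB image and zero the rest.

    Assumption (not from the source): a pixel's score is the gradient
    magnitude of the grayscale image. The paper's scoring rule may differ.
    """
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an H x W x 3 RGB array")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    height, width, _ = image.shape
    gray = image.astype(np.float64).mean(axis=2)
    grad_rows, grad_cols = np.gradient(gray)
    score = np.hypot(grad_rows, grad_cols)

    k = max(1, int(round(keep_ratio * height * width)))
    # argpartition puts the indices of the k largest scores in the last k slots.
    flat_idx = np.argpartition(score.ravel(), -k)[-k:]

    mask = np.zeros(height * width, dtype=bool)
    mask[flat_idx] = True
    mask = mask.reshape(height, width)

    # Broadcast the 2-D mask over the colour channels.
    sparse = np.where(mask[..., None], image, 0).astype(image.dtype)
    return sparse, mask
```

Returning the mask alongside the sparsified image lets the receiver tell dropped pixels apart from pixels that are genuinely black.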

In practice

Implement Top-K selection for UAV image transmission.
Use a Swin-large-based MaskDINO encoder for BEV feature extraction at the ground server.
Apply DRL to balance perception utility and latency.
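
The trade-off in the last item can be written as a scalar reward for a DRL agent. The sketch below is illustrative only: the linear form of `utility`, the `latency_weight` parameter, and the single-link Shannon rate standing in for the MU-MIMO rate are all assumptions, not details taken from the paper.

```python
import math


def shannon_rate_bps(bandwidth_hz, snr_linear):
    """Single-link Shannon capacity, a stand-in for the per-UAV MU-MIMO rate."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def transmission_latency_s(payload_bits, rate_bps):
    """Time to send a payload at a fixed rate."""
    if rate_bps <= 0:
        raise ValueError("rate_bps must be positive")
    return payload_bits / rate_bps


def utility(perception_score, latency_s, latency_weight):
    """Hypothetical reward: perception score minus a weighted latency penalty."""
    return perception_score - latency_weight * latency_s
```

A lower sparsification ratio shrinks the payload and therefore the latency term, but it typically lowers the perception score as well. Weighing the two against each other is the job of the agent's reward.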

Topics

Multi-UAV Perception
Large Vision Models
Deep Reinforcement Learning
Communication Efficiency
Bird's-Eye-View

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.