Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

This systematic study examines how image-based re-identification (ReID) can be integrated into online 3D multi-pedestrian tracking (MPT) for mobile robots, where LiDAR-only systems struggle in crowded environments. Researchers from the University of Coimbra built a lightweight projection-based framework that decouples geometric modeling from appearance modeling, then compared feature extractors ranging from lightweight CNNs (ResNet-18, MobileNetV2, MobileNetV3-Small) to Vision Transformers. On the KITTI dataset, naive linear fusion of appearance and motion costs degraded tracking performance, whereas a cascaded matching strategy recovered occluded tracks and improved identity consistency without loss of precision. Lightweight CNNs and ReID-specific networks such as MGN offered the best accuracy-latency trade-off, with MGN (128-dim) reaching 79% mAP at over 150 FPS. Domain adaptation proved critical: fine-tuning on KITTI-ReID recovered 10.7 percentage points of mAP. Adding ReID raises latency from 57 ms to over 100 ms, but a cascaded approach with efficient appearance modeling still provides valuable track recovery. For example, an exponential moving average (EMA) of MobileNetV2 embeddings achieved 38.02 HOTA with 162 ID switches at 114 ms.

Key takeaway

Robotics engineers deploying 3D multi-pedestrian tracking in crowded human environments should adopt a cascaded ReID association strategy. Paired with a lightweight CNN such as MobileNetV2 that has been fine-tuned on domain-specific data, this approach recovers occluded tracks and maintains identity consistency. ReID adds latency, but the trade-off is justified in applications where safety and interaction continuity matter. For appearance memory, prefer an EMA of embeddings or simply the latest embedding to balance performance and computational cost.
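
The EMA appearance memory recommended here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the smoothing factor `alpha=0.9`, and the choice to L2-normalize after blending are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def ema_update(memory, embedding, alpha=0.9):
    """Blend a new ReID embedding into a track's stored appearance vector.

    alpha is an assumed smoothing factor: higher values trust the history more.
    Passing memory=None starts a new track from the first embedding.
    """
    new = l2_normalize(embedding)
    if memory is None:
        return new
    blended = [alpha * m + (1.0 - alpha) * e for m, e in zip(memory, new)]
    return l2_normalize(blended)


def cosine_distance(a, b):
    """Cosine distance in [0, 2]; 0 means identical direction."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))
```

Setting `alpha=0.0` reduces this to the "latest embedding" memory, which is the cheaper of the two options named in the takeaway.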

Key insights

Careful integration of image-based ReID into 3D MPT enhances identity consistency, especially with cascaded matching and domain-adapted lightweight models.

Method

A projection-based framework extracts 2D RoI crops from 3D detections, encodes them via a ReID network, and fuses motion (3D GIoU) and appearance (cosine distance) using a cascaded matching strategy.
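
The cascaded association described above might look roughly like the following sketch. It assumes the two cost matrices are already computed (for instance, 1 minus 3D GIoU for motion and cosine distance for appearance), and it uses a simple greedy matcher because the summary does not say which assignment solver the authors use. The function names and both gating thresholds are hypothetical.

```python
def greedy_match(cost, rows, cols, max_cost):
    """Pair rows with cols in order of increasing cost, skipping costs above max_cost."""
    candidates = sorted(
        (cost[r][c], r, c) for r in rows for c in cols if cost[r][c] <= max_cost
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _, r, c in candidates:
        if r not in used_rows and c not in used_cols:
            pairs.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return pairs


def cascaded_match(motion_cost, appearance_cost,
                   motion_gate=1.0, appearance_gate=0.3):
    """Two-stage association over [track][detection] cost matrices.

    Stage 1 matches on motion cost alone. Stage 2 tries to recover the
    remaining tracks (e.g. previously occluded ones) using appearance cost.
    Both gates are assumed values, not taken from the study.
    """
    n_tracks = len(motion_cost)
    n_dets = len(motion_cost[0]) if n_tracks else 0
    tracks, dets = range(n_tracks), range(n_dets)

    stage1 = greedy_match(motion_cost, tracks, dets, motion_gate)
    matched_tracks = {t for t, _ in stage1}
    matched_dets = {d for _, d in stage1}

    left_tracks = [t for t in tracks if t not in matched_tracks]
    left_dets = [d for d in dets if d not in matched_dets]
    stage2 = greedy_match(appearance_cost, left_tracks, left_dets, appearance_gate)
    return stage1, stage2
```

Because appearance is consulted only for tracks that motion could not explain, a noisy embedding cannot override a confident geometric match. This is one plausible reading of why the cascade avoids the degradation reported for naive linear fusion.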

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.