Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

Source: cs.CV updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

The study investigates whether frame-wise semantic segmentation of indoor lidar scans is feasible with a 2D-to-3D distillation pipeline adapted from the ScaLR framework. Visual Foundation Models (VFMs) such as DINOv2 and OneFormer generate pseudo-labels from camera images, and these pseudo-labels supervise a 3D lidar model (WaffleIron) without any manual lidar annotation. The evaluation uses indoor SLAM datasets (NTU-VIRAL, TIERS, and M2DGR) plus a small manually annotated ITC dataset. The distilled model reached up to 56% mIoU when evaluated against pseudo-labels and approximately 36% mIoU when validated against real labels. That is well above the supervised baseline (RandLA-Net at 10.7% mIoU), which supports the viability of cross-modal distillation for indoor lidar semantic segmentation despite domain gaps between datasets and limits in VFM robustness.
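The mIoU figures quoted in the summary are mean intersection-over-union scores. As a minimal sketch of how such a score is typically computed from per-point predictions, assuming integer class ids, an ignore label of -1, and averaging only over classes that actually occur (the function name `mean_iou` and these conventions are illustrative assumptions, not the paper's evaluation code):

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_label=-1):
    """Mean IoU over classes that appear in pred or target.

    Assumes pred values lie in [0, num_classes) and that target uses
    ignore_label for points without a usable label (an assumption made
    for this sketch).
    """
    pred = np.asarray(pred, dtype=np.int64)
    target = np.asarray(target, dtype=np.int64)

    # Drop points that have no label to compare against.
    valid = target != ignore_label
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows are true classes, columns are predicted classes.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp

    # Skip classes that never occur, so they do not drag the mean down.
    present = union > 0
    if not present.any():
        return float("nan")
    return float((tp[present] / union[present]).mean())
```

Running the same function once against pseudo-labels and once against manual labels gives the two kinds of score the summary distinguishes. The gap between them reflects errors in the pseudo-labels themselves as well as errors of the lidar model.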

Key takeaway

For AI scientists building real-time indoor 3D scene understanding systems, this research shows a viable path to semantic segmentation without extensive manual lidar annotation. Consider a cross-modal distillation pipeline that uses existing Visual Foundation Models to generate pseudo-labels for training 3D backbones. Combine diverse indoor datasets during distillation to improve generalization, and evaluate the 24-layer WaffleIron configuration as a candidate for a good performance-efficiency trade-off in your application.

Key insights

Cross-modal distillation from Visual Foundation Models enables effective, label-free indoor lidar semantic segmentation.

Principles

Method

The ScaLR framework is adapted to indoor SLAM datasets: VFM-generated 2D semantic masks are projected onto lidar point clouds to create per-point pseudo-labels, which then supervise a 3D lidar backbone for frame-wise segmentation.
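The projection step can be illustrated with a short sketch. It assumes a pinhole camera with known intrinsics `K`, a known rigid lidar-to-camera transform `T_cam_lidar`, and a per-pixel class mask already produced by a 2D model. The function name, the argument names, and the ignore label of -1 are hypothetical choices made for illustration. This is not the authors' implementation, and it omits occlusion handling and lens distortion.

```python
import numpy as np

def project_labels_to_points(points_lidar, label_mask, K, T_cam_lidar,
                             ignore_label=-1):
    """Give each lidar point the class id of the pixel it projects onto.

    points_lidar: (N, 3) xyz coordinates in the lidar frame.
    label_mask:   (H, W) integer class ids from a 2D segmenter.
    K:            (3, 3) pinhole intrinsics (assumed, no distortion).
    T_cam_lidar:  (4, 4) rigid transform from lidar frame to camera frame.
    Points behind the camera or outside the image keep ignore_label.
    """
    n = points_lidar.shape[0]
    h, w = label_mask.shape
    labels = np.full(n, ignore_label, dtype=np.int64)

    # Transform points into the camera frame using homogeneous coordinates.
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Only points with positive depth can be seen by the camera.
    idx = np.flatnonzero(pts_cam[:, 2] > 1e-6)

    # Perspective projection to pixel coordinates (u = column, v = row).
    uvw = (K @ pts_cam[idx].T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)

    # Keep projections that land inside the image bounds.
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels[idx[inside]] = label_mask[v[inside], u[inside]]
    return labels
```

Points left at the ignore label would simply be excluded from the training loss, so only the part of each scan covered by a camera contributes supervision for that frame.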

Best for: AI Scientist, Research Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.