Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

A study investigated the feasibility of frame-wise semantic segmentation of indoor lidar scans using a 2D-to-3D distillation pipeline adapted from the ScaLR framework. The approach uses Visual Foundation Models (VFMs) such as DINOv2 and OneFormer to generate pseudo-labels from camera images, which then supervise a 3D lidar model (WaffleIron) without manual lidar annotations. Evaluated on indoor SLAM datasets such as NTU-VIRAL, TIERS, and M2DGR, and on a small manually annotated ITC dataset, the distilled model reached up to 56% mIoU under pseudo-label evaluation and approximately 36% mIoU under real-label validation. That is well above a supervised baseline (RandLA-Net at 10.7% mIoU) and supports the viability of cross-modal distillation for indoor lidar semantic segmentation, despite domain gaps between datasets and limits in VFM robustness.
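
The mIoU figures in the summary are averages of per-class intersection over union. A minimal sketch of the metric in plain Python follows. It assumes integer class IDs per point and an ignore label of -1; both conventions are assumptions for illustration, not details taken from the paper.

```python
def mean_iou(pred, target, num_classes, ignore_label=-1):
    """Mean intersection-over-union over classes that occur in pred or target.

    pred, target: sequences of integer class IDs, one per lidar point.
    Points whose target equals ignore_label are skipped (assumed convention).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled points do not count toward the metric
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Classes that never occur are left out of the average.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

Calling mean_iou once with pseudo-labels as the target and once with manual labels as the target gives the two kinds of scores reported above.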

Key takeaway

For AI scientists developing real-time indoor 3D scene understanding systems, this research shows a viable path to semantic segmentation without extensive manual lidar annotation. Consider implementing a cross-modal distillation pipeline that uses existing Visual Foundation Models to generate pseudo-labels for training a 3D backbone. Combine diverse indoor datasets during distillation to improve generalization, and evaluate the 24-layer WaffleIron configuration for its accuracy-efficiency trade-off.

Key insights

Cross-modal distillation from Visual Foundation Models enables effective, label-free indoor lidar semantic segmentation.

Principles

Evaluation against pseudo-labels can approximate a model's performance on real labels.
Combining diverse datasets improves domain generalization.
Deeper networks offer marginal accuracy gains for increased cost.

Method

The ScaLR framework adapts to indoor SLAM datasets by projecting VFM-generated 2D semantic masks onto lidar point clouds to create pseudo-labels, which then supervise a 3D lidar backbone for frame-wise segmentation.
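
The projection step described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a pinhole camera without lens distortion, a known lidar-to-camera rotation R and translation t, a camera frame with z pointing forward, and no occlusion handling. The function name project_labels and all parameter names are hypothetical.

```python
def project_labels(points, mask, R, t, fx, fy, cx, cy, ignore_label=-1):
    """Assign each lidar point the class of the pixel it projects to.

    points: list of (x, y, z) coordinates in the lidar frame
    mask:   2D list; mask[v][u] is the class ID at pixel row v, column u
    R, t:   3x3 rotation (nested lists) and translation, lidar -> camera
    fx, fy, cx, cy: pinhole intrinsics (assumed, no distortion model)
    """
    height, width = len(mask), len(mask[0])
    labels = []
    for p in points:
        # Rigid transform into the camera frame: p_cam = R @ p + t
        xc, yc, zc = (
            sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)
        )
        if zc <= 0:
            labels.append(ignore_label)  # point lies behind the camera
            continue
        # Pinhole projection to the nearest pixel
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if 0 <= u < width and 0 <= v < height:
            labels.append(mask[v][u])
        else:
            labels.append(ignore_label)  # projects outside the image
    return labels
```

Points that fall outside the image or behind the camera receive the ignore label, so they can be excluded when the 3D backbone is supervised. A real pipeline would likely vectorize this loop and account for calibration error and occlusion.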

In practice

Use the 24-layer WaffleIron configuration for the best accuracy-efficiency trade-off.
Combine heterogeneous indoor datasets for better generalization.
Employ confidence-based filtering to refine pseudo-labels.
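
Confidence-based filtering of pseudo-labels, as recommended above, can be sketched in a few lines. The sketch assumes that each pseudo-label comes with a per-point confidence score in [0, 1] from the VFM. The threshold of 0.8 and the ignore label of -1 are hypothetical values chosen for illustration, not settings from the paper.

```python
def filter_pseudo_labels(labels, confidences, threshold=0.8, ignore_label=-1):
    """Replace pseudo-labels whose confidence is below threshold.

    labels:      per-point class IDs produced from the VFM masks
    confidences: per-point scores in [0, 1] (assumed to be available)
    Returns a new list; low-confidence points get ignore_label so a
    training loss can skip them.
    """
    return [
        label if conf >= threshold else ignore_label
        for label, conf in zip(labels, confidences)
    ]
```

A higher threshold keeps fewer but cleaner labels, so the value is worth tuning against a small manually annotated validation set where one is available.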

Topics

Lidar Semantic Segmentation
Cross-Modal Distillation
Visual Foundation Models
ScaLR Framework
Pseudo-label Generation

Best for: AI Scientist, Research Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.