SGFormer++: Semantic Graph Transformer for Incremental 3D Scene Graph Generation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision · Depth: Expert, quick

Summary

SGFormer++ is a Semantic Graph Transformer for 3D scene graph generation (SGG), the task of converting a point cloud scene into a semantic structural graph whose nodes represent objects and whose edges define their relationships. To address limitations of existing GCN-based methods, such as over-smoothing, SGFormer++ uses Transformer layers for global message passing. It integrates a Graph Embedding Layer++, which provides edge-aware global context with linear complexity, and a Semantic Injection Layer++, which enhances visual features with linguistic priors from LLMs and VLMs without adding trainable parameters. For incremental SGG, it adds a Spatial-guided Feature Adapter that calibrates predicate features with spatial geometry and a Cascaded Binary Prediction Head that mitigates catastrophic forgetting through classifier expansion and logit distillation. Experiments on the 3DSSG benchmark show that SGFormer++ achieves state-of-the-art performance, including a 4.49% absolute improvement in Predicate A@1 in the incremental setting.
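
To make the "global context with linear complexity" claim concrete, here is a minimal sketch of generic kernelized linear attention. It is not the paper's Graph Embedding Layer++: the summary does not say how that layer is built, the edge-aware part is omitted, and the function names and the elu(x) + 1 feature map are assumptions chosen for illustration. The point is only that key/value statistics can be aggregated once and reused for every query, so cost grows linearly with the number of nodes.

```python
import math


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention (assumed here).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Linear in the number of nodes: summarise keys/values once, reuse per query."""
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d                     # sum_j phi(k_j)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for a in range(d):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(phi_q[a] * k_sum[a] for a in range(d))
        outputs.append(
            [sum(phi_q[a] * kv[a][b] for a in range(d)) / norm for b in range(d_v)]
        )
    return outputs


def quadratic_attention(queries, keys, values):
    """Reference version that scores every query against every key (quadratic cost)."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total for b in range(d_v)]
        )
    return outputs


# Toy node features (made-up numbers, not data from the paper).
node_queries = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.1]]
node_keys = [[1.0, 0.0], [-0.5, 0.5], [0.2, -2.0]]
node_values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
fast = linear_attention(node_queries, node_keys, node_values)
slow = quadratic_attention(node_queries, node_keys, node_values)
```

Both functions compute the same weighted averages. The linear version simply reorders the sums so that no pairwise node-to-node score matrix is ever materialised.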

Key takeaway

Machine learning engineers building 3D scene graph generation systems, particularly ones that require incremental learning, should consider SGFormer++'s architecture. Its Transformer backbone provides global context, and its LLM/VLM linguistic priors enrich visual features semantically. The Spatial-guided Feature Adapter and Cascaded Binary Prediction Head handle new relationship categories while mitigating catastrophic forgetting, which improves performance in evolving 3D environments.
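
The two mechanisms named for the prediction head, classifier expansion and logit distillation, can be sketched generically as follows. This is an illustration under stated assumptions, not the paper's Cascaded Binary Prediction Head: the cascade itself is not modelled, per-class sigmoid (binary) outputs are assumed, and all function names, the zero initialisation, and the toy numbers are hypothetical.

```python
import math


def expand_classifier(weights, biases, num_new):
    """Append zero-initialised rows for new predicate classes; old rows are copied unchanged."""
    dim = len(weights[0])
    new_weights = [row[:] for row in weights] + [[0.0] * dim for _ in range(num_new)]
    new_biases = biases[:] + [0.0] * num_new
    return new_weights, new_biases


def class_logits(weights, biases, feature):
    # One logit per predicate class: w . x + b
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_logit_distillation(teacher_logits, student_logits, eps=1e-12):
    """Mean binary cross-entropy between teacher and student sigmoid outputs.

    zip() stops at the teacher's length, so only the old classes are compared.
    """
    loss = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p_t = sigmoid(t)
        p_s = min(max(sigmoid(s), eps), 1.0 - eps)
        loss -= p_t * math.log(p_s) + (1.0 - p_t) * math.log(1.0 - p_s)
    return loss / len(teacher_logits)


# Toy example: a 2-class head grows to 3 classes (made-up numbers).
old_w, old_b = [[0.5, -0.2], [0.1, 0.3]], [0.0, -0.1]
feature = [1.0, 2.0]
new_w, new_b = expand_classifier(old_w, old_b, num_new=1)

teacher = class_logits(old_w, old_b, feature)
student = class_logits(new_w, new_b, feature)
loss_unchanged = binary_logit_distillation(teacher, student)

# Simulate drift on an old class after training on new data.
new_w[0][0] += 1.0
loss_drifted = binary_logit_distillation(teacher, class_logits(new_w, new_b, feature))
```

Right after expansion the student reproduces the teacher on the old classes, so the distillation term sits at its minimum. Any drift in the old-class logits raises it, which is the pressure that counteracts forgetting.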

Key insights

SGFormer++ uses Transformers and LLM/VLM priors for robust 3D scene graph generation, excelling in incremental learning.

Method

SGFormer++ integrates a Graph Embedding Layer++ for global context and a Semantic Injection Layer++ for linguistic priors. It uses a Spatial-guided Feature Adapter and Cascaded Binary Prediction Head for incremental SGG.
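
One way linguistic priors could be mixed into visual features without trainable parameters is residual cross-attention with no learned projections. The sketch below shows only that generic idea. The summary does not describe how the Semantic Injection Layer++ actually works, so the function names, the scaled dot-product scoring, and the placeholder text embeddings (stand-ins for frozen LLM/VLM outputs) are all assumptions.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inject_semantics(visual, text_embeddings):
    """Parameter-free residual cross-attention of one visual feature over text embeddings."""
    scale = math.sqrt(len(visual))
    scores = [sum(v * t for v, t in zip(visual, emb)) / scale for emb in text_embeddings]
    weights = softmax(scores)
    prior = [
        sum(w * emb[i] for w, emb in zip(weights, text_embeddings))
        for i in range(len(visual))
    ]
    # Residual connection: the visual feature is enriched, not replaced.
    return [v + p for v, p in zip(visual, prior)], weights


# Placeholder vectors; in practice these would come from a frozen language model.
visual_feature = [1.0, 0.0, 0.0, 0.0]
class_text_embeddings = [
    [1.0, 0.0, 0.0, 0.0],  # hypothetical embedding of one class description
    [0.0, 1.0, 0.0, 0.0],  # hypothetical embedding of another class description
]
enriched, attention = inject_semantics(visual_feature, class_text_embeddings)
```

Because the function holds no weights, nothing here needs gradient updates. The attention weights depend only on how similar the visual feature is to each fixed text embedding.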

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.