Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Summary
FlowSG is a Scene Graph Generation (SGG) framework that recasts SGG as a progressive generative task rather than a one-shot classification problem. It models continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, it iteratively refines objects (nodes) and predicates (edges) conditioned on an image. A VQ-VAE quantizes scene graph features into compact tokens, and a graph Transformer predicts a conditional velocity field for the continuous geometry (bounding boxes) while updating discrete posteriors over the categorical tokens (object features and predicate labels). Training combines a flow-matching loss for geometry with a discrete-flow objective for tokens, which enables few-step inference and keeps the model compatible with standard detectors. On the Visual Genome (VG) and Panoptic SGG (PSG) datasets, FlowSG reports consistent improvements in predicate recall (R), mean recall (mR), and graph-level metrics, outperforming the state-of-the-art USG-Par by approximately 3 points.
Key takeaway
For research scientists building scene graph generation models, FlowSG's generative, progressive refinement offers an alternative to one-shot classification. Consider adopting hybrid flow matching to jointly evolve discrete semantic tokens and continuous geometric attributes; the paper reports better performance and global consistency than prior deterministic pipelines. Time-conditioned graph Transformers and VQ-VAE tokenization are worth experimenting with if you want a model that can correct its own errors across refinement steps and maintain structural coherence.
Key insights
FlowSG progressively generates scene graphs by jointly denoising discrete semantics and continuous geometry via hybrid flow matching.
Principles
- SGG benefits from progressive, iterative refinement over one-shot classification.
- Hybrid discrete-continuous state spaces enable richer graph generation.
- Semantic and geometric cues should evolve jointly for consistency.
Method
FlowSG first tokenizes scene graph features with a VQ-VAE. A graph Transformer with relation-modulated attention and flow-conditioned message aggregation then iteratively refines a noised graph, using continuous flow matching for boxes and a discrete flow for semantics.
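To make the continuous half of this method concrete, here is a minimal sketch of flow matching for a single bounding box, followed by few-step Euler inference. It assumes a straight-line (linear interpolation) probability path between Gaussian noise and the target box, which is a common choice but is not confirmed by this summary as FlowSG's exact formulation. The function names, the (cx, cy, w, h) box layout, and `oracle_velocity` (a stand-in for the graph Transformer) are all hypothetical.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

def euler_sample(predict_velocity, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with a few Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy setup (illustrative values, not from the paper).
target_box = [0.5, 0.5, 0.2, 0.3]  # assumed layout: (cx, cy, w, h), normalized
rng = random.Random(0)
noise_box = [rng.gauss(0.0, 1.0) for _ in target_box]

def oracle_velocity(x_t, t):
    # Hypothetical stand-in for the learned model: it already knows the
    # target and points straight at it. Valid for t < 1 only.
    return [(b - xi) / (1.0 - t) for xi, b in zip(x_t, target_box)]

loss = flow_matching_loss(oracle_velocity, noise_box, target_box, t=0.3)
recovered = euler_sample(oracle_velocity, noise_box, num_steps=4)
```

With a perfect velocity field the loss is zero up to floating-point error and four Euler steps land on the target box, which illustrates why a well-trained velocity predictor can support few-step inference. A real model would replace `oracle_velocity` with a network conditioned on the image and the current graph state.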
In practice
- Quantize visual features into compact, language-aligned codes.
- Initialize object categories as priors, and mask the relation and appearance tokens.
- Use a time-conditioned graph transformer for iterative denoising.
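The first two steps above can be sketched as follows. This is a rough illustration, not the paper's implementation: it assumes nearest-neighbor lookup under squared L2 distance for quantization (the standard VQ-VAE assignment rule), a hypothetical `MASK` sentinel, Gaussian noise for the initial boxes, and a made-up dictionary layout for the graph state.

```python
import random

MASK = -1  # hypothetical sentinel id for a masked token

def quantize(feature, codebook):
    """Return the index of the codebook vector nearest to `feature` (squared L2)."""
    def sq_dist(code):
        return sum((f - c) ** 2 for f, c in zip(feature, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def init_noised_graph(prior_labels, rng):
    """Build a starting graph state for iterative refinement.

    Object categories come from priors (e.g. a detector); appearance tokens
    and predicate labels start masked. Gaussian box noise is an assumption.
    """
    n = len(prior_labels)
    return {
        "object_labels": list(prior_labels),
        "appearance_tokens": [MASK] * n,
        "boxes": [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(n)],
        # One predicate slot per ordered pair of distinct objects.
        "predicates": {(i, j): MASK for i in range(n) for j in range(n) if i != j},
    }

# Toy usage with illustrative values.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
token = quantize([0.9, 0.2], codebook)
state = init_noised_graph(prior_labels=[3, 7], rng=random.Random(0))
```

A time-conditioned graph Transformer would then take `state` together with a time value and the image features, and progressively replace the masked entries and denoise the boxes.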
Topics
- Scene Graph Generation
- Flow Matching
- Hybrid Generative Model
- Graph Transformer
- VQ-VAE
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.