Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, long

Summary

FlowSG is a Scene Graph Generation (SGG) framework that recasts SGG as a progressive, generative task rather than a one-shot classification problem. It models continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, it iteratively refines the state to synthesize objects (nodes) and predicates (edges) conditioned on an image. A VQ-VAE quantizes scene graph features into compact tokens, and a graph Transformer predicts a conditional velocity field for continuous geometry (bounding boxes) while updating discrete posteriors for categorical tokens (object features and predicate labels). Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, which enables few-step inference and keeps the model compatible with standard detectors. Experiments on the Visual Genome (VG) and Panoptic SGG (PSG) datasets show consistent improvements in predicate recall (R), mean recall (mR), and graph-level metrics, outperforming the state-of-the-art USG-Par by approximately 3 points.
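
As a rough illustration of the geometry term, one common flow-matching formulation trains the velocity field along a straight path between a noise sample and the ground-truth box. The summary does not say which probability path or loss weighting FlowSG uses, so the linear path and the symbols below ($b_0$ for the noise sample, $b_1$ for the ground-truth box, $v_\theta$ for the predicted velocity, $I$ for the image) are assumptions made for illustration only:

```latex
% Assumed linear interpolation path between noise b_0 and ground-truth box b_1
b_t = (1 - t)\, b_0 + t\, b_1, \qquad t \sim \mathcal{U}[0, 1]

% Flow-matching regression target for this path is the constant velocity b_1 - b_0
\mathcal{L}_{\text{box}} = \mathbb{E}_{t,\, b_0,\, b_1}
  \left\| v_\theta(b_t, t \mid I) - (b_1 - b_0) \right\|_2^2
```

Under this assumed path the regression target does not depend on $t$, which is one reason straight-path flows tend to tolerate a small number of integration steps at inference.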

Key takeaway

For research scientists building scene graph generation models, FlowSG's generative, progressive refinement offers an alternative to traditional one-shot classification. Consider hybrid flow matching to jointly evolve discrete semantic tokens and continuous geometric attributes; the paper reports better performance and global consistency than prior deterministic pipelines. Time-conditioned graph Transformers and VQ-VAE tokenization are worth experimenting with as ways to let a model correct its own errors and maintain structural coherence.

Key insights

FlowSG progressively generates scene graphs by jointly denoising discrete semantics and continuous geometry via hybrid flow matching.

Principles

SGG benefits from progressive, iterative refinement over one-shot classification.
Hybrid discrete-continuous state spaces enable richer graph generation.
Semantic and geometric cues should evolve jointly for consistency.

Method

FlowSG uses a VQ-VAE for tokenization, then a graph Transformer with relation-modulated attention and flow-conditioned message aggregation to iteratively refine a noised graph via continuous flow matching for boxes and discrete flow for semantics.
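
The refinement loop described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the paper's sampler: the explicit Euler update, the linear unmasking schedule, the `MASK` id, and every function name (`refine`, `euler_box_step`, `unmask_tokens`, `toy_model`) are hypothetical, and `toy_model` merely stands in for the graph Transformer.

```python
import random

MASK = -1  # hypothetical id for a token that has not been generated yet


def euler_box_step(box, velocity, dt):
    """One explicit Euler step of the continuous flow: b <- b + dt * v."""
    return [b + dt * v for b, v in zip(box, velocity)]


def unmask_tokens(tokens, posteriors, p_reveal, rng):
    """Reveal each masked token with probability p_reveal, sampling its posterior."""
    out = []
    for tok, probs in zip(tokens, posteriors):
        if tok == MASK and rng.random() < p_reveal:
            tok = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        out.append(tok)
    return out


def refine(boxes, tokens, model, steps=4, seed=0):
    """Few-step sampler: update boxes (continuous) and tokens (discrete) together."""
    rng = random.Random(seed)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        velocities, posteriors = model(boxes, tokens, t)
        boxes = [euler_box_step(b, v, dt) for b, v in zip(boxes, velocities)]
        # Assumed linear unmasking schedule; force a full reveal on the last step.
        p_reveal = 1.0 if k == steps - 1 else dt / (1.0 - t)
        tokens = unmask_tokens(tokens, posteriors, p_reveal, rng)
    return boxes, tokens


def toy_model(boxes, tokens, t):
    """Stand-in for the graph Transformer: steers every box toward a fixed target."""
    target = [0.2, 0.3, 0.6, 0.8]
    velocities = [[(g - b) / (1.0 - t) for g, b in zip(target, box)] for box in boxes]
    posteriors = [[0.0, 1.0, 0.0] for _ in tokens]  # always predicts class 1
    return velocities, posteriors


if __name__ == "__main__":
    final_boxes, final_tokens = refine([[0.9, 0.1, 0.5, 0.5]], [MASK, MASK], toy_model)
    print(final_boxes, final_tokens)
```

With the toy model, four steps bring the box to the target (up to floating-point rounding) and resolve both masked tokens to class 1. A real model would recompute velocities and posteriors from the partially refined graph at each step, which is what lets later steps revise earlier decisions.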

In practice

Quantize visual features into compact, language-aligned codes.
Initialize object categories as priors; mask relations and appearances.
Use a time-conditioned graph Transformer for iterative denoising.
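
The first two steps above might look like the following sketch. It assumes a plain nearest-neighbor codebook lookup for quantization (the language alignment of the codes is not modeled here), a hypothetical `MASK` id, and a hypothetical dictionary layout for the graph state; none of these names come from the paper. Box initialization is left out because the summary does not describe it.

```python
MASK = -1  # hypothetical id for "not yet generated"


def quantize(feature, codebook):
    """Return the index of the nearest codebook vector (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def init_graph(category_priors):
    """Build the starting state: keep category priors, mask everything else."""
    n = len(category_priors)
    return {
        "category": list(category_priors),  # object categories kept as priors
        "appearance": [MASK] * n,           # appearance codes, still to be generated
        # One masked predicate slot per ordered pair of distinct objects.
        "relation": {(i, j): MASK for i in range(n) for j in range(n) if i != j},
    }


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(quantize([0.9, 0.8], codebook))  # index of the closest code
    print(init_graph([3, 7]))
```

Keeping relations as a dictionary keyed by ordered pairs makes the directed nature of predicates explicit: `(0, 1)` and `(1, 0)` are separate slots that the denoiser fills independently.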

Topics

Scene Graph Generation
Flow Matching
Hybrid Generative Model
Graph Transformer
VQ-VAE

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.