DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation
Summary
DesignSense introduces DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation. It addresses a gap: existing text-to-image preference models fail to generalize to layout aesthetics. The dataset is curated with a five-stage pipeline (semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement) that generates visually coherent layout transformations across diverse aspect ratios. Human preferences are captured with a 4-class scheme: "left," "right," "both good," and "both bad." Using this dataset, the authors train DesignSense, a vision-language model (VLM) based classifier that significantly outperforms existing open-source and proprietary models, achieving a 54.6% improvement in Macro F1 over the strongest proprietary baseline. The model also yields downstream gains in layout generation: generator win rates improve by approximately 3% with RL-based training, and inference-time scaling provides a 3.6% improvement.
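Macro F1 is the unweighted mean of per-class F1 scores, so each of the four preference classes counts equally regardless of how often it occurs. The sketch below is a minimal illustration of that metric, not the authors' evaluation code; the function name `macro_f1` and the convention of scoring a class with no true or predicted examples as 0 are assumptions made here.

```python
# The four preference classes named in the summary.
LABELS = ("left", "right", "both good", "both bad")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores.

    Assumption: a class with no true and no predicted examples scores 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN)
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes equally, a judge that rarely predicts "both good" or "both bad" is penalized even if its overall accuracy looks high.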
Key takeaway
For computer vision engineers developing graphic layout generation models, DesignSense can serve as a preference judge that improves alignment with human aesthetic preferences. Used as a reward model during reinforcement learning or for inference-time candidate selection, it raises generator win rates (by roughly 3% with RL-based training and 3.6% with inference-time scaling) and addresses the limitations of general-purpose VLMs on spatial arrangement tasks.
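One way to picture inference-time candidate selection is a round-robin comparison: generate several layouts, ask a pairwise judge about every pair, and keep the layout with the most wins. The sketch below assumes a hypothetical `judge(a, b)` callable that returns one of the four labels; the round-robin scoring rule and the name `select_best` are illustrative choices, not the procedure reported by the authors.

```python
from itertools import combinations


def select_best(candidates, judge):
    """Pick the candidate with the most pairwise wins.

    `judge(a, b)` is a hypothetical callable returning "left", "right",
    "both good", or "both bad". Tie verdicts award no points.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        verdict = judge(candidates[i], candidates[j])
        if verdict == "left":
            wins[i] += 1
        elif verdict == "right":
            wins[j] += 1
    # On equal win counts, the earliest candidate is kept.
    best = max(range(len(candidates)), key=lambda k: wins[k])
    return candidates[best]
```

Round-robin needs a number of judge calls that grows quadratically with the number of candidates; a sequential knockout would use fewer calls at the cost of being more sensitive to a single noisy verdict.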
Key insights
Specialized human preference data and models are crucial for aligning graphic layout generation with aesthetic judgment.
Principles
- Layout preference depends on spatial relationships and compositional balance.
- VLM-based models can reason about layout composition and design intent.
Method
A five-stage pipeline (grouping, prediction, filtering, clustering, refinement) generates diverse, high-quality layout pairs for human annotation. Annotators then label each pair with the 4-class scheme ("left," "right," "both good," "both bad").
In practice
- Use GPT-4o for semantic grouping of layout elements.
- Employ DINO v3 for feature extraction to select diverse layout pairs.
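To illustrate how image features can drive diversity selection, the sketch below applies greedy farthest-point sampling over feature vectors using cosine distance. This is a stand-in technique chosen for illustration: the summary says only that features are extracted and layouts are clustered, so the selection rule, the function names, and the plain-list feature format are all assumptions. In practice the vectors would come from the feature extractor rather than being written by hand.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; zero vectors are treated as distance 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def select_diverse(features, k):
    """Greedy farthest-point sampling: return indices of up to k items.

    Starts from index 0, then repeatedly adds the item whose distance
    to its nearest already-chosen item is largest.
    """
    if not features or k <= 0:
        return []
    chosen = [0]
    remaining = set(range(1, len(features)))
    # Distance from each item to its nearest chosen item so far.
    min_dist = [cosine_distance(f, features[0]) for f in features]
    while remaining and len(chosen) < k:
        # Ties are broken by the lowest index for determinism.
        nxt = max(sorted(remaining), key=lambda i: min_dist[i])
        chosen.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            d = cosine_distance(features[i], features[nxt])
            min_dist[i] = min(min_dist[i], d)
    return chosen
```

For example, given the vectors `[1, 0]`, `[0.99, 0.1]`, `[0, 1]`, and `[-1, 0]`, the near-duplicate second vector is the last to be picked.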
Topics
- Graphic Layout Generation
- Human Preference Datasets
- Reward Modeling
- Vision-Language Models
- Aesthetic Alignment
Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.