DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation

2026-03-02 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

DesignSense introduces DesignSense-10k, a dataset of 10,235 human-annotated preference pairs for evaluating graphic layouts. It targets a specific gap: existing text-to-image preference models fail to generalize to layout aesthetics. The dataset is curated with a five-stage pipeline (semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement) that generates visually coherent layout transformations across diverse aspect ratios. Annotators label each pair with one of four classes: "left," "right," "both good," or "both bad." On this dataset the authors train DesignSense, a vision-language model-based classifier that outperforms existing open-source and proprietary models, achieving a 54.6% improvement in Macro F1 over the strongest proprietary baseline. The model also delivers downstream gains in layout generation, improving generator win rates by approximately 3% during RL-based training and providing a 3.6% improvement with inference-time scaling.
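For readers less familiar with the headline metric, the following is a minimal sketch of how Macro F1 can be computed over the four preference classes. The function name `macro_f1` and the toy label lists are illustrative assumptions, not taken from the paper, and the choice to score a class with no predictions and no true instances as 0 is one common convention among several.

```python
# Minimal sketch: Macro F1 over the 4-class preference scheme.
# The toy labels below are made up for illustration; they are not paper data.

LABELS = ("left", "right", "both good", "both bad")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a class that never appears and is never predicted scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


if __name__ == "__main__":
    y_true = ["left", "right", "both good", "both bad", "left", "right"]
    y_pred = ["left", "right", "both good", "left", "left", "both bad"]
    print(f"Macro F1: {macro_f1(y_true, y_pred):.4f}")
```

Because every class contributes equally to the average, a judge that rarely predicts "both good" or "both bad" is penalized even if it handles "left" and "right" well.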

Key takeaway

For computer vision engineers building graphic layout generation models, DesignSense can serve as a preference judge that brings outputs closer to human aesthetic preferences. Used as a specialized reward model during reinforcement learning, or as a selector among candidates at inference time, it can raise win rates and work around the limitations of general-purpose VLMs on spatial arrangement tasks.
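One way to picture inference-time candidate selection is a sequential tournament driven by a pairwise judge. The sketch below is an assumption about how such a judge could be wired in, not the procedure the authors describe: `select_best` and `toy_judge` are hypothetical names, layouts are stood in for by plain numbers, and the real judge would be a call to the trained preference model.

```python
# Minimal sketch: pick one candidate using a pairwise 4-class judge.
# Hypothetical helper names; the actual DesignSense interface is not
# described in this summary.
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_best(candidates: Sequence[T], judge: Callable[[T, T], str]) -> T:
    """Keep a running champion; replace it only when the judge says 'right'."""
    if not candidates:
        raise ValueError("need at least one candidate")
    champion = candidates[0]
    for challenger in candidates[1:]:
        verdict = judge(champion, challenger)  # champion is shown on the left
        if verdict == "right":
            champion = challenger
        # "left", "both good", and "both bad" all keep the current champion.
    return champion


def toy_judge(left: float, right: float) -> str:
    """Stand-in judge: compares made-up quality scores with a tie margin."""
    margin = 0.1
    if left > right + margin:
        return "left"
    if right > left + margin:
        return "right"
    return "both good" if min(left, right) >= 0.5 else "both bad"


if __name__ == "__main__":
    fake_layout_scores = [0.2, 0.7, 0.65, 0.9, 0.3]
    print(select_best(fake_layout_scores, toy_judge))
```

This design needs only N-1 judge calls for N candidates. A round-robin comparison would be more robust to inconsistent verdicts, at quadratic cost.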

Key insights

Specialized human preference data and models are crucial for aligning graphic layout generation with aesthetic judgment.

Principles

Layout preference depends on spatial relationships and compositional balance.
VLM-based models can reason about layout composition and design intent.

Method

A five-stage pipeline (semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement) generates diverse, high-quality layout pairs for human annotation. Annotators then record their preference for each pair using a 4-class scheme.

In practice

Use GPT-4o for semantic grouping of layout elements.
Employ DINOv3 for feature extraction to select diverse layout pairs.
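The summary does not say how diversity is enforced once features are extracted. As one illustrative possibility, the sketch below runs greedy farthest-point sampling with cosine distance over feature vectors. This is a swapped-in technique, not necessarily the authors' method, and the short hand-written vectors merely stand in for real image embeddings; `farthest_point_sample` is a hypothetical helper name.

```python
# Minimal sketch: greedy farthest-point sampling over feature vectors.
# Assumption: features are already extracted; toy 2-D vectors are used here.
import math


def cosine_distance(a, b):
    """1 - cosine similarity; zero-norm vectors are treated as maximally far."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def farthest_point_sample(features, k):
    """Return indices of up to k items, each chosen to be far from those picked."""
    if k <= 0 or not features:
        return []
    selected = [0]  # start from the first item (an arbitrary choice)
    # min_dist[i] = distance from item i to its nearest selected item
    min_dist = [cosine_distance(f, features[0]) for f in features]
    while len(selected) < min(k, len(features)):
        remaining = [i for i in range(len(features)) if i not in selected]
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], cosine_distance(f, features[nxt]))
    return selected


if __name__ == "__main__":
    toy_features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(farthest_point_sample(toy_features, 3))
```

Near-duplicate items, such as the first two toy vectors, are picked last, which is the behavior wanted when choosing varied layout pairs to send to annotators.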

Topics

Graphic Layout Generation
Human Preference Datasets
Reward Modeling
Vision-Language Models
Aesthetic Alignment

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.