Contrastive-SDXL: Annotation-Preserving Night-Time Augmentation for Pedestrian Detection

2026-05-19 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Contrastive-SDXL is a day-to-night augmentation framework for night-time pedestrian detection, a task limited by the scarcity of labeled night-time data. It builds on SDXL-Turbo, fine-tuned with Low-Rank Adaptation (LoRA), and is designed to preserve detector-relevant objects and local semantic structure during image-to-image translation, so that daytime annotations remain valid for the translated images. A patch-wise semantic contrastive loss, guided by a pretrained DINOv2 encoder, maintains semantic correspondence between each daytime input and its translated night-time image. Multi-level DINOv2 self-attention maps enforce both local and global semantic consistency, and an explicit object consistency loss preserves pedestrians. The generated night-time images reach a Fréchet Inception Distance (FID) of 22.5, and detectors trained with the synthetic data reduce miss rates by 6-7% compared to daytime-only baselines, closely matching the performance obtained with real night-time data.

Key takeaway

For research scientists developing pedestrian detection systems, Contrastive-SDXL offers a way to address night-time data scarcity. Consider consistency-driven diffusion augmentation, particularly a semantic contrastive loss combined with multi-level attention guidance, to generate synthetic night-time data that keeps daytime annotations usable. In the reported results, this reduces miss rates by 6-7% compared to daytime-only training and approaches the performance obtained with real night-time datasets, which matters in safety-critical applications.

Key insights

Consistency-driven diffusion augmentation effectively supports safety-critical night-time pedestrian detection by preserving semantic details.

Principles

Semantic consistency is crucial for synthetic data.
Patch-wise contrastive loss enhances domain translation.
Multi-level attention maps ensure global and local consistency.
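
To make the patch-wise contrastive principle concrete, here is a minimal, dependency-free sketch of an InfoNCE-style loss over patch features. It is an illustration under stated assumptions, not the paper's implementation: the function names, the cosine similarity, and the temperature of 0.07 are all hypothetical choices, and the toy vectors stand in for DINOv2 patch embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def patch_infonce(day_feats, night_feats, temperature=0.07):
    """Mean InfoNCE loss over patches (hypothetical helper).

    day_feats[i] and night_feats[i] describe the same spatial patch and form
    the positive pair; all other daytime patches act as negatives.
    """
    assert len(day_feats) == len(night_feats)
    total = 0.0
    for i, query in enumerate(night_feats):
        logits = [cosine(query, key) / temperature for key in day_feats]
        # Numerically stable log-sum-exp over all daytime patches.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching patch as the target class.
        total += log_sum_exp - logits[i]
    return total / len(night_feats)

# Toy features: three patches with three-dimensional embeddings.
day = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
night_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
night_shuffled = [night_aligned[1], night_aligned[2], night_aligned[0]]

loss_aligned = patch_infonce(day, night_aligned)
loss_shuffled = patch_infonce(day, night_shuffled)
```

The loss is small when each translated patch is most similar to the daytime patch at the same location, and large when content has moved or changed, which is the property that keeps bounding-box annotations meaningful after translation.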

Method

Contrastive-SDXL uses SDXL-Turbo, LoRA fine-tuning, a DINOv2-guided patch-wise semantic contrastive loss, and multi-level DINOv2 self-attention maps to generate annotation-preserving night-time images.
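
The LoRA part of this recipe can be sketched in a few lines. The example below shows the standard low-rank update, y = W x + (alpha / r) B A x, on plain Python lists; the function names, matrix sizes, and scaling value are illustrative assumptions, and nothing here reflects which SDXL-Turbo layers the authors adapt.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Forward pass of a linear layer with a LoRA adapter (hypothetical helper).

    W: frozen pretrained weights, shape d_out x d_in.
    A: trainable down-projection, shape r x d_in.
    B: trainable up-projection, shape d_out x r.
    """
    r = len(A)
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_in, d_out, r):
    """Adapter parameter count, to compare against d_in * d_out for full fine-tuning."""
    return r * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]             # rank r = 1
x = [1.0, 1.0]

# With B initialised to zero, the adapted layer reproduces the frozen layer.
y_init = lora_forward(W, A, [[0.0], [0.0]], x)

# After training, B is non-zero and shifts the output.
y_tuned = lora_forward(W, A, [[1.0], [2.0]], x, alpha=2.0)
```

Only A and B receive gradients, so the number of trained parameters grows with the rank r rather than with the full weight matrix, which is what makes fine-tuning a large diffusion backbone affordable.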

In practice

Use DINOv2 for semantic guidance.
Apply LoRA for efficient model fine-tuning.
Integrate object consistency loss for critical elements.
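
One plausible form of an object consistency term is sketched below. This summary does not specify the paper's formulation, so the sketch makes its own assumptions: it compares a single-channel map of the daytime input and the translated image (for example an edge map or one feature channel) and averages the absolute difference inside the annotated pedestrian boxes only. The function name and box convention are hypothetical.

```python
def object_consistency_loss(day_map, night_map, boxes):
    """Mean absolute difference inside pedestrian boxes (hypothetical helper).

    day_map, night_map: 2-D lists (H x W) of scalar values.
    boxes: list of (x0, y0, x1, y1) with half-open pixel ranges.
    Overlapping boxes are counted once per box in this simple version.
    """
    total, count = 0.0, 0
    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                total += abs(day_map[y][x] - night_map[y][x])
                count += 1
    return total / count if count else 0.0

# A 4 x 4 daytime map with a "pedestrian" in the central 2 x 2 region.
day = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
pedestrian_boxes = [(1, 1, 3, 3)]

# The background changes (day to night) but the pedestrian is intact.
night_ok = [
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 1.0, 1.0, 5.0],
    [5.0, 1.0, 1.0, 5.0],
    [5.0, 5.0, 5.0, 5.0],
]
# The pedestrian has been erased by the translation.
night_erased = [[0.0] * 4 for _ in range(4)]

loss_ok = object_consistency_loss(day, night_ok, pedestrian_boxes)
loss_erased = object_consistency_loss(day, night_erased, pedestrian_boxes)
```

Restricting the penalty to annotated regions lets the generator change global illumination freely while discouraging it from removing or distorting the objects the detector is trained to find.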

Topics

Contrastive-SDXL
Night-time Pedestrian Detection
Latent Diffusion Models
SDXL-Turbo
Low-Rank Adaptation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.