Contrastive-SDXL: Annotation-Preserving Night-Time Augmentation for Pedestrian Detection
Summary
Contrastive-SDXL is a day-to-night augmentation framework designed to improve night-time pedestrian detection, a task hampered by the scarcity of labeled night-time data. Built on SDXL-Turbo and fine-tuned with Low-Rank Adaptation (LoRA), the framework targets a core requirement of image-to-image translation for detection: the translated image must preserve detector-relevant objects and local semantic structure so that the original annotations remain valid. It introduces a patch-wise semantic contrastive loss, guided by a pretrained DINOv2 encoder, to maintain semantic correspondence between daytime inputs and translated night-time images. Multi-level DINOv2 self-attention maps enforce both local and global semantic consistency, and an explicit object consistency loss preserves pedestrians. Contrastive-SDXL generates realistic night-time images, achieving a Fréchet Inception Distance (FID) of 22.5. Detectors trained with its synthetic data reduce miss rates by 6-7% compared to daytime-only baselines, closely matching the performance of detectors trained on real night-time data.
Key takeaway
For research scientists developing pedestrian detection systems, Contrastive-SDXL offers a way to overcome night-time data scarcity. Consider integrating consistency-driven diffusion augmentation, particularly techniques that combine a semantic contrastive loss with multi-level attention, to generate high-fidelity synthetic data. In the reported results, this approach reduces miss rates by 6-7% compared to daytime-only baselines and approaches the efficacy of real night-time datasets, which matters in safety-critical applications.
Key insights
Consistency-driven diffusion augmentation effectively supports safety-critical night-time pedestrian detection by preserving semantic details.
Principles
- Semantic consistency is crucial for synthetic data.
- Patch-wise contrastive loss enhances domain translation.
- Multi-level attention maps ensure global and local consistency.
Method
Contrastive-SDXL uses SDXL-Turbo, LoRA fine-tuning, a DINOv2-guided patch-wise semantic contrastive loss, and multi-level DINOv2 self-attention maps to generate annotation-preserving night-time images.
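To make the patch-wise contrastive idea concrete, here is a minimal sketch. It assumes an InfoNCE-style formulation in which each patch of the translated night-time image is pulled toward the daytime patch at the same location and pushed away from all other daytime patches. The summary does not state the exact loss, so the function name `patch_nce_loss`, the temperature value, and the use of plain feature lists in place of DINOv2 patch embeddings are all illustrative assumptions.

```python
import math


def _normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def patch_nce_loss(day_feats, night_feats, temperature=0.07):
    """InfoNCE-style loss over patch features (hypothetical sketch).

    day_feats[i] and night_feats[i] are feature vectors for the patch at the
    same spatial location in the daytime input and the translated image.
    In the paper's setting these would come from a DINOv2 encoder; here they
    are plain lists of floats.
    """
    if not night_feats or len(day_feats) != len(night_feats):
        raise ValueError("need the same non-zero number of patches in both images")

    day = [_normalize(f) for f in day_feats]
    night = [_normalize(f) for f in night_feats]

    total = 0.0
    for i, query in enumerate(night):
        # Cosine similarity of this night patch to every daytime patch.
        logits = [sum(a * b for a, b in zip(query, key)) / temperature for key in day]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the same-location daytime patch as the positive.
        total += log_denominator - logits[i]
    return total / len(night)
```

With features that stay aligned patch by patch, the loss is close to zero; if the translation scrambles which patch carries which content, the loss grows. That is the property that discourages the generator from moving or erasing annotated objects.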
In practice
- Use DINOv2 for semantic guidance.
- Apply LoRA for efficient model fine-tuning.
- Integrate object consistency loss for critical elements.
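The LoRA recommendation above can be illustrated with a small, dependency-free sketch. It shows the general low-rank update W + (alpha / r) * B A applied to one frozen linear layer. The class name `LoRALinear`, the initialization scale, and the default hyperparameters are assumptions for illustration, not details taken from Contrastive-SDXL.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """A frozen linear layer plus a trainable low-rank update (hypothetical sketch)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight            # frozen base weight, shape (out, in)
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A gets small random values and B starts at zero, so the adapted
        # layer initially behaves exactly like the base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a single input vector."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameters(self):
        """Count only the adapter parameters; the base weight stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a layer with `out` outputs and `in` inputs, the adapter trains `r * (in + out)` values instead of `in * out`, which is why LoRA keeps fine-tuning of a large diffusion backbone inexpensive.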
Topics
- Contrastive-SDXL
- Night-time Pedestrian Detection
- Latent Diffusion Models
- SDXL-Turbo
- Low-Rank Adaptation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.