See & Sniff: Learning Visuo-Olfactory Representations
Summary
This work introduces SmellNet-V, a scalable visuo-olfactory dataset, and See & Sniff, a self-supervised framework for learning joint visuo-olfactory representations. SmellNet-V addresses the scarcity of paired visuo-olfactory data with one observation: odor identity stays largely invariant to visual transformations within a semantic category. Smell-only samples can therefore be paired synthetically with semantically aligned web images, yielding a cross-modal benchmark without costly co-collection. See & Sniff learns the joint representation through dense local alignment, which produces smell saliency maps that spatially ground odor sources as a by-product. The work also establishes a pixel-level smell localization task and benchmark. The approach improves smell classification from smell alone by 7% over smell-only baselines and extends to cross-modal retrieval and smell localization, positioning the work as a first step in visuo-olfactory learning for multimodal perception.
Key takeaway
For machine learning engineers building multimodal perception systems, this work offers a viable path to integrating olfaction. Consider using semantic invariance to synthetically expand scarce cross-modal datasets, as SmellNet-V does. The approach supports visuo-olfactory models with improved classification and spatial smell localization, even without expensive co-collected data.
Key insights
Because odor identity is largely invariant to visual transformations within a semantic category, visuo-olfactory data can be generated synthetically and used to learn joint representations.
Principles
- Odor identity is largely invariant to visual transformations within a semantic category.
- Self-supervised learning can align distinct modalities without direct co-collection.
- Dense local alignment facilitates spatial grounding of sensory inputs.
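The summary does not say which self-supervised objective See & Sniff uses. As an illustration of how two modalities can be aligned without labels, here is a minimal sketch of an InfoNCE-style contrastive loss, a common choice for this kind of problem and an assumption here, not the paper's confirmed method. The function name `info_nce`, the `temperature` value, and the toy similarity matrices are all hypothetical.

```python
import math

def info_nce(sim, temperature=0.1):
    """Mean cross-entropy of matching row i to column i.

    sim[i][j] is the similarity between smell sample i and image j.
    Pair (i, i) is treated as the positive; every other column is a negative.
    ASSUMPTION: this objective is illustrative, not taken from the paper.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Toy similarity matrices for two smell-image pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]      # each smell is closest to its own image
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # each smell is closest to the wrong image

loss_aligned = info_nce(aligned)
loss_misaligned = info_nce(misaligned)
```

The loss is lower when each smell embedding is most similar to its paired image, which is the signal a training loop would minimize. This sketch covers only the smell-to-image direction; a symmetric variant would add the same loss computed on the transposed matrix.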
Method
The See & Sniff framework learns joint visuo-olfactory representations through dense local alignment, which also produces smell saliency maps. It trains on SmellNet-V, in which smell-only samples are synthetically paired with web images.
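To make the idea of dense local alignment concrete, the sketch below scores every image patch against a single smell embedding and reads the resulting grid as a saliency map. It assumes an image encoder that outputs a grid of patch embeddings and a smell encoder that outputs one vector in the same space; cosine similarity and max pooling are assumptions, since the summary does not specify either. The names `smell_saliency` and `pair_score` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def smell_saliency(patch_grid, smell_vec):
    """Score each image patch against one smell embedding.

    patch_grid: H x W nested list of D-dim patch embeddings
                (ASSUMED output of an image encoder).
    smell_vec:  D-dim smell embedding (ASSUMED output of a smell encoder).
    Returns an H x W grid of similarities, read here as a saliency map.
    """
    return [[cosine(patch, smell_vec) for patch in row] for row in patch_grid]

def pair_score(saliency):
    """Pool the map into one image-smell score (max pooling is an ASSUMPTION)."""
    return max(max(row) for row in saliency)

# Toy 2 x 2 grid of 2-dim patch embeddings.
patch_grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 1.0]],
]
smell_vec = [0.0, 2.0]

saliency = smell_saliency(patch_grid, smell_vec)
score = pair_score(saliency)
```

In this toy case the top-right patch points in the same direction as the smell embedding, so it receives the highest saliency. In a trained model, the high-scoring patches would be read as the likely odor source.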
In practice
- Synthetically generate cross-modal datasets from unimodal sources.
- Develop models for smell classification and localization.
- Ground odor sources spatially using smell saliency maps.
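The first item above can be sketched as a label-matched sampling step: each smell-only sample is paired with one or more images that share its semantic category. This is a minimal sketch under the assumption that both sources carry category labels. The function `pair_by_category`, its parameters, and the sample IDs are hypothetical and do not describe SmellNet-V's actual construction pipeline.

```python
import random
from collections import defaultdict

def pair_by_category(smell_samples, images, per_sample=1, seed=0):
    """Pair each smell sample with images from the same semantic category.

    smell_samples: list of (smell_id, label) tuples.
    images:        list of (image_id, label) tuples.
    Returns a list of (smell_id, image_id, label) triples.
    ASSUMPTION: labels from both sources share one category vocabulary.
    """
    rng = random.Random(seed)  # seeded so the pairing is reproducible
    by_label = defaultdict(list)
    for image_id, label in images:
        by_label[label].append(image_id)

    pairs = []
    for smell_id, label in smell_samples:
        candidates = by_label.get(label, [])
        if not candidates:
            continue  # no image exists for this category, so skip the sample
        k = min(per_sample, len(candidates))
        for image_id in rng.sample(candidates, k):
            pairs.append((smell_id, image_id, label))
    return pairs

# Hypothetical toy data: three smell recordings and three web images.
smell_samples = [("s1", "coffee"), ("s2", "banana"), ("s3", "durian")]
images = [("img_a", "coffee"), ("img_b", "coffee"), ("img_c", "banana")]

pairs = pair_by_category(smell_samples, images)
```

Here `s3` is dropped because no image carries the `durian` label. Sampling a different image per epoch, rather than fixing one pair, would be one way to exploit the invariance assumption as a form of augmentation.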
Topics
- Visuo-Olfactory Learning
- Multimodal Perception
- Self-Supervised Learning
- SmellNet-V Dataset
- Smell Localization
- Cross-Modal Retrieval
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.