See & Sniff: Learning Visuo-Olfactory Representations

2026-06-25 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

The work introduces SmellNet-V, a scalable visuo-olfactory dataset, and See & Sniff, a self-supervised framework for learning joint visuo-olfactory representations. SmellNet-V addresses the lack of paired visuo-olfactory data with one observation: odor identity stays largely invariant to visual transformations within a semantic category. Smell-only samples can therefore be paired synthetically with semantically aligned web images, producing a cross-modal benchmark without costly co-collection. See & Sniff learns the joint representations through dense local alignment, which yields smell saliency maps that spatially ground odor sources. The work also establishes a pixel-level smell localization task and benchmark. The approach improves smell classification from smell alone by 7% over smell-only baselines and extends to cross-modal retrieval and smell localization, an early step toward visuo-olfactory learning in multimodal perception.

Key takeaway

For machine learning engineers building multimodal perception systems, this work offers a viable path to integrating olfaction. Consider using semantic invariance to synthetically expand scarce cross-modal datasets, as SmellNet-V does. The approach lets you train visuo-olfactory models that improve smell classification and support spatial smell localization without expensive co-collected data.

Key insights

Because odor identity is largely invariant to visual transformations within a semantic category, smell-only samples can be paired synthetically with images and used for joint representation learning.

Principles

Odor identity is largely invariant to visual transformations within a semantic category.
Self-supervised learning can align distinct modalities without direct co-collection.
Dense local alignment facilitates spatial grounding of sensory inputs.

Method

The See & Sniff framework learns joint visuo-olfactory representations using dense local alignment, which also produces smell saliency maps. It uses smell-only samples synthetically paired with semantically aligned web images from the SmellNet-V dataset.
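
To make dense local alignment concrete, here is a minimal NumPy sketch of one way it could work. It is not the See & Sniff implementation. It assumes an image encoder that outputs one embedding per patch and a smell encoder that outputs a single vector. The function names, the softmax pooling, the temperatures, and the InfoNCE-style loss are all illustrative assumptions, since the summary does not specify them.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def smell_saliency(patch_emb, smell_emb):
    """Cosine similarity between every image patch and one smell embedding.

    patch_emb: (H, W, D) hypothetical per-patch image features.
    smell_emb: (D,) hypothetical smell feature.
    Returns an (H, W) map; high values mark patches aligned with the odor.
    """
    return l2_normalize(patch_emb) @ l2_normalize(smell_emb)

def pooled_score(saliency, temperature=0.1):
    """Softmax-weighted mean of patch similarities (assumed pooling choice)."""
    flat = saliency.ravel()
    weights = np.exp((flat - flat.max()) / temperature)
    weights /= weights.sum()
    return float((weights * flat).sum())

def info_nce(score_matrix, temperature=0.07):
    """Symmetric InfoNCE over a (B, B) image-by-smell score matrix.

    Matched pairs are assumed to sit on the diagonal.
    """
    logits = score_matrix / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In this sketch the saliency map comes directly from the alignment step: the per-patch similarities that feed the training score are the same values read out for localization.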

In practice

Synthetically generate cross-modal datasets from unimodal sources.
Develop models for smell classification and localization.
Ground odor sources spatially using smell saliency maps.
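
The first item above, building a cross-modal dataset from unimodal sources, can be sketched as category-matched sampling. This is a simplified illustration under stated assumptions, not the SmellNet-V pipeline. It assumes every smell sample and every web image already carries a semantic category label. The function name, the tuple layout, and the uniform random choice are hypothetical.

```python
import random
from collections import defaultdict

def pair_smell_with_images(smell_samples, image_pool, seed=0):
    """Pair each smell sample with a random image of the same category.

    smell_samples: list of (smell_id, category) tuples.
    image_pool: list of (image_path, category) tuples.
    Returns a list of (smell_id, image_path, category) tuples.
    """
    rng = random.Random(seed)

    # Index images by semantic category for constant-time lookup.
    by_category = defaultdict(list)
    for path, category in image_pool:
        by_category[category].append(path)

    pairs = []
    for smell_id, category in smell_samples:
        candidates = by_category.get(category)
        if not candidates:
            # No aligned image: skip rather than pair across categories.
            continue
        pairs.append((smell_id, rng.choice(candidates), category))
    return pairs

# Usage with made-up identifiers and paths.
smells = [("s1", "coffee"), ("s2", "lemon"), ("s3", "garlic")]
images = [("img/coffee_01.jpg", "coffee"), ("img/coffee_02.jpg", "coffee"),
          ("img/lemon_01.jpg", "lemon")]
pairs = pair_smell_with_images(smells, images)
```

Because any image in the category is treated as a valid partner, re-sampling the pairing each epoch would act as a form of augmentation. Whether the original work does this is not stated in the summary.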

Topics

Visuo-Olfactory Learning
Multimodal Perception
Self-Supervised Learning
SmellNet-V Dataset
Smell Localization
Cross-Modal Retrieval

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.