FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images
Summary
FoodSense is a new human-annotated dataset designed for cross-sensory inference from food images, containing 66,842 participant-image pairs across 2,987 unique food images. Unlike prior vision-language research focused on recognition tasks like meal identification or nutrition estimation, FoodSense enables models to predict taste, smell, texture, and sound. Each pair includes numeric ratings (1-5) and free-text descriptors for these four sensory dimensions. The dataset also features image-grounded reasoning traces, generated by a large language model, which expand short human annotations into visual justifications. Researchers used these annotations to train FoodSense-VL, a vision-language benchmark model capable of producing multisensory ratings and grounded explanations directly from food images.
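As a rough illustration of the structure described above, the sketch below models one participant-image pair and averages the ratings per image. The class name, field names, and helper function are hypothetical and not taken from the dataset's actual schema; only the four sensory dimensions, the 1-5 rating scale, and the free-text descriptors come from the summary.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable

# The four sensory dimensions named in the summary.
SENSES = ("taste", "smell", "texture", "sound")


@dataclass
class Annotation:
    """One participant-image pair (hypothetical field names)."""
    participant_id: str
    image_id: str
    ratings: Dict[str, int]      # sense -> rating on the 1-5 scale
    descriptors: Dict[str, str]  # sense -> free-text descriptor

    def __post_init__(self) -> None:
        # Reject ratings outside the 1-5 scale stated in the summary.
        for sense in SENSES:
            if not 1 <= self.ratings[sense] <= 5:
                raise ValueError(f"{sense} rating must be in 1..5")


def mean_ratings_per_image(
    annotations: Iterable[Annotation],
) -> Dict[str, Dict[str, float]]:
    """Average each sense's rating over all participants per image."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann.image_id].append(ann)
    return {
        image_id: {s: mean(a.ratings[s] for a in group) for s in SENSES}
        for image_id, group in grouped.items()
    }


# Average number of participant ratings per image, from the stated counts.
PAIRS, IMAGES = 66_842, 2_987
print(round(PAIRS / IMAGES, 1))  # about 22.4 pairs per image
```

With the stated counts, each image carries roughly 22 participant ratings on average, so per-image aggregation of this kind is one plausible way to form training or evaluation targets.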
Key takeaway
For research scientists developing multimodal AI, FoodSense offers a resource for moving beyond image recognition toward predicting human sensory experience. Consider using the dataset to train models that both predict and explain multisensory perceptions; doing so could also reveal limitations in current evaluation metrics for sensory inference from images.
Key insights
The FoodSense dataset enables AI models to infer multisensory food experiences from images, bridging cognitive science and multimodal AI.
Principles
- Humans infer multisensory expectations (taste, smell, texture, sound) from food images.
- Visual justifications enhance sensory expectation models.
Method
A large language model, conditioned on the image, ratings, and descriptors, expands the short human annotations into image-grounded reasoning traces. These traces are then used to train a vision-language model for multisensory prediction.
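The conditioning step can be pictured as assembling a request that pairs the image with the human ratings and descriptors. The sketch below is only an illustration: the function name, the request layout, and the prompt wording are assumptions, since the summary does not give the actual prompt or model interface.

```python
from typing import Dict

SENSES = ("taste", "smell", "texture", "sound")


def build_trace_request(
    image_path: str,
    ratings: Dict[str, int],
    descriptors: Dict[str, str],
) -> Dict[str, str]:
    """Assemble a hypothetical request asking an LLM to expand short
    human annotations into an image-grounded reasoning trace."""
    # One line per sense, carrying the human rating and descriptor.
    annotation_lines = [
        f"- {sense}: {ratings[sense]}/5, described as '{descriptors[sense]}'"
        for sense in SENSES
    ]
    # Illustrative wording only; not the prompt used by the authors.
    prompt = (
        "A participant viewed the attached food image and gave these "
        "sensory annotations:\n"
        + "\n".join(annotation_lines)
        + "\nFor each sense, explain which visible cues in the image "
        "support the rating and descriptor."
    )
    # The image is referenced by path and sent alongside the text prompt.
    return {"image": image_path, "prompt": prompt}


request = build_trace_request(
    "example.jpg",  # placeholder path
    {"taste": 4, "smell": 3, "texture": 5, "sound": 2},
    {"taste": "sweet", "smell": "buttery", "texture": "crispy", "sound": "crunchy"},
)
print(request["prompt"])
```

Keeping the human annotations in the prompt means the generated trace is asked to justify ratings people actually gave, not to produce new ratings of its own.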
In practice
- Train models to predict taste, smell, texture, and sound.
- Generate visual justifications for sensory predictions.
Topics
- FoodSense Dataset
- Multisensory Perception
- Cross-Sensory Inference
- Vision-Language Models
- Image-Grounded Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.