FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

2026-04-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

FoodSense is a new human-annotated dataset designed for cross-sensory inference from food images, containing 66,842 participant-image pairs across 2,987 unique food images. Unlike prior vision-language research focused on recognition tasks like meal identification or nutrition estimation, FoodSense enables models to predict taste, smell, texture, and sound. Each pair includes numeric ratings (1-5) and free-text descriptors for these four sensory dimensions. The dataset also features image-grounded reasoning traces, generated by a large language model, which expand short human annotations into visual justifications. Researchers used these annotations to train FoodSense-VL, a vision-language benchmark model capable of producing multisensory ratings and grounded explanations directly from food images.
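To make the annotation structure concrete, here is a minimal sketch of one participant-image pair and a per-image aggregation. The field names, identifiers, and example values are assumptions for illustration; the summary does not describe the dataset's actual file schema. With 66,842 pairs over 2,987 images, each image has roughly 22 annotations on average, so some aggregation of this kind is likely needed before training or evaluation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

SENSES = ("taste", "smell", "texture", "sound")


@dataclass(frozen=True)
class Annotation:
    """One participant-image pair (hypothetical field names, not the official schema)."""
    image_id: str
    participant_id: str
    ratings: dict      # sense -> integer rating on the 1-5 scale
    descriptors: dict  # sense -> short free-text descriptor

    def __post_init__(self):
        # Reject ratings outside the 1-5 range described in the summary.
        for sense in SENSES:
            rating = self.ratings[sense]
            if not 1 <= rating <= 5:
                raise ValueError(f"{sense} rating {rating} is outside 1-5")


def mean_ratings_per_image(annotations):
    """Average each sense's rating over all participants who rated an image."""
    grouped = defaultdict(lambda: defaultdict(list))
    for ann in annotations:
        for sense in SENSES:
            grouped[ann.image_id][sense].append(ann.ratings[sense])
    return {
        image_id: {sense: mean(values) for sense, values in by_sense.items()}
        for image_id, by_sense in grouped.items()
    }


# Made-up example data: two participants rating the same image.
annotations = [
    Annotation("img_001", "p01",
               {"taste": 4, "smell": 3, "texture": 5, "sound": 2},
               {"taste": "sweet", "smell": "buttery", "texture": "flaky", "sound": "quiet"}),
    Annotation("img_001", "p02",
               {"taste": 2, "smell": 3, "texture": 4, "sound": 2},
               {"taste": "mild", "smell": "yeasty", "texture": "crisp", "sound": "soft crunch"}),
]
per_image = mean_ratings_per_image(annotations)
```

Averaging is only one option; keeping the individual participant ratings preserves disagreement between raters, which a mean discards.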

Key takeaway

For research scientists developing multimodal AI, FoodSense offers a resource for moving beyond basic image recognition toward predicting complex human sensory experiences. Consider integrating this dataset to train models that both predict and explain multisensory perceptions; doing so could also reveal limitations in current evaluation metrics for sensory inference from visual input.

Key insights

The FoodSense dataset enables AI models to infer multisensory food experiences from images, bridging cognitive science and multimodal AI.

Principles

Humans infer multisensory expectations (taste, smell, texture, sound) from food images.
Visual justifications enhance sensory expectation models.

Method

A large language model expands human annotations into image-grounded reasoning traces, conditioned on image, ratings, and descriptors, to train a vision-language model for multisensory prediction.
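The conditioning step described above can be sketched as a prompt builder that packs one annotation's ratings and descriptors into text for a multimodal LLM. The function name, the template wording, and the example values are hypothetical; the summary does not give the authors' actual prompt or say which model was used, so no model call is shown.

```python
SENSES = ("taste", "smell", "texture", "sound")


def build_trace_prompt(image_ref, ratings, descriptors):
    """Assemble a text prompt asking an LLM to justify human annotations visually.

    The wording is a hypothetical template, not the prompt used by the authors.
    The image itself would be passed to a multimodal model alongside this text.
    """
    lines = [
        f"Image: {image_ref}",
        "Human annotations (rating on a 1-5 scale, then a short descriptor):",
    ]
    for sense in SENSES:
        lines.append(f'- {sense}: {ratings[sense]}/5, "{descriptors[sense]}"')
    lines.append(
        "For each sense, point to visible cues in the image that support the "
        "rating and descriptor. Do not change the ratings."
    )
    return "\n".join(lines)


# Made-up example annotation.
prompt = build_trace_prompt(
    "img_001.jpg",
    {"taste": 4, "smell": 3, "texture": 5, "sound": 2},
    {"taste": "sweet", "smell": "buttery", "texture": "flaky", "sound": "quiet"},
)
```

Keeping the human ratings fixed in the prompt reflects the description above: the language model expands the annotation into a justification rather than producing new labels.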

In practice

Train models to predict taste, smell, texture, and sound from food images.
Generate visual justifications for sensory predictions.
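One simple way to check rating predictions is a per-sense mean absolute error against aggregated human ratings. This is a sketch under assumptions: the summary does not name the benchmark's metrics, and the data layout and values below are made up for illustration.

```python
SENSES = ("taste", "smell", "texture", "sound")


def per_sense_mae(predicted, reference):
    """Mean absolute error per sense.

    Both arguments map image_id -> {sense: rating}; this layout is an
    assumption for illustration, not the dataset's format.
    """
    errors = {sense: [] for sense in SENSES}
    for image_id, ref in reference.items():
        pred = predicted[image_id]
        for sense in SENSES:
            errors[sense].append(abs(pred[sense] - ref[sense]))
    return {sense: sum(vals) / len(vals) for sense, vals in errors.items()}


# Made-up human reference ratings and model predictions for two images.
reference = {
    "img_001": {"taste": 3.0, "smell": 3.0, "texture": 4.5, "sound": 2.0},
    "img_002": {"taste": 4.0, "smell": 2.0, "texture": 3.0, "sound": 1.0},
}
predicted = {
    "img_001": {"taste": 3.5, "smell": 3.0, "texture": 4.0, "sound": 2.0},
    "img_002": {"taste": 4.0, "smell": 3.0, "texture": 3.0, "sound": 2.0},
}
mae = per_sense_mae(predicted, reference)
```

A single error number per sense says nothing about the quality of the generated explanations, which would need a separate evaluation.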

Topics

FoodSense Dataset
Multisensory Perception
Cross-Sensory Inference
Vision-Language Models
Image-Grounded Reasoning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.