FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, quick

Summary

FoodSense is a new human-annotated dataset designed for cross-sensory inference from food images, containing 66,842 participant-image pairs across 2,987 unique food images. Unlike prior vision-language research focused on recognition tasks like meal identification or nutrition estimation, FoodSense enables models to predict taste, smell, texture, and sound. Each pair includes numeric ratings (1-5) and free-text descriptors for these four sensory dimensions. The dataset also features image-grounded reasoning traces, generated by a large language model, which expand short human annotations into visual justifications. Researchers used these annotations to train FoodSense-VL, a vision-language benchmark model capable of producing multisensory ratings and grounded explanations directly from food images.
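To make the annotation structure concrete, here is a minimal Python sketch of how one participant-image pair might be represented and how ratings could be averaged per image. The class name, field names, and helper function are illustrative assumptions, not the dataset's actual schema; only the four senses and the 1-5 rating scale come from the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four sensory dimensions named in the summary.
SENSES = ("taste", "smell", "texture", "sound")


@dataclass(frozen=True)
class SensoryAnnotation:
    """One participant-image pair (hypothetical field names)."""
    image_id: str
    participant_id: str
    ratings: dict      # sense -> integer rating in 1..5
    descriptors: dict  # sense -> free-text descriptor

    def __post_init__(self):
        # Reject records with a missing or out-of-range rating.
        for sense in SENSES:
            rating = self.ratings.get(sense)
            if not isinstance(rating, int) or not 1 <= rating <= 5:
                raise ValueError(f"{sense} rating must be an integer in 1..5")


def mean_ratings_per_image(annotations):
    """Average each sense's rating over all participants who rated an image."""
    grouped = defaultdict(lambda: defaultdict(list))
    for ann in annotations:
        for sense in SENSES:
            grouped[ann.image_id][sense].append(ann.ratings[sense])
    return {
        image_id: {sense: mean(values) for sense, values in by_sense.items()}
        for image_id, by_sense in grouped.items()
    }
```

Averaging is only one possible aggregation; the summary does not say how per-participant ratings are combined, so treat this choice as an assumption.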

Key takeaway

For research scientists developing multimodal AI, FoodSense offers a resource for moving beyond basic image recognition toward predicting complex human sensory experiences. Consider integrating this dataset to train models that both predict and explain multisensory perceptions, which could also reveal limitations in current evaluation metrics for vision-based sensory inference.

Key insights

The FoodSense dataset enables AI models to infer multisensory food experiences from images, bridging cognitive science and multimodal AI.

Method

A large language model, conditioned on the image, ratings, and descriptors, expands the short human annotations into image-grounded reasoning traces. These traces are then used to train a vision-language model for multisensory prediction.
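As a rough illustration of the conditioning step, the sketch below assembles the text part of a prompt from one participant's ratings and descriptors. The function name and the prompt wording are hypothetical and are not taken from the paper; the image itself would be passed to the multimodal model separately, and no model call is shown.

```python
# The four sensory dimensions named in the summary.
SENSES = ("taste", "smell", "texture", "sound")


def build_trace_prompt(ratings, descriptors):
    """Build the text half of a trace-generation prompt (illustrative wording).

    ratings: dict mapping each sense to a 1-5 rating
    descriptors: dict mapping each sense to a free-text descriptor
    """
    lines = [
        "You are shown a food image together with one participant's annotations.",
        "For each sense, explain which visible cues in the image support the rating.",
        "",
    ]
    for sense in SENSES:
        lines.append(
            f"{sense}: rating {ratings[sense]}/5, descriptor: {descriptors[sense]!r}"
        )
    lines.append("")
    lines.append("Write a short justification per sense that refers only to what is visible.")
    return "\n".join(lines)


if __name__ == "__main__":
    example = build_trace_prompt(
        {"taste": 4, "smell": 3, "texture": 5, "sound": 4},
        {"taste": "sweet", "smell": "buttery", "texture": "crispy", "sound": "crunchy"},
    )
    print(example)
```

Keeping the human ratings and descriptors verbatim in the prompt reflects the idea that the generated trace should justify an existing annotation rather than invent a new one.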

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.