Connecting Speech to Words through Images

2026-06-15 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A visually grounded method builds a vocabulary of spoken words and links each one to its written form, without explicit textual supervision. It first uses image captioning systems to create a written word vocabulary from the salient visual concepts in images. It then identifies the spoken utterances whose associated image captions contain those words. An unsupervised word discovery technique aligns these utterances to locate and segment instances of each target spoken word, which links every spoken segment to its written counterpart. In experiments, the method outperformed a strong neural baseline on spoken word retrieval and keyword spotting, and it was also more interpretable. The results demonstrate feasibility in English and point to potential for low-resource languages that lack transcripts.

Key takeaway

For NLP engineers building speech-to-text systems for low-resource languages, this visually grounded approach offers a way to build a basic spoken word vocabulary without text transcripts. Consider combining image captioning with unsupervised word discovery to map speech segments to written words. Because no transcripts are needed, the approach can reduce the annotation burden, and it offers more interpretable results in keyword spotting and spoken word retrieval applications.

Key insights

Mappings between spoken and written words can be learned without text supervision by grounding both in the shared visual context of images and their descriptions.

Principles

Visual context bridges speech-text gap.
Unsupervised alignment can segment speech.
Interpretability enhances word discovery.
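
As a rough illustration of the second principle, the sketch below finds the pair of windows in two utterances that match best frame by frame. It is a fixed-width sliding comparison chosen for brevity, not the word discovery technique used in the source. The function name `best_shared_segment`, the toy three-dimensional frames, and the window width are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_shared_segment(frames_a, frames_b, width):
    """Return (score, start_a, start_b) for the best-matching pair of windows.

    Hypothetical helper: each window is `width` frames long and the score is
    the mean frame-wise cosine similarity between the two windows.
    """
    best = (float("-inf"), 0, 0)
    for i in range(len(frames_a) - width + 1):
        for j in range(len(frames_b) - width + 1):
            score = sum(
                cosine(frames_a[i + k], frames_b[j + k]) for k in range(width)
            ) / width
            if score > best[0]:
                best = (score, i, j)
    return best

# Toy frames (assumed): the "word" is the two-frame pattern [0,1,0], [0,0,1].
frames_a = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
frames_b = [[0, 1, 0], [0, 0, 1], [-1, 0, 0]]

score, start_a, start_b = best_shared_segment(frames_a, frames_b, width=2)
print(score, start_a, start_b)
```

In this toy case the shared pattern starts at frame 2 of the first utterance and frame 0 of the second. A real system would work on learned speech features and allow variable-length matches, for example with dynamic time warping.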

Method

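Use an image captioning system to derive a written vocabulary. For each word, find the utterances whose associated captions contain it. Apply unsupervised word discovery to align those utterances and segment the spoken word, linking each segment to its written form without explicit textual supervision.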
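
The first two steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the source's implementation: captions are assumed to be already generated and keyed by utterance id, and the stopword list, the `min_count` threshold, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "with", "at"}

def tokenize(caption):
    """Lowercase a caption and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", caption.lower())

def build_vocabulary(captions_by_utt, min_count=2):
    """Keep non-stopword caption words that occur for at least `min_count` utterances."""
    counts = Counter()
    for caption in captions_by_utt.values():
        counts.update(set(tokenize(caption)) - STOPWORDS)
    return {word for word, count in counts.items() if count >= min_count}

def utterances_for_word(word, captions_by_utt):
    """Return the ids of utterances whose associated caption contains `word`."""
    return sorted(
        utt_id
        for utt_id, caption in captions_by_utt.items()
        if word in tokenize(caption)
    )

# Toy data (assumed): generated captions for the image paired with each utterance.
captions_by_utt = {
    "utt1": "A dog runs on the beach",
    "utt2": "A brown dog with a ball",
    "utt3": "A child on the beach",
}

vocab = build_vocabulary(captions_by_utt)
dog_utts = utterances_for_word("dog", captions_by_utt)
print(sorted(vocab), dog_utts)
```

The utterances returned for each word are the candidates that the word discovery step would then align to find the spoken segment they have in common.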

In practice

Build spoken vocabularies for low-resource languages.
Improve keyword spotting in untranscribed audio.
Enhance spoken word retrieval systems.
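
For the keyword spotting use case, one simple way to use discovered segments is to compare each utterance against a prototype for the keyword. The sketch below assumes that segments have already been mapped to fixed-size embeddings. The embeddings, the `threshold` value, and the function name `spot_keyword` are hypothetical, and the source does not specify this scoring scheme.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def spot_keyword(prototype, utterance_segments, threshold=0.8):
    """Return ids of utterances likely to contain the keyword, best match first.

    An utterance is scored by its most similar segment; utterances whose
    best score falls below `threshold` are dropped.
    """
    scored = []
    for utt_id, segments in utterance_segments.items():
        best = max((cosine(prototype, seg) for seg in segments), default=0.0)
        if best >= threshold:
            scored.append((best, utt_id))
    return [utt_id for _, utt_id in sorted(scored, reverse=True)]

# Toy data (assumed): two-dimensional embeddings for candidate segments.
prototype = [1.0, 0.0]
utterance_segments = {
    "utt1": [[0.9, 0.1], [0.0, 1.0]],
    "utt2": [[0.0, 1.0]],
    "utt3": [[1.0, 0.0]],
}

hits = spot_keyword(prototype, utterance_segments)
print(hits)
```

Scoring by the single best segment keeps the result easy to inspect, since every hit points back to one specific stretch of audio.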

Topics

Speech-to-Text
Low-Resource Languages
Unsupervised Learning
Image Captioning
Spoken Word Retrieval
Keyword Spotting

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.