Connecting Speech to Words through Images
Summary
This work presents a visually grounded method that builds a vocabulary of spoken words and links each one to its written form without explicit textual supervision. An image captioning system first produces a written vocabulary from the salient visual concepts in a set of images. The method then collects the spoken utterances whose associated image captions contain a given word. An unsupervised word discovery technique aligns those utterances to locate and segment the spoken instances of that word, which ties each spoken segment to its written counterpart. In experiments, the method outperformed a strong neural baseline on spoken word retrieval and keyword spotting while also being more interpretable. The results show that the approach is feasible in English and suggest its potential for low-resource languages that lack transcripts.
Key takeaway
For NLP engineers building speech-to-text systems for low-resource languages, this visually grounded approach offers a way to build an initial spoken word vocabulary without text transcripts. Consider combining image captioning with unsupervised word discovery to map speech segments to written words. Doing so can reduce the annotation effort and improve interpretability in keyword spotting and spoken word retrieval applications.
Key insights
Spoken and written word mappings can be learned without text supervision by grounding them in shared visual contexts from images and their descriptions.
Principles
- Visual context bridges the gap between speech and text.
- Unsupervised alignment can segment speech.
- Interpretability enhances word discovery.
Method
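Use an image captioning system to generate the written vocabulary. For each word, find the spoken utterances whose image captions contain it. Apply unsupervised word discovery to align those utterances and segment the spoken word, linking each segment to its written form without text transcripts.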
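The first two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy captions, the utterance-to-image pairing, the stopword list, the `min_count` threshold, and the function names `build_vocabulary` and `group_utterances` are all hypothetical. The captions stand in for the output of an image captioning system, and the word discovery step that would segment each group is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: machine-generated captions per image, and
# spoken utterances paired with the image they describe.
captions = {
    "img1": "a dog runs on the beach",
    "img2": "a brown dog plays with a ball",
    "img3": "a child kicks a ball",
}
utterances = [("utt1", "img1"), ("utt2", "img2"), ("utt3", "img3")]

# Assumed stopword list; a real system would need a better filter.
STOPWORDS = {"a", "the", "on", "with"}


def build_vocabulary(captions, min_count=2):
    """Keep caption words that occur in at least `min_count` captions."""
    counts = Counter()
    for text in captions.values():
        # Count each word once per caption, ignoring stopwords.
        counts.update(set(text.lower().split()) - STOPWORDS)
    return {word for word, count in counts.items() if count >= min_count}


def group_utterances(utterances, captions, vocab):
    """Map each vocabulary word to the utterances whose image caption contains it."""
    groups = defaultdict(list)
    for utt_id, img_id in utterances:
        caption_words = set(captions[img_id].lower().split())
        for word in sorted(caption_words & vocab):
            groups[word].append(utt_id)
    return dict(groups)


vocab = build_vocabulary(captions)
groups = group_utterances(utterances, captions, vocab)
```

Each resulting group (for example, every utterance whose caption mentions "dog") is what an unsupervised word discovery step would then align to find the shared spoken segment.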
In practice
- Build spoken vocabularies for low-resource languages.
- Improve keyword spotting in untranscribed audio.
- Enhance spoken word retrieval systems.
Topics
- Speech-to-Text
- Low-Resource Languages
- Unsupervised Learning
- Image Captioning
- Spoken Word Retrieval
- Keyword Spotting
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.