Learning Word-like Units from Joint Audio-Visual Analysis
created by David Harwath and James R. Glass at the Massachusetts Institute of Technology
- Goal: A method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions, given a collection of images and spoken audio captions.