Learning Word-like Units from Joint Audio-Visual Analysis

Posted on 2017-09-29 | In Note

Created by David Harwath and James R. Glass at the Massachusetts Institute of Technology

Goal: A method for discovering word-like acoustic units in the continuous speech signal and grounding them to semantically relevant image regions, given a collection of images and spoken audio captions.

He Guoxiu

Some notes on interesting papers, tutorials for useful tools, and inspiration for life.
