This was done as part of the final project for my course CSE 291: Statistical NLP.
The idea is to build acoustic embeddings where sentences that are acoustically similar are mapped. This can be extended for words and phonemens as well.
The proposed model uses a Variational Autoencoder to model the acoustic embedding of the given speech data