speech-conversion's Introduction

Whisper to Normal Speech Conversion with SC-MelGAN and SC-VQ-VAE

This repository contains the source code for the paper Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech. The goal was to adapt MelGAN and VQ-VAE systems to convert whispered speech into normal speech.

The MelGAN code used as basis for this project can be found here.

The VQ-VAE model is based on Deepmind's VQ-VAE implementation (see here), Andrej Karpathy's implementation, and this repo.

The WaveGlow system is a slightly adapted version of the code provided by NVIDIA.

Please visit our demo website for samples.

Structure of this Repository

The repo is structured as follows:

    .
    └── speech-conversion
        ├── melgan             -> Sources for training MelGAN models
        │   └── mel2wav
        ├── vqvae              -> Sources for training VQ-VAE models
        └── waveglow           -> Sources for training WaveGlow models
            └── tacotron2

Dataset

The code is designed to be used with the wTIMIT corpus, which can be downloaded here (note: requires authentication). The wTIMIT dataset is sampled at 44 kHz and must be resampled to 16 kHz before use. The 16 kHz rate is hardcoded in several places in this project, so using a different sample rate without modifying the source code will likely lead to errors.
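Because the sample rate is hardcoded, it can help to confirm that every file was actually resampled before training starts. The following is a minimal sketch of such a check, not part of this repository: the helper names `wav_sample_rate` and `find_wrong_rate` are hypothetical, and it assumes the files are plain PCM WAV files that Python's standard `wave` module can read.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16_000  # the rate this README says is hardcoded in the project


def wav_sample_rate(path):
    """Return the sample rate stored in the header of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def find_wrong_rate(wav_dir, expected=EXPECTED_RATE):
    """List (file name, rate) pairs for files whose rate differs from `expected`."""
    wrong = []
    for path in sorted(Path(wav_dir).iterdir()):
        # Match .wav and .WAV alike; skip everything else in the directory.
        if path.suffix.lower() != ".wav":
            continue
        rate = wav_sample_rate(path)
        if rate != expected:
            wrong.append((path.name, rate))
    return wrong
```

Calling `find_wrong_rate("wavs")` should return an empty list once all files have been resampled. Any entry it returns names a file that still needs resampling.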

Preparing the Dataset

Create a directory that holds all samples, for example in a wavs/ subfolder. You'll need to provide filelists containing your test and training data. A simple way to create these filelists is as follows:

    ls wavs/*n.WAV | tail -n +11 > train_files.txt         # normal utterances after the first 10
    ls wavs/*w.WAV | head -n 10 > test_files.txt           # first 10 whispered utterances
    ls wavs/*n.WAV | head -n 10 > normal_test_files.txt    # normal test data for WaveGlow

Note that we only grab the whispered utterances (the ones with "w" at the end) for the test set.
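The same split can be sketched in Python, which may be convenient when `ls`, `head`, and `tail` are not available. This is an illustrative sketch, not code from this repository: `split_filelists` and `write_filelists` are hypothetical helpers, and the sketch assumes the naming scheme used above (normal utterances end in `n.WAV`, whispered ones in `w.WAV`). The training list starts after the held-out files, so the two lists of normal utterances do not overlap. Python sorts by code point, so the order may differ slightly from what `ls` produces under some locales.

```python
from pathlib import Path


def split_filelists(wav_dir, n_test=10):
    """Split file names into train and test lists.

    Assumes normal utterances end in "n.WAV" and whispered ones in "w.WAV".
    """
    wav_dir = Path(wav_dir)
    normal = sorted(str(p) for p in wav_dir.glob("*n.WAV"))
    whispered = sorted(str(p) for p in wav_dir.glob("*w.WAV"))
    return {
        "train_files.txt": normal[n_test:],         # normal, after the held-out files
        "test_files.txt": whispered[:n_test],       # whispered test utterances
        "normal_test_files.txt": normal[:n_test],   # normal test data for WaveGlow
    }


def write_filelists(lists, out_dir="."):
    """Write each list to its file, one path per line."""
    for name, paths in lists.items():
        Path(out_dir, name).write_text("".join(f"{p}\n" for p in paths))
```

For example, `write_filelists(split_filelists("wavs"))` writes the three filelists into the current directory.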

Training

See the following scripts for examples of how to train MelGAN, VQ-VAE, and WaveGlow models:

  • train_melgan.sh
    • Add your own paths to the variables SAVE_PATH, DATA_PATH, and LOAD_PATH
  • train_vqvae.sh
    • Add your own paths to the variables SAVE_PATH, DATA_PATH, LOAD_PATH, and WG_PATH
  • train_waveglow.sh
    • Create your own config file for the WaveGlow model or use an existing one and point to it via the --config flag
    • Note that the original WaveGlow model is incompatible with the Mel spectrogram features generated for MelGAN and VQ-VAE training
    • Hence, a pretrained WaveGlow model will not yield good results when it is fed spectrograms generated by VQ-VAE
    • A training script that produces a compatible model can be found in speech-conversion/waveglow/train_melgan_comapt.py and is also referenced in train_waveglow.sh

Note: The Python scripts need to run with the -m command line flag and without the .py extension (e.g. python -m app.sub1.mod1) due to the relative imports used across the sub-packages.
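To illustrate the naming rule, here is a small sketch that converts a script path into the dotted module name expected by `-m`. The helpers `module_name` and `run_command` are hypothetical and not part of this repository. The path is assumed to be relative to the directory that contains the top-level package, which is also the directory from which the command should be run.

```python
from pathlib import PurePosixPath


def module_name(script_path):
    """Turn a script path such as "app/sub1/mod1.py" into "app.sub1.mod1"."""
    path = PurePosixPath(script_path)
    if path.suffix != ".py":
        raise ValueError(f"not a Python script: {script_path}")
    # Drop the .py extension and join the path components with dots.
    return ".".join(path.with_suffix("").parts)


def run_command(script_path):
    """Build the argument list for running the script as a module."""
    return ["python", "-m", module_name(script_path)]
```

With the example from the note, `run_command("app/sub1/mod1.py")` yields the arguments for `python -m app.sub1.mod1`.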

Inference

Inference can be done with the following scripts:

  • inference_melgan.sh
  • inference_vqvae.sh

speech-conversion's Issues

Training epochs

How many epochs are needed to train MelGAN? With the default of 3000 epochs, the generated WAV file contains only noise. Could you also provide your pretrained models?
