Giter VIP home page Giter VIP logo

soundstorm's Introduction

SoundStorm: Efficient Parallel Audio Generation (wip)

Unofficial Pytorch implementation of SoundStorm, a Parallel Audio Generation out of Google Research.

Currently, we first provide the first version code by ourselves. We directly use a mask-based discrete diffusion to implement this, which enjoys the same process as Google's paper. For model details, please refer to our paper, InsturctTTS: https://arxiv.org/pdf/2301.13662.pdf

We will soon update the second version based on MASKGIT, which keep the same as SoundStorm.

Overview

Following the paper, we use HuBERT to extract semantic tokens, and then using semantic token as condition to predict all of the acoustic tokens in parallel. Different with SoundStrom to use sum operation to combine the multiple codebook, we use shallow u-net to combine different codebook. For AudioCodec, we use the open source AcademiCodec https://github.com/yangdongchao/AcademiCodec

Prepare dataset

Please refer to data_sample folder to understood how to prepare the dataset.

Training

  • Firtsly, prepare your data
  • bash start/start.sh

Inference

  • Firstly, revise evaluation/generate_samples_batch.py based on your model.
  • python generate_samples_batch.py

Reference

@article{yang2023instructtts,
  title={InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt},
  author={Yang, Dongchao and Liu, Songxiang and Huang, Rongjie and Lei, Guangzhi and Weng, Chao and Meng, Helen and Yu, Dong},
  journal={arXiv preprint arXiv:2301.13662},
  year={2023}
}
@article{google_soundstorm,
  title={SoundStorm: Efficient Parallel Audio Generation},
  author={Zal´an Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi},
  journal={arXiv preprint arXiv:2305},
  year={2023}
}
@article{yang2023hifi,
  title={HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec},
  author={Yang, Dongchao and Liu, Songxiang and Huang, Rongjie and Tian, Jinchuan and Weng, Chao and Zou, Yuexian},
  journal={arXiv preprint arXiv:2305.02765},
  year={2023}
}

soundstorm's People

Contributors

liusongxiang avatar yangdongchao avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.