lina-speech's Introduction

lina-speech (beta)

Exploring "linear attention" for text-to-speech.

It predicts audio codec tokens "à la" MusicGen: the residual vector quantizer codebooks are predicted with a delay pattern, so a single model is enough instead of several.
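
As a rough illustration, here is a minimal sketch of that delay pattern (assuming codec codes of shape (n_quantizers, T) and a dedicated padding token; names and shapes are placeholders, not the project's actual implementation):

```python
import torch

def delay_rvq(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """MusicGen-style delay pattern: shift codebook k right by k frames,
    so each level only depends on levels already emitted at earlier steps."""
    n_q, T = codes.shape
    out = codes.new_full((n_q, T + n_q - 1), pad_id)
    for k in range(n_q):
        out[k, k:k + T] = codes[k]
    return out
```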

Featuring RWKV, Mamba, Gated Linear Attention.

Compared to other LM-based TTS models:

  • Can be easily pretrained and finetuned on midrange GPUs.
  • Tiny memory footprint.
  • Trained on long context (up to 2000 tokens, ~27 s).

Models

Model     #Params     Dataset            Checkpoint   Steps   Note
GLA       60M, 130M   LibriLight-medium  Download     300k    GPU inference only
Mamba     60M         LibriLight-medium  Download     300k    GPU inference only
RWKV v6   60M         LibriTTS           Download     150k    GPU inference only

Installation

Depending on which linear-complexity LM you choose, first follow the respective installation instructions:

Inference

Download configuration and weights above, then check Inference.ipynb.
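
As a quick sanity check before opening the notebook, here is a hedged sketch for inspecting a downloaded checkpoint (the filename is a placeholder; the actual model construction and loading logic live in Inference.ipynb):

```python
import torch

# Placeholder filename: use whatever the Download link above gives you.
ckpt = torch.load("lina_gla_60m.pt", map_location="cuda")

# Inspect what the file contains (state dict, config, ...) before wiring it
# into the model classes used by Inference.ipynb.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```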

TODO

  • Fix RWKV6 inference and/or switch to the FLA implementation.
  • Provide a DataModule for training (lhotse might also work well).
  • Implement CFG (classifier-free guidance; see the sketch after this list).
  • Scale up.
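
Regarding the CFG item, here is a minimal sketch of classifier-free guidance at sampling time (blending conditional and unconditional logits; the function name and default scale are illustrative, not the project's API):

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push the prediction away from the
    unconditional (text-dropped) logits toward the text-conditional ones."""
    return uncond_logits + scale * (cond_logits - uncond_logits)
```

At scale = 1.0 this reduces to the conditional logits; larger values trade diversity for closer adherence to the text prompt.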

Acknowledgment

  • The RWKV authors and the surrounding community for carrying out high-level, truly open-source research.
  • @SmerkyG for making it easy to test cutting-edge language models.
  • @lucidrains for his huge codebase.
  • @sustcsonglin, who made GLA and FLA.
  • @harrisonvanderbyl for fixing RWKV inference.

Cite

@software{lemerle2024linaspeech,
  title  = {LinaSpeech: Exploring "linear attention" for text-to-speech.},
  author = {Lemerle, Théodor},
  url    = {https://github.com/theodorblackbird/lina-speech},
  month  = apr,
  year   = {2024}
}

IRCAM

This work is carried out in the Analysis/Synthesis team of the STMS Laboratory at IRCAM, as part of the ANR Exovoices project.

lina-speech's Issues

Training on a custom dataset?

Hi Theodor, this project looks very interesting!

I would really like to try this out on the Norwegian NST dataset.

Can you give me some pointers as to what kind of processing I'd have to do in order to mimic the dataset structure you're using?

Model Scaling, finetuning recipe

Hi @theodorblackbird
Currently the model sounds good, but I think that if you scale it up it will get better at picking up prosody and timbre from the prompt and will sound much more natural.
One suggestion I can give is to scale the model to around 300M parameters and train it on the latest 10k hours of Hugging Face's TTS data: https://huggingface.co/datasets/parler-tts/mls-eng-10k-tags_tagged_10k_generated
https://github.com/huggingface/dataspeech
I have tested this with VoiceCraft (https://github.com/jasonppy/VoiceCraft), which is also based on a delayed RVQ pattern. I trained a 330M VoiceCraft on 1k hours of multilingual data and it sounds amazing and very natural; it has some noise and the voice is not that crisp, but that is due to the lower sample rate of 16000 Hz.

mamba2 support

Mamba2 was released, so the current version no longer works. To fix it, you must install v1 with:
pip install mamba-ssm==1.2.2

Is it possible to add support for Mamba2?
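
A minimal runtime guard for that pin (a sketch; the distribution name matches the pip command above):

```python
import importlib.metadata

# The Mamba path currently expects mamba-ssm v1.x; Mamba2 (v2.x) changed the
# API, hence the pin suggested above.
version = importlib.metadata.version("mamba-ssm")
if not version.startswith("1."):
    raise RuntimeError(
        f"mamba-ssm=={version} found; pin it with: pip install mamba-ssm==1.2.2"
    )
```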

Congratulations! Model checkpoint on the Hugging Face Hub?

Hi there,

I'm VB, I lead the advocacy efforts for open-source audio at Hugging Face.
Congratulations on such a brilliant checkpoint; even at 60M parameters the model performs remarkably well.

It'd be great if you could release the data preparation steps along with the model checkpoints (the ones used on the demo page).

I'd personally love to scale this up to more data and perhaps test it on multilingual datasets like CML-TTS and so on.

More than happy to help you in any way needed!

Cheers,
VB
