lina-speech's Introduction

lina-speech (beta)

Exploring "linear attention" for text-to-speech.

It predicts audio codec tokens "à la" MusicGen: the residual vector quantizer codebooks are predicted with a delay pattern, so a single model is enough instead of several.
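
As a rough illustration, here is a minimal sketch of that delay pattern (assuming codec codes of shape (n_quantizers, T) and a dedicated padding token; names and shapes are placeholders, not the project's actual implementation):

```python
import torch

def delay_rvq(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """MusicGen-style delay pattern: shift codebook k right by k frames,
    so each level only depends on levels already emitted at earlier steps."""
    n_q, T = codes.shape
    out = codes.new_full((n_q, T + n_q - 1), pad_id)
    for k in range(n_q):
        out[k, k:k + T] = codes[k]
    return out
```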

Featuring RWKV, Mamba, Gated Linear Attention.

Compared to other LM-based TTS models:

  • Can be easily pretrained and finetuned on midrange GPUs.
  • Tiny memory footprint.
  • Trained on long context (up to 2000 tokens, ~27 s).

Models

Model     #Params     Dataset            Checkpoint   Steps   Note
GLA       60M, 130M   LibriLight-medium  Download     300k    GPU inference only
Mamba     60M         LibriLight-medium  Download     300k    GPU inference only
RWKV v6   60M         LibriTTS           Download     150k    GPU inference only

Installation

Depending on which linear-complexity LM you choose, first follow the respective installation instructions:

Inference

Download configuration and weights above, then check Inference.ipynb.
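
As a quick sanity check before opening the notebook, here is a hedged sketch for inspecting a downloaded checkpoint (the filename is a placeholder; the actual model construction and loading logic live in Inference.ipynb):

```python
import torch

# Placeholder filename: use whatever the Download link above gives you.
ckpt = torch.load("lina_gla_60m.pt", map_location="cuda")

# Inspect what the file contains (state dict, config, ...) before wiring it
# into the model classes used by Inference.ipynb.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```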

TODO

  • Fix RWKV6 inference and/or switch to the FLA implementation.
  • Provide a DataModule for training (lhotse might also work well).
  • Implement CFG (classifier-free guidance; see the sketch after this list).
  • Scale up.
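
Regarding the CFG item, here is a minimal sketch of classifier-free guidance at sampling time (blending conditional and unconditional logits; the function name and default scale are illustrative, not the project's API):

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: push the prediction away from the
    unconditional (text-dropped) logits toward the text-conditional ones."""
    return uncond_logits + scale * (cond_logits - uncond_logits)
```

At scale = 1.0 this reduces to the conditional logits; larger values trade diversity for closer adherence to the text prompt.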

Acknowledgment

  • The RWKV authors and the surrounding community for carrying out high-level, truly open-source research.
  • @SmerkyG for making it easy to test cutting-edge language models.
  • @lucidrains for his huge codebase.
  • @sustcsonglin, who made GLA and FLA.
  • @harrisonvanderbyl for fixing RWKV inference.

Cite

@software{lemerle2024linaspeech,
  title  = {LinaSpeech: Exploring "linear attention" for text-to-speech.},
  author = {Lemerle, Théodor},
  url    = {https://github.com/theodorblackbird/lina-speech},
  month  = apr,
  year   = {2024}
}

IRCAM

This work is carried out in the Analysis/Synthesis team of the STMS Laboratory at IRCAM, as part of the ANR Exovoices project.

lina-speech's Issues

Training on a custom dataset?

Hi Theodor, this project looks very interesting!

I would really like to try this out on the Norwegian NST dataset.

Can you give me some pointers as to what kind of processing I'd have to do in order to mimic the dataset structure you're using?

Model Scaling, finetuning recipe

Hi @theodorblackbird
Currently the model sounds good, but I think that if you scale it up it will get better at picking up prosody and timbre from the prompt and will sound much more natural.
One suggestion I can give is to scale the model to around 300M parameters and train it on the latest 10k hours of Hugging Face's TTS data: https://huggingface.co/datasets/parler-tts/mls-eng-10k-tags_tagged_10k_generated
https://github.com/huggingface/dataspeech
I have tested this with VoiceCraft (https://github.com/jasonppy/VoiceCraft), which is also based on a delayed RVQ pattern. I trained a 330M VoiceCraft on 1k hours of multilingual data and it sounds amazing and very natural; it has some noise and the voice is not that crisp, but that is due to the lower sample rate of 16000 Hz.

mamba2 support

Mamba2 was released, so the current version no longer works. To fix it, you must install v1 with:
pip install mamba-ssm==1.2.2

Is it possible to add support for Mamba2?
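
A minimal runtime guard for that pin (a sketch; the distribution name matches the pip command above):

```python
import importlib.metadata

# The Mamba path currently expects mamba-ssm v1.x; Mamba2 (v2.x) changed the
# API, hence the pin suggested above.
version = importlib.metadata.version("mamba-ssm")
if not version.startswith("1."):
    raise RuntimeError(
        f"mamba-ssm=={version} found; pin it with: pip install mamba-ssm==1.2.2"
    )
```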

Congratulations! Model checkpoint on the Hugging Face Hub?

Hi there,

I'm VB, I lead the advocacy efforts for open-source audio at Hugging Face.
Congratulations on such a brilliant checkpoint; even at 60M parameters the model performs remarkably well.

It'd be great if you could release the data preparation steps along with the model checkpoints (the ones used on the demo page).

I'd personally love to scale this up to more data and perhaps test it on multilingual datasets like CML-TTS and so on.

More than happy to help you in any way needed!

Cheers,
VB
