kdaip / StableTTS
Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
License: MIT License
I am training a TTS model and wondering about speaker styles. My dataset contains multiple speakers with different speaking styles. Does the model retain each voice's style, use only one style, or does it depend on the reference audio? For example, my dataset contains an Indian speaker who pauses nervously in conversation. If I train on the whole dataset and then run inference with one audio clip from that speaker, will the output inherit the nervous speaking style? I eagerly await your response, and thanks for this great repo.
Great project! After training, inference works and the pronunciation is correct.
However, compared with the training material, the audio quality doesn't sound very bright or crisp (I have confirmed this is not a quality issue with the training material itself).
I checked that the training audio sample rate matches the config: 44100.
How can I improve the inference audio quality?
Thanks again!
Since it depends on PyTorch, I guess it won't run on this hardware. Has anyone gotten it working?
Hi,
I discovered this project and the results are pretty amazing. I saw that it was updated to support Japanese, which made me curious how many epochs or hours of training the current pretrained models required. If it's possible to share, I'd be very thankful: I'm going to train Japanese from scratch and would like some idea of the time and hardware I'll need.
Thanks!
concat.zip
I don't know whether it supports zero-shot
As the title says.
I have been working with your repo for a while now, and it is great and fast; I love it a lot. However, the model you released fails to pronounce some words correctly. When will you release the second model? Thanks a lot for providing this repo to the community.
Loaded custom dictionary successfully
Loaded custom dictionary successfully
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0425 17:05:53.758841 4566 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=187465152
Traceback (most recent call last):
  File "train.py", line 106, in
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 145, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS
env
apex 0.1+2a4864d.abi0.dtk2310.torch2.1
torch 2.1.0a0+git793d2b5.abi0.dtk2310
torchaudio 2.1.2+4b32183.abi0.dtk2310.torch2.1.0a0
torchvision 0.16.0+git267eff6.abi0.dtk2310.torch2.1.0
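Not part of the original report, but a common first check: SIGBUS in spawned PyTorch worker processes is frequently caused by an exhausted shared-memory filesystem (/dev/shm), especially inside Docker containers, which default to 64 MB. A diagnostic sketch (the image name below is hypothetical):

```shell
# SIGBUS in torch.multiprocessing workers often means /dev/shm is
# too small for inter-process tensor sharing. Check its size:
df -h /dev/shm

# In Docker the default is 64 MB; restart the container with a
# larger allocation, e.g. (hypothetical image name):
#   docker run --shm-size=8g my-tts-image
```

If /dev/shm is already large, other causes (e.g. a truncated memory-mapped dataset file) are worth ruling out.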
Are there any concrete numbers on performance for CPU and GPU inference?
Hi @KdaiP, nice work. I'd just like to know: is this architecture intended to support zero-shot TTS, or normal multi-speaker TTS?
Hey hey @KdaiP,
Thanks for open-sourcing your implementation. I'm VB, I work in the open source audio team at Hugging Face. I'd love to know more and see how we can potentially help you with your experiments or share some of the learnings.
If you are interested then feel free to ping me at vaibhav[at]hf[dot]co
Looking forward to the release of the checkpoints!
Cheers,
VB
Hi! Thanks for open-sourcing your work! I like the idea, so I tried copying the CFM decoder and using it in my TTS setup. At first I had NaN issues because the attention in the estimator was computed over all the padded positions as well. I fixed this by using masked_fill with the x_mask instead of just multiplying by the x_mask, although I'm not sure why that was necessary for me.
Once I could run a forward pass and the CFM loss was not NaN, I thought it was good to go. However, more problems came up: no matter what I tried, the CFM decoder would always produce a couple of NaN values after its first update during training. Do you have any idea what could cause this? Is there anything specific I need to do to stabilize training? The setup I am using works very well for lots of other architectures, including e.g. the normalizing-flow decoder of PortaSpeech, which seems pretty similar.
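The masking distinction described above can be illustrated with a toy softmax in plain Python (not the repo's code): multiplying padded attention logits by a 0/1 mask only sets them to 0, and a logit of 0 still receives exp(0) weight after softmax, so padding leaks into the attention. A masked_fill-style fill with -inf gives padded positions exactly zero weight.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

scores = [2.0, 1.0, 5.0]
mask = [1, 1, 0]  # last position is padding

# Multiplying by the mask: the padded logit becomes 0, which still
# gets exp(0) = 1 in the softmax numerator -> nonzero weight on padding.
zeroed = [s * m for s, m in zip(scores, mask)]
w_zero = softmax(zeroed)

# masked_fill-style: set padded logits to -inf, so exp(-inf) = 0 and
# the padded position receives exactly zero attention weight.
filled = [s if m else float("-inf") for s, m in zip(scores, mask)]
w_fill = softmax(filled)
```

Here `w_zero[2]` is strictly positive while `w_fill[2]` is exactly 0.0, which is why masking the logits before softmax (rather than zeroing the outputs) matters for padded sequences.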