kdaip / StableTTS
Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3
License: MIT License
I am training a TTS model and wondering about speaker styles. My dataset contains multiple speakers with different speaking styles. Does the model retain each voice's style, use only one style, or does it depend on the reference audio? For example, my dataset contains an Indian speaker who pauses nervously in conversation. If I train on the whole dataset and then run inference with one audio clip from that speaker, will the output inherit the nervous speaking style? I eagerly await your response, and thanks for this great repo.
Great project! After training, inference works and the pronunciation is correct.
However, compared with the training material, the audio quality doesn't sound very bright or crisp (I have confirmed this is not a quality issue with the training material itself).
I checked that the training audio sample rate matches the config: 44100.
How can I improve the inference audio quality?
Thanks again!
Since it depends on PyTorch, I guess it won't run on this hardware. Has anyone gotten it working?
Hi,
I discovered this project and the results are pretty amazing. I saw that it was updated to support Japanese, which made me curious how many epochs or hours of training the current pretrained models required. If it's possible to share, I'd be very thankful: I'm going to train Japanese from scratch and would like some idea of the time and hardware I'll need.
Thanks!
concat.zip
I don't know whether it supports zero-shot
As the title says.
I have been working with your repo for a while now, and it is great and fast; I love it a lot. However, the model you released fails to pronounce some words correctly. When will you release the second model? Thanks a lot for providing this repo to the community.
Loaded custom dictionary successfully
Loaded custom dictionary successfully
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0425 17:05:53.758841 4566 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=187465152
Traceback (most recent call last):
  File "train.py", line 106, in
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 145, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGBUS
env
apex 0.1+2a4864d.abi0.dtk2310.torch2.1
torch 2.1.0a0+git793d2b5.abi0.dtk2310
torchaudio 2.1.2+4b32183.abi0.dtk2310.torch2.1.0a0
torchvision 0.16.0+git267eff6.abi0.dtk2310.torch2.1.0
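Not part of the original report, but a common first check: SIGBUS in spawned PyTorch worker processes is frequently caused by an exhausted shared-memory filesystem (/dev/shm), especially inside Docker containers, which default to 64 MB. A diagnostic sketch (the image name below is hypothetical):

```shell
# SIGBUS in torch.multiprocessing workers often means /dev/shm is
# too small for inter-process tensor sharing. Check its size:
df -h /dev/shm

# In Docker the default is 64 MB; restart the container with a
# larger allocation, e.g. (hypothetical image name):
#   docker run --shm-size=8g my-tts-image
```

If /dev/shm is already large, other causes (e.g. a truncated memory-mapped dataset file) are worth ruling out.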
Are there any concrete numbers on performance for CPU and GPU inference?
Hi @KdaiP, nice work. I'd just like to know: is this architecture intended to support zero-shot TTS, or normal multi-speaker TTS?
Hey hey @KdaiP,
Thanks for open-sourcing your implementation. I'm VB, I work in the open source audio team at Hugging Face. I'd love to know more and see how we can potentially help you with your experiments or share some of the learnings.
If you are interested then feel free to ping me at vaibhav[at]hf[dot]co
Looking forward to the release of the checkpoints!
Cheers,
VB
Hi! Thanks for open-sourcing your work! I like the idea, so I tried copying the CFM decoder and using it in my TTS setup. At first I had NaN issues because the attention in the estimator was computed over all the padded positions as well. I fixed this by using masked_fill with the x_mask instead of just multiplying by the x_mask, although I'm not sure why that was necessary for me.
Once I could run a forward pass and the CFM loss was not NaN, I thought it was good to go. However, more problems came up: no matter what I tried, the CFM decoder would always produce a couple of NaN values after its first update during training. Do you have any idea what could cause this? Is there anything specific I need to do to stabilize training? The setup I am using works very well for lots of other architectures, including e.g. the normalizing-flow decoder of PortaSpeech, which seems pretty similar.
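The masking distinction described above can be illustrated with a toy softmax in plain Python (not the repo's code): multiplying padded attention logits by a 0/1 mask only sets them to 0, and a logit of 0 still receives exp(0) weight after softmax, so padding leaks into the attention. A masked_fill-style fill with -inf gives padded positions exactly zero weight.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

scores = [2.0, 1.0, 5.0]
mask = [1, 1, 0]  # last position is padding

# Multiplying by the mask: the padded logit becomes 0, which still
# gets exp(0) = 1 in the softmax numerator -> nonzero weight on padding.
zeroed = [s * m for s, m in zip(scores, mask)]
w_zero = softmax(zeroed)

# masked_fill-style: set padded logits to -inf, so exp(-inf) = 0 and
# the padded position receives exactly zero attention weight.
filled = [s if m else float("-inf") for s, m in zip(scores, mask)]
w_fill = softmax(filled)
```

Here `w_zero[2]` is strictly positive while `w_fill[2]` is exactly 0.0, which is why masking the logits before softmax (rather than zeroing the outputs) matters for padded sequences.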