yl4579 / StyleTTS
Official Implementation of StyleTTS
License: MIT License
The line
from monotonic_align import mask_from_lens
fails with the error: cannot import name "mask_from_lens" from "monotonic_align".
My monotonic_align version is 1.0.0.
Are the Hifigan checkpoints copied from somewhere? The paper is suggestive of https://github.com/kan-bayashi/ParallelWaveGAN, but https://drive.google.com/drive/folders/10jBLsjQT3LvR-3GgPZpRvWIWvpGjzDnM (the LibriTTS Hifigan checkpoint from https://github.com/kan-bayashi/ParallelWaveGAN#results) has 2500000 steps, as opposed to g_00935000 from the link under StyleTTS, so it doesn't match.
Did the authors pretrain Hifigan themselves from scratch?
Hi!
Thanks for the great work!
I trained a model on ryanspeech, and it sounds great!
I have a small dataset of around 2h, so is it possible to finetune an existing model on a small dataset?
Thanks!
And happy Chinese New Year!
Correct me if I'm wrong. The other issue I opened is actually a soundfile-related error (I found this out when I downgraded the soundfile version).
It can't read "wavs/22.wav" because the wav_path in meldataset.py has to include the directory of the train_list.txt file. Everything works, though, if the wavs folder is kept in the StyleTTS directory (with train_list.txt located anywhere, as given in the config file).
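A minimal sketch of one way to avoid this, assuming the list file uses the "path|text|speaker" layout and that relative paths are meant to be relative to the list file itself. The helper names resolve_wav_path and read_entries are hypothetical and are not part of meldataset.py.

```python
import os


def resolve_wav_path(list_file, wav_field):
    """Return wav_field unchanged if absolute, else join it to the list file's directory."""
    if os.path.isabs(wav_field):
        return wav_field
    base_dir = os.path.dirname(os.path.abspath(list_file))
    return os.path.normpath(os.path.join(base_dir, wav_field))


def read_entries(list_file):
    """Yield (resolved_wav_path, remaining_fields) for each non-empty line."""
    with open(list_file, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Only the first field is a path; the rest is text and speaker id.
            wav_field, _, rest = line.partition("|")
            yield resolve_wav_path(list_file, wav_field), rest
```

With this, "wavs/22.wav" in a list stored at /data/train_list.txt would resolve to /data/wavs/22.wav regardless of the working directory.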
Thanks for your work. I tried to train stage 1 on a multi-Chinese dataset of about 420+ hours. Training proceeds normally until, at some step, the s2s loss becomes NaN; after that, the parameters of the text aligner are NaN too.
What can I do to avoid this problem?
I tried adding an epsilon to _s2s_pred and setting the loss to 0, but the parameters are still NaN.
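Once the weights themselves contain NaN, zeroing later losses cannot repair them, so the check has to happen before the update is applied. Below is a hedged sketch of that idea, not a fix confirmed in this thread: a small guard that tells the training loop to skip a step whose loss is not finite. The class name NonFiniteGuard and its parameters are hypothetical; in a PyTorch loop it would be called with the scalar loss value before the backward pass and optimizer step.

```python
import math


class NonFiniteGuard:
    """Tracks non-finite losses so the training loop can skip those updates."""

    def __init__(self, max_consecutive=10):
        self.max_consecutive = max_consecutive
        self.consecutive = 0  # non-finite losses seen in a row
        self.skipped = 0      # total skipped steps

    def should_skip(self, loss_value):
        """Return True if this step should be skipped (loss is NaN or inf)."""
        if math.isfinite(loss_value):
            self.consecutive = 0
            return False
        self.consecutive += 1
        self.skipped += 1
        if self.consecutive > self.max_consecutive:
            # Repeated failures suggest the weights are already corrupted.
            raise RuntimeError("too many consecutive non-finite losses")
        return True
```

If the guard raises, restarting from the last good checkpoint with a lower learning rate or gradient clipping is a common next step.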
Hello there! I'm new to speech synthesis. What's the process for setting up a dataset like VCTK (downsampled to 24 kHz) and putting it in the proper format, as in the LibriTTS example, "filename.wav|transcription"?
"LibriTTS/train-clean-460/7169/89735/7169_89735_000071_000003.wav|wˈʌn mˌeɪd ˌʌp wˈʌnz mˈaɪnd ðə lˈoʊn wˈʊlf mˈʌst biː ɐ sˈɜːʔn̩ sˈɔːɹt ʌv mˈæn; ðə ɹˈɛst wʌz sˈɪmpli sˈɪftɪŋ fɹˈæns fɚðə mˈæn tə fˈɪt ðə θˈiəɹi, ænd ðˈɛn wˈɑːtʃɪŋ hˌɪm ʌntˈɪl hiː ɡˈeɪv hɪmsˈɛlf ɐwˈeɪ."
It's not clear to me how to get the transcription from the text and set all of that up. If anyone has set up a dataset like this before and can point me in the right direction, that would be awesome! Thanks.
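The transcription in the example is an IPA phoneme string, so each line needs a grapheme-to-phoneme step before it is written out. The sketch below only shows how a list line could be assembled; the phonemize argument is a placeholder for whatever G2P tool is used (for instance a wrapper around the phonemizer EspeakBackend that appears in another issue on this page), and make_list_line is a hypothetical helper, not part of the repository.

```python
def make_list_line(wav_path, text, phonemize, speaker_id=None):
    """Build one 'wav_path|phonemes[|speaker_id]' line for a training list.

    phonemize is any callable mapping a text string to a phoneme string.
    """
    phonemes = phonemize(text).strip()
    # '|' is the field separator, so it must not occur inside a field.
    if "|" in wav_path or "|" in phonemes:
        raise ValueError("field contains the '|' separator")
    fields = [wav_path, phonemes]
    if speaker_id is not None:
        fields.append(str(speaker_id))
    return "|".join(fields)
```

For a multi-speaker corpus such as VCTK, the speaker_id field would carry an integer index per speaker, matching the "wave_path, text, speaker_id" triple that the _load_tensor code quoted elsewhere on this page unpacks.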
Hello, Yinghao Aaron Li
I used mels from "torchaudio.transforms.MelSpectrogram" to test your pretrained HiFi-GAN model, and it works fine.
But when I use the same mels to finetune your pretrained HiFi-GAN model, the synthesized wav is noise.
When I use your vocoder.py with the HiFi-GAN code from https://github.com/jik876/hifi-gan and train the model, the synthesized wav is also fine.
What is different in your training process? Did you change something?
Hi, I find that the model produces really great outputs when synthesizing sentences of medium or long length. However, when synthesizing short sentences or words, the audio sounds weird. For example, I used the pretrained LibriTTS model to synthesize the word “people” (IPA: pˈiːpəl). No matter what reference audio I gave, long or short, the output was weird, especially at the end of the word. This phenomenon occurred when the input was short. I attached an example audio here people.wav.zip. Do you have any methods to solve or alleviate this problem? Thank you very much!
I am trying to get the inference demo to output something intelligible in this Colab.
So far all my efforts have been in vain. I made sure to upsample the wav for prosody, but I'm still getting rubbish. Could you point out where I am going wrong?
Can you provide a pretrained model for stage 1?
Thank you very much
I'm pretraining your model on the vivo dataset (Vietnamese) but the results are not what I expected.
Here is the original audio:
https://drive.google.com/file/d/12mZdg8yVhgQj35Vt3thoxIK44_jWWaCJ/view?usp=sharing
and here is the result:
https://drive.google.com/file/d/1UOuUHrxiR5DvF1MrpccMZ6bdwKyfO2AE/view?usp=sharing
This is the loss during training stage 1 and stage 2
P.S.: I used the ASR from the original article to train on Vietnamese. I wonder whether that causes problems, because during stage 1 training there were quite a lot of KeyError errors.
Thank you very much
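One possible source of those KeyError messages, stated here as an assumption rather than a confirmed diagnosis, is that the Vietnamese transcriptions contain symbols missing from the symbol-to-index dictionary used by the text cleaner. A quick way to check is to count the characters that have no index. The function name and the symbol_to_index argument are hypothetical.

```python
from collections import Counter


def find_unknown_symbols(transcriptions, symbol_to_index):
    """Count characters in the transcriptions that have no entry in the dictionary."""
    unknown = Counter()
    for text in transcriptions:
        for ch in text:
            if ch not in symbol_to_index:
                unknown[ch] += 1
    return unknown
```

A non-empty result means those symbols would be dropped or raise during training unless the dictionary (and the models trained against it) are extended to cover them.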
Thanks a lot
The pretrained ASR model uses this mapping:
"AA0",10
"AA1",11
"AA2",12
"AE0",13
"AE1",14
"AE2",15
"AH0",16
"AH1",17
"AH2",18
"AO0",19
"AO1",20
"AO2",21
"AW0",22
"AW1",23
Why?
Why isn't "attention_weights" used in train_first.py?
The code uses "alignment" instead. In train_first.py, line 153:
ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)
In layer.py:
attention_weights = F.softmax(alignment, dim=1)
attention_weights is the result after softmax, but alignment is not.
As the title suggests, how would I add a pause in between multiple sentences, for example after a full stop?
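One approach, offered as a sketch and not as the authors' recommendation, is to synthesize each sentence separately and join the waveforms with a stretch of silence. The example assumes each segment is a sequence of float samples at 24 kHz (the sample rate that the repository's data loader resamples to); join_with_pauses and pause_seconds are hypothetical names.

```python
def join_with_pauses(segments, sample_rate=24000, pause_seconds=0.3):
    """Concatenate waveform segments, inserting silence between consecutive ones."""
    silence = [0.0] * int(round(sample_rate * pause_seconds))
    out = []
    for i, segment in enumerate(segments):
        if i > 0:
            out.extend(silence)  # pause before every segment except the first
        out.extend(segment)
    return out
```

With NumPy arrays the same idea would use np.zeros for the silence and np.concatenate for the join.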
(demo) C:\Users\Administrator\StyleTTS>python train_first.py --config_path ./Configs/config.yml
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
{'max_lr': 0.0001, 'pct_start': 0.0, 'epochs': 200, 'steps_per_epoch': 3}
bert loaded
bert_encoder loaded
predictor loaded
decoder loaded
pitch_extractor loaded
text_encoder loaded
style_encoder loaded
text_aligner loaded
discriminator loaded
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\Administrator\StyleTTS\train_first.py:393 in │
│ │
│ 390 │ torch.save(state, save_path) │
│ 391 │
│ 392 if __name__=="__main__": │
│ ❱ 393 │ main() │
│ 394 │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1157 in call │
│ │
│ 1154 │ │
│ 1155 │ def call(self, *args: t.Any, **kwargs: t.Any) -> t.Any: │
│ 1156 │ │ """Alias for :meth:`main`.""" │
│ ❱ 1157 │ │ return self.main(*args, **kwargs) │
│ 1158 │
│ 1159 │
│ 1160 class Command(BaseCommand): │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1078 in main │
│ │
│ 1075 │ │ try: │
│ 1076 │ │ │ try: │
│ 1077 │ │ │ │ with self.make_context(prog_name, args, **extra) as ctx: │
│ ❱ 1078 │ │ │ │ │ rv = self.invoke(ctx) │
│ 1079 │ │ │ │ │ if not standalone_mode: │
│ 1080 │ │ │ │ │ │ return rv │
│ 1081 │ │ │ │ │ # it's not safe to `ctx.exit(rv)` here! │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:1434 in invoke │
│ │
│ 1431 │ │ │ echo(style(message, fg="red"), err=True) │
│ 1432 │ │ │
│ 1433 │ │ if self.callback is not None: │
│ ❱ 1434 │ │ │ return ctx.invoke(self.callback, **ctx.params) │
│ 1435 │ │
│ 1436 │ def shell_complete(self, ctx: Context, incomplete: str) -> t.List["CompletionItem"]: │
│ 1437 │ │ """Return a list of completions for the incomplete value. Looks │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\click\core.py:783 in invoke │
│ │
│ 780 │ │ │
│ 781 │ │ with augment_usage_errors(__self): │
│ 782 │ │ │ with ctx: │
│ ❱ 783 │ │ │ │ return __callback(*args, **kwargs) │
│ 784 │ │
│ 785 │ def forward( │
│ 786 │ │ __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any # noqa: B902 │
│ │
│ C:\Users\Administrator\StyleTTS\train_first.py:140 in main │
│ │
│ 137 │ │ │
│ 138 │ │ _ = [model[key].train() for key in model] │
│ 139 │ │ │
│ ❱ 140 │ │ for i, batch in enumerate(train_dataloader): │
│ 141 │ │ │ │
│ 142 │ │ │ batch = [b.to(device) for b in batch] │
│ 143 │ │ │ texts, input_lengths, mels, mel_input_length = batch │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:633 │
│ in next │
│ │
│ 630 │ │ │ if self._sampler_iter is None: │
│ 631 │ │ │ │ # TODO(pytorch/pytorch#76750) │
│ 632 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 633 │ │ │ data = self._next_data() │
│ 634 │ │ │ self._num_yielded += 1 │
│ 635 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 636 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:134 │
│ 5 in _next_data │
│ │
│ 1342 │ │ │ │ self._task_info[idx] += (data,) │
│ 1343 │ │ │ else: │
│ 1344 │ │ │ │ del self._task_info[idx] │
│ ❱ 1345 │ │ │ │ return self._process_data(data) │
│ 1346 │ │
│ 1347 │ def _try_put_index(self): │
│ 1348 │ │ assert self._tasks_outstanding < self._prefetch_factor * self._num_workers │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch\utils\data\dataloader.py:137 │
│ 1 in _process_data │
│ │
│ 1368 │ │ self._rcvd_idx += 1 │
│ 1369 │ │ self._try_put_index() │
│ 1370 │ │ if isinstance(data, ExceptionWrapper): │
│ ❱ 1371 │ │ │ data.reraise() │
│ 1372 │ │ return data │
│ 1373 │ │
│ 1374 │ def _mark_worker_as_unavailable(self, worker_id, shutdown=False): │
│ │
│ C:\Users\Administrator\miniconda3\envs\demo\lib\site-packages\torch_utils.py:644 in reraise │
│ │
│ 641 │ │ │ # If the exception takes multiple arguments, don't try to │
│ 642 │ │ │ # instantiate since we don't know how to │
│ 643 │ │ │ raise RuntimeError(msg) from None │
│ ❱ 644 │ │ raise exception │
│ 645 │
│ 646 │
│ 647 def _get_available_device_type(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
LibsndfileError: <exception str() failed>
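The exception is re-raised from a DataLoader worker, so the failing file is not shown. A frequent cause of libsndfile errors is a path in the training list that does not exist from the current working directory, though that is only an assumption here. The sketch below scans a list file before training and reports entries whose audio file is missing; find_missing_wavs and root_dir are hypothetical names.

```python
import os


def find_missing_wavs(list_file, root_dir="."):
    """Return (line_number, path) for every list entry whose wav file does not exist."""
    missing = []
    with open(list_file, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            wav_field = line.split("|", 1)[0]
            if os.path.isabs(wav_field):
                path = wav_field
            else:
                path = os.path.join(root_dir, wav_field)
            if not os.path.isfile(path):
                missing.append((lineno, path))
    return missing
```

Running the data loader with zero worker processes is another way to get the original, unwrapped error message.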
Hello, would it generally be possible to train StyleTTS on a multilingual dataset, e.g. by additionally conditioning the text encoder with a language embedding?
Hey, thanks for releasing the code. I came across this after reading the paper. I just wonder if you ever tried training the whole model together and how it performed compared to the 2 stage approach. Maybe it is just me missing in the paper, but I don't see a clear comparison although everything else is quite clear, especially with the Appendix. Thanks again.
https://github.com/yl4579/StyleTTS/blob/main/train_second.py#L297
The line above is not inside a with torch.no_grad() block. Is that a problem?
This is just a heads-up about the discussion in #7.
I tried training the model end-to-end in different ways, but the F0 and energy predictors were always underfitting, although the eval loss was also going down. They were never able to predict useful values for inference.
Here is roughly my forward pass. I can also share the branch if useful. Happy to see any feedback.
@typechecked
def forward_all(
    self,
    texts: TensorType["B", "T_text"],
    input_lengths: TensorType["B"],
    mels: TensorType["B", "C_mel", "T_mel"],
    mel_input_length: TensorType["B"],
    F0_real: TensorType["B", 1, "T_mel"],
):
    # TODO: use Pitch Extractor (maybe torchcrepe)
    # mask = length_to_mask(mel_input_length // (2 ** model.text_aligner.n_down)).to(self.device)
    text_mask = self.lengths_to_mask(input_lengths).to(self.device)
    mel_mask = self.lengths_to_mask(mel_input_length).to(self.device)
    ##### --> TEXT ENCODER
    t_en, t_emb = self.text_encoder(texts, input_lengths, length_to_mask(input_lengths))
    ##### --> ALIGNER
    _, aligner_soft, aligner_logprob, aligner_hard = self._forward_aligner(
        x=t_emb.detach().transpose(1, 2),
        y=mels,
        x_mask=text_mask,
        y_mask=mel_mask,
        attn_priors=None,
    )
    ##### --> EXPAND
    t_en_ex = t_en @ aligner_hard.squeeze(1)
    ##### --> PRUNE THE BATCH BY THE SHORTEST MEL LENGTH
    mel_len = int(mel_input_length.min().item())
    t_en_ex_clipped = []
    mels_clipped = []
    F0s = []
    idxs = []
    for bib in range(len(mel_input_length)):
        mel_length = int(mel_input_length[bib].item()) + 1
        random_start = np.random.randint(0, mel_length - mel_len)
        idxs.append(random_start)
        t_en_ex_clipped.append(t_en_ex[bib, :, random_start : random_start + mel_len])
        mels_clipped.append(mels[bib, :, random_start : random_start + mel_len])
        F0s.append(F0_real[bib, :, random_start : random_start + mel_len])
    t_en_ex_clipped = torch.stack(t_en_ex_clipped)
    mels_clipped = torch.stack(mels_clipped).detach()
    F0_real = torch.stack(F0s).detach()
    ##### --> CALCULATE REAL ENERGY
    N_real = log_norm(mels_clipped.unsqueeze(1)).squeeze(1).detach()
    # F0_real, _, _ = self.pitch_extractor(gt.unsqueeze(1))
    ##### --> STYLE ENCODER
    s = self.style_encoder(mels_clipped.unsqueeze(1))
    ##### --> DURATION PREDICTOR & PROSODY ENCODER
    dur_aligner = aligner_hard.sum(axis=-1).detach()
    dur_pred, prosody_en_ex = self.predictor(
        t_en.detach(), s.detach(), input_lengths, aligner_hard.squeeze(1), length_to_mask(input_lengths)
    )  # [B, T_en]
    d_align_mask = self.lengths_to_mask(input_lengths) * self.lengths_to_mask(mel_input_length).transpose(
        1, 2
    )  # [B, 1, T_enc] * [B, T_dec, 1]
    dur_alignment = generate_path(dur_pred, d_align_mask.squeeze(1).transpose(1, 2)).detach()  # [B, T_dec, T_enc]
    # Strip prosody tensor to match the mel length
    p_en = []
    for bib, start in enumerate(idxs):
        p_en.append(prosody_en_ex[bib, :, start : (start + mel_len)])
    p_en = torch.stack(p_en)
    ##### --> Pitch and Energy Predictor
    F0_fake, N_fake = self.predictor.F0Ntrain(p_en, s.detach())
    ##### --> DECODER
    mel_rec = self.decoder(t_en_ex_clipped, F0_real.squeeze(1), N_real, s)
    return {
        "mel_rec": mel_rec,
        "gt": mels_clipped,
        "aligner_logprob": aligner_logprob,
        "aligner_hard": aligner_hard.squeeze(1),
        "aligner_soft": aligner_soft,
        "F0_real": F0_real.squeeze(1),
        "F0_fake": F0_fake,
        "N_real": N_real,
        "N_fake": N_fake,
        "d": dur_pred,
        "d_gt": dur_aligner.squeeze(1),
        "d_alignment": dur_alignment,
    }
I'm trying to replicate your results. How did you create your any-to-any examples? Did you change the voice to text from the "source" audio and use the "reference" audio as a zero-shot reference?
Similarly, in the emotion examples, were those also zero-shot where you just used the file from ESD as a reference to create the reference embedding?
Thanks!
I ran into trouble when starting a project with a Vietnamese dataset. Can you describe in detail every step I need to complete before starting with StyleTTS, and the format the training data must have so that it looks like the files in your Data folder?
Or like here?
I would really appreciate it if you could describe each step clearly.
Thank you, and have a good day.
As shown in the demo: https://styletts.github.io/
Can someone provide the code for Emotional speech synthesis for prosody transfer?
I have installed Phonemizer, but I am receiving an error about missing espeak. I have also tried to download espeak from the website and added it to the environment variable, but I am still receiving this error. I am also receiving another error about unicode decode.
`---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18492\1790917207.py in
1 # load phonemizer
2 import phonemizer
----> 3 global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)
~\anaconda3.1\lib\site-packages\phonemizer\backend\espeak\espeak.py in init(self, language, punctuation_marks, preserve_punctuation, with_stress, tie, language_switch, words_mismatch, logger)
45 super().init(
46 language, punctuation_marks=punctuation_marks,
---> 47 preserve_punctuation=preserve_punctuation, logger=logger)
48
49 self._espeak.set_voice(language)
~\anaconda3.1\lib\site-packages\phonemizer\backend\espeak\base.py in init(self, language, punctuation_marks, preserve_punctuation, logger)
41 punctuation_marks=punctuation_marks,
42 preserve_punctuation=preserve_punctuation,
---> 43 logger=logger)
44
45 self._espeak = EspeakWrapper()
~\anaconda3.1\lib\site-packages\phonemizer\backend\base.py in init(self, language, punctuation_marks, preserve_punctuation, logger)
76 if not self.is_available():
77 raise RuntimeError( # pragma: nocover
---> 78 '{} not installed on your system'.format(self.name()))
79
80 self._logger = logger
RuntimeError: espeak not installed on your system`
`---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_15548\494391268.py in
3 train_path = config.get('train_data', None)
4 val_path = config.get('val_data', None)
----> 5 train_list, val_list = get_data_path_list(train_path, val_path)
6
7 ref_dicts = {}
~\Desktop\github\StyleTTS\utils.py in get_data_path_list(train_path, val_path)
33
34 with open(train_path, 'r') as f:
---> 35 train_list = f.readlines()
36 with open(val_path, 'r') as f:
37 val_list = f.readlines()
~\anaconda3.1\lib\encodings\cp1250.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 39: character maps to <undefined>`
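The second traceback shows the list file being opened with open(train_path, 'r') and decoded with cp1250, which is the platform default on that Windows machine; IPA transcriptions need UTF-8. A minimal sketch of reading the list with an explicit encoding follows. The helper name read_list_utf8 is hypothetical, and it assumes the list files were saved as UTF-8.

```python
def read_list_utf8(path):
    """Read a training list, decoding it as UTF-8 instead of the platform default."""
    with open(path, "r", encoding="utf-8") as f:
        return f.readlines()
```

The same encoding="utf-8" argument could be added to both open calls in get_data_path_list.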
Can you tell me how to run inference?
https://github.com/yl4579/StyleTTS/blob/main/train_second.py#L307
I didn't see the details about it in your paper.
StyleTTS's ref mel requires a single audio as input, which may result in the style vector only being similar to the ref wav, but somewhat different from other waves of the same speaker. May I ask if the ref mel you use for evaluations is gt or randomly selected from the dataset during your evaluation stage?
Hello,
It truly is an impressive work!
I wonder if it is possible to control the speed of the output speech.
Thank you!
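One unofficial way to approach this is to rescale the predicted per-token durations before the alignment matrix is built. The inference code quoted in another issue on this page rounds the predicted durations and clamps them to a minimum of 1, so the sketch below does the same on a plain list of numbers. The function name scale_durations and the speed parameter are hypothetical.

```python
def scale_durations(durations, speed=1.0):
    """Scale per-token frame counts; speed > 1 shortens them, speed < 1 lengthens them."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Keep at least one frame per token so that no token disappears.
    return [max(1, int(round(d / speed))) for d in durations]
```

Whether the pitch and energy predictors still sound natural at strongly scaled durations is not something this thread confirms.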
Hi, I'm trying to train StyleTTS with custom data, and I got a little confused about how the phoneme sequence is padded.
When pre-training AuxiliaryASR, the _load_tensor function in meldataset.py pads the phoneme sequence with blank_index.
def _load_tensor(self, data):
    wave_path, text, speaker_id = data
    speaker_id = int(speaker_id)
    wave, sr = sf.read(wave_path)
    # phonemize the text
    ps = self.g2p(text.replace('-', ' '))
    if "'" in ps:
        ps.remove("'")
    text = self.text_cleaner(ps)
    blank_index = self.text_cleaner.word_index_dictionary[" "]
    text.insert(0, blank_index) # add a blank at the beginning (silence)
    text.append(blank_index) # add a blank at the end (silence)
    text = torch.LongTensor(text)
    return wave, text, speaker_id
But in StyleTTS's meldataset.py, the _load_tensor function just pads the phoneme sequence with 0.
def _load_tensor(self, data):
    wave_path, text, speaker_id = data
    speaker_id = int(speaker_id)
    wave, sr = sf.read(wave_path)
    if wave.shape[-1] == 2:
        wave = wave[:, 0].squeeze()
    if sr != 24000:
        wave = librosa.resample(wave, sr, 24000)
        print(wave_path, sr)
    wave = np.concatenate([np.zeros([5000]), wave, np.zeros([5000])], axis=0)
    text = self.text_cleaner(text)
    text.insert(0, 0)
    text.append(0)
    text = torch.LongTensor(text)
    return wave, text, speaker_id
The questions are:
Thank you.
Great work!
I found that when the text contains characters from another language, the synthesizer repeatedly reads out the language name plus the letter. What should I pay attention to if I want to add other languages?
Can I train StyleTTS on my custom data in another language?
After starting training, I get the following error, sometimes right away and sometimes after a few steps:
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [652,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "/home/tts/StyleTTS/train_first.py", line 393, in <module>
main()
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/tts/StyleTTS/train_first.py", line 149, in main
ppgs, s2s_pred, s2s_attn_feat = model.text_aligner(mels, mask, texts)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tts/StyleTTS/Utils/ASR/models.py", line 45, in forward
_, s2s_logit, s2s_attn = self.asr_s2s(x, src_key_padding_mask, text_input)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tts/StyleTTS/Utils/ASR/models.py", line 130, in forward
print(f"... {text_input} {decoder_inputs.size(1)}")
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 873, in __format__
return object.__format__(self, format_spec)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 426, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 636, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 567, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 327, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor_str.py", line 111, in __init__
value_str = "{}".format(value)
File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 872, in __format__
return self.item().__format__(fo
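The assertion srcIndex < srcSelectDimSize generally means that an index passed to an indexed lookup, such as an embedding table, is outside the table. One plausible cause here, given as an assumption since the thread does not confirm it, is a token index that is greater than or equal to the number of symbols the model was configured with. The sketch below checks a token sequence on the CPU before it reaches the GPU; check_token_indices and n_token are hypothetical names.

```python
def check_token_indices(token_ids, n_token):
    """Return (position, index) pairs for token ids outside the range [0, n_token)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(token_ids)
        if not 0 <= idx < n_token
    ]
```

Running this over every entry of the training list narrows the problem down to specific utterances and symbols.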
Hey, I'm trying to add my BigVGAN vocoder model to the inference script, but the generated audio always has a lot of noise compared to the inference script of the original BigVGAN code base. Any ideas on why that could be? It looks to be the same setup as HiFi-GAN. https://github.com/NVIDIA/BigVGAN. If you would like one of my trained models, let me know and I'll give you a download link so you can test with it, as there are currently no available models.
Thanks in advance!
%cd /content/BigVGAN
from __future__ import absolute_import, division, print_function, unicode_literals
#import sys
#sys.path.append("./content/BigVGAN")
import glob
import os
import argparse
import json
from scipy.io.wavfile import write
from env import AttrDict
from meldataset1 import mel_spectrogram, MAX_WAV_VALUE
from models1 import BigVGAN as Generator
import librosa
torch.backends.cudnn.benchmark = False
def load_checkpoint(filepath, device):
    assert os.path.isfile(filepath)
    print("Loading '{}'".format(filepath))
    checkpoint_dict = torch.load(filepath, map_location=device)
    print("Complete.")
    return checkpoint_dict
def scan_checkpoint(cp_dir, prefix):
    pattern = os.path.join(cp_dir, prefix + '*')
    cp_list = glob.glob(pattern)
    if len(cp_list) == 0:
        return ''
    return sorted(cp_list)[-1]
cp_g = scan_checkpoint("configs", 'g_001')
config_file = os.path.join(os.path.split(cp_g)[0], 'bigvgan_24khz_100band.json') #actually 80-band to work with the StyleTTS model
with open(config_file) as f:
    data = f.read()
json_config = json.loads(data)
h = AttrDict(json_config)
device = torch.device(device)
generator = Generator(h).to(device)
state_dict_g = load_checkpoint(cp_g, device)
generator.load_state_dict(state_dict_g['generator'])
generator.eval()
generator.remove_weight_norm()
%cd /content/StyleTTS
import time
converted_samples = {}
start_time = time.time()
input_length = torch.LongTensor([tokens.shape[-1]]).to(device)
mask = length_to_mask(input_length).to(device)
with torch.no_grad():
    input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
    m = length_to_mask(input_lengths).to(device)
    t_en = model.text_encoder(tokens, input_lengths, m)
    for key, (ref, _) in reference_embeddings.items():
        s = ref.squeeze(1).to(device)
        style = s
        d = model.predictor.text_encoder(t_en, style, input_lengths, m)
        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)
        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data), device=device)
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0))
        style = s.expand(en.shape[0], en.shape[1], -1)
        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)
        out = model.decoder((t_en @ pred_aln_trg.unsqueeze(0)), F0_pred, N_pred, ref.squeeze().unsqueeze(0))
        audio_signal = out.cpu().numpy().squeeze()
        #Apply the Mel Spectrogram transformation
        mel_spectrogram = librosa.feature.melspectrogram(y=audio_signal, sr=24000, n_fft=1024, hop_length=256, n_mels=80, win_length=1024)
        #Convert the Mel Spectrogram to decibels
        mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
        out1 = torch.FloatTensor(mel_spectrogram_db).to(device)
        y_g_hat = generator(out)
        y_out = y_g_hat.squeeze()
        y_out1 = y_out * MAX_WAV_VALUE
        y_out2 = y_out1.cpu().numpy()
        converted_samples[key] = y_out2
end_time = time.time()
print("Time taken: ", end_time - start_time, "seconds")
I also tried the original code, with the same grainy, noisy audio as a result:
y_g_hat = generator(out)
y_out = y_g_hat.squeeze().cpu().numpy()
converted_samples[key] = y_out
Great work, thanks.
Mandarin support?
If we do not use the pretrained ASR checkpoint and just let the ASR model update its parameters together with the TTS model, can we still get good TTS results?
I observe very slow progress in the duration loss during the second stage of training. Is this expected, or can you think of any issue that might be causing it?
For each epoch, the eval loss is going 2.21 -> 2.20 -> 2.18 ... whereas the F0 loss converged very quickly.
BTW I am using VCTK + LibriTTS.
I also tried reducing the dropout to 0.1 for the duration projection layer but didn't help.
Does text_utils.py (https://github.com/yl4579/StyleTTS/blob/main/text_utils.py) require:
import os
import os.path as osp
import pandas as pd
StyleTTS/Demo/Inference_LJSpeech.ipynb
Line 326 in eac6715
I could not find any mention in the paper. I wonder why the expanded text representation is half of the mel length as in the line below. Would you mind explaining the reason?
Line 327 in 451540d
Hi,
Is it possible to finetune this model?
Thanks
Traceback (most recent call last):
File "/home/jumpcloud/libraries/StyleTTS/train_second.py", line 494, in <module>
main()
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/jumpcloud/libraries/StyleTTS/train_second.py", line 220, in main
bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jumpcloud/miniconda3/envs/work/lib/python3.10/site-packages/transformers/models/albert/modeling_albert.py", line 724, in forward
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (584) must match the existing size (512) at non-singleton dimension 1. Target sizes: [12, 584]. Tensor sizes: [1, 512]
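The error shows a sequence of 584 tokens being fed to a buffer sized for 512 positions, so at least one utterance in the batch is too long for that model. One workaround, sketched here under the assumption that each character of the transcription becomes one token and that two padding tokens are added per utterance (as in the _load_tensor code quoted in another issue), is to filter overlong lines out of the training list. The function name and its parameters are hypothetical.

```python
def split_by_token_limit(lines, max_tokens=512, extra_tokens=2):
    """Split 'path|text|...' lines into (kept, dropped) by estimated token count."""
    kept, dropped = [], []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        text = fields[1] if len(fields) > 1 else ""
        # Assumption: one token per character plus the padding tokens.
        n_tokens = len(text) + extra_tokens
        if n_tokens <= max_tokens:
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped
```

Splitting long utterances into shorter ones would keep the data instead of discarding it, at the cost of re-cutting the audio.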
Hi @yl4579, thanks for your amazing work.
I want to finetune the HiFi-GAN model, but it seems the pretrained weights only include the generator. Could you also publish the discriminator? Thanks.
When I try to use the provided pretrained model and train with multiple GPUs, I get these key errors: "Missing key(s) in state_dict:", "Unexpected key(s) in state_dict:".
Is there any way to use the single GPU pretrained model, as a starting point to train with multiple GPUs?
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for MyDataParallel: Missing key(s) in state_dict: "module.text_encoder.lstms.0.weight_ih_l0", "module.text_encoder.lstms.0.weight_hh_l0", "module.text_encoder.lstms.0.bias_ih_l0", "module.text_encoder.lstms.0.bias_hh_l0", "module.text_encoder.lstms.0.weight_ih_l0_reverse", "module.text_encoder.lstms.0.weight_hh_l0_reverse", "module.text_encoder.lstms.0.bias_ih_l0_reverse", "module.text_encoder.lstms.0.bias_hh_l0_reverse", "module.text_encoder.lstms.1.fc.weight", "module.text_encoder.lstms.1.fc.bias", "module.text_encoder.lstms.2.weight_ih_l0", "module.text_encoder.lstms.2.weight_hh_l0", "module.text_encoder.lstms.2.bias_ih_l0", "module.text_encoder.lstms.2.bias_hh_l0", "module.text_encoder.lstms.2.weight_ih_l0_reverse", "module.text_encoder.lstms.2.weight_hh_l0_reverse", "module.text_encoder.lstms.2.bias_ih_l0_reverse", "module.text_encoder.lstms.2.bias_hh_l0_reverse", "module.text_encoder.lstms.3.fc.weight", "module.text_encoder.lstms.3.fc.bias", "module.text_encoder.lstms.4.weight_ih_l0", "module.text_encoder.lstms.4.weight_hh_l0", "module.text_encoder.lstms.4.bias_ih_l0", "module.text_encoder.lstms.4.bias_hh_l0", "module.text_encoder.lstms.4.weight_ih_l0_reverse", "module.text_encoder.lstms.4.weight_hh_l0_reverse", "module.text_encoder.lstms.4.bias_ih_l0_reverse", "module.text_encoder.lstms.4.bias_hh_l0_reverse", "module.text_encoder.lstms.5.fc.weight", "module.text_encoder.lstms.5.fc.bias", "module.lstm.weight_ih_l0", "module.lstm.weight_hh_l0", "module.lstm.bias_ih_l0", "module.lstm.bias_hh_l0", "module.lstm.weight_ih_l0_reverse", "module.lstm.weight_hh_l0_reverse", "module.lstm.bias_ih_l0_reverse", "module.lstm.bias_hh_l0_reverse", "module.duration_proj.linear_layer.weight", "module.duration_proj.linear_layer.bias", "module.shared.weight_ih_l0", "module.shared.weight_hh_l0", "module.shared.bias_ih_l0", "module.shared.bias_hh_l0", 
"module.shared.weight_ih_l0_reverse", "module.shared.weight_hh_l0_reverse", "module.shared.bias_ih_l0_reverse", "module.shared.bias_hh_l0_reverse", "module.F0.0.conv1.bias", "module.F0.0.conv1.weight_g", "module.F0.0.conv1.weight_v", "module.F0.0.conv2.bias", "module.F0.0.conv2.weight_g", "module.F0.0.conv2.weight_v", "module.F0.0.norm1.fc.weight", "module.F0.0.norm1.fc.bias", "module.F0.0.norm2.fc.weight", "module.F0.0.norm2.fc.bias", "module.F0.1.conv1.bias", "module.F0.1.conv1.weight_g", "module.F0.1.conv1.weight_v", "module.F0.1.conv2.bias", "module.F0.1.conv2.weight_g", "module.F0.1.conv2.weight_v", "module.F0.1.norm1.fc.weight", "module.F0.1.norm1.fc.bias", "module.F0.1.norm2.fc.weight", "module.F0.1.norm2.fc.bias", "module.F0.1.conv1x1.weight_g", "module.F0.1.conv1x1.weight_v", "module.F0.1.pool.bias", "module.F0.1.pool.weight_g", "module.F0.1.pool.weight_v", "module.F0.2.conv1.bias", "module.F0.2.conv1.weight_g", "module.F0.2.conv1.weight_v", "module.F0.2.conv2.bias", "module.F0.2.conv2.weight_g", "module.F0.2.conv2.weight_v", "module.F0.2.norm1.fc.weight", "module.F0.2.norm1.fc.bias", "module.F0.2.norm2.fc.weight", "module.F0.2.norm2.fc.bias", "module.N.0.conv1.bias", "module.N.0.conv1.weight_g", "module.N.0.conv1.weight_v", "module.N.0.conv2.bias", "module.N.0.conv2.weight_g", "module.N.0.conv2.weight_v", "module.N.0.norm1.fc.weight", "module.N.0.norm1.fc.bias", "module.N.0.norm2.fc.weight", "module.N.0.norm2.fc.bias", "module.N.1.conv1.bias", "module.N.1.conv1.weight_g", "module.N.1.conv1.weight_v", "module.N.1.conv2.bias", "module.N.1.conv2.weight_g", "module.N.1.conv2.weight_v", "module.N.1.norm1.fc.weight", "module.N.1.norm1.fc.bias", "module.N.1.norm2.fc.weight", "module.N.1.norm2.fc.bias", "module.N.1.conv1x1.weight_g", "module.N.1.conv1x1.weight_v", "module.N.1.pool.bias", "module.N.1.pool.weight_g", "module.N.1.pool.weight_v", "module.N.2.conv1.bias", "module.N.2.conv1.weight_g", "module.N.2.conv1.weight_v", "module.N.2.conv2.bias", 
"module.N.2.conv2.weight_g", "module.N.2.conv2.weight_v", "module.N.2.norm1.fc.weight", "module.N.2.norm1.fc.bias", "module.N.2.norm2.fc.weight", "module.N.2.norm2.fc.bias", "module.F0_proj.weight", "module.F0_proj.bias", "module.N_proj.weight", "module.N_proj.bias". Unexpected key(s) in state_dict: "text_encoder.lstms.0.weight_ih_l0", "text_encoder.lstms.0.weight_hh_l0", "text_encoder.lstms.0.bias_ih_l0", "text_encoder.lstms.0.bias_hh_l0", "text_encoder.lstms.0.weight_ih_l0_reverse", "text_encoder.lstms.0.weight_hh_l0_reverse", "text_encoder.lstms.0.bias_ih_l0_reverse", "text_encoder.lstms.0.bias_hh_l0_reverse", "text_encoder.lstms.1.fc.weight", "text_encoder.lstms.1.fc.bias", "text_encoder.lstms.2.weight_ih_l0", "text_encoder.lstms.2.weight_hh_l0", "text_encoder.lstms.2.bias_ih_l0", "text_encoder.lstms.2.bias_hh_l0", "text_encoder.lstms.2.weight_ih_l0_reverse", "text_encoder.lstms.2.weight_hh_l0_reverse", "text_encoder.lstms.2.bias_ih_l0_reverse", "text_encoder.lstms.2.bias_hh_l0_reverse", "text_encoder.lstms.3.fc.weight", "text_encoder.lstms.3.fc.bias", "text_encoder.lstms.4.weight_ih_l0", "text_encoder.lstms.4.weight_hh_l0", "text_encoder.lstms.4.bias_ih_l0", "text_encoder.lstms.4.bias_hh_l0", "text_encoder.lstms.4.weight_ih_l0_reverse", "text_encoder.lstms.4.weight_hh_l0_reverse", "text_encoder.lstms.4.bias_ih_l0_reverse", "text_encoder.lstms.4.bias_hh_l0_reverse", "text_encoder.lstms.5.fc.weight", "text_encoder.lstms.5.fc.bias", "lstm.weight_ih_l0", "lstm.weight_hh_l0", "lstm.bias_ih_l0", "lstm.bias_hh_l0", "lstm.weight_ih_l0_reverse", "lstm.weight_hh_l0_reverse", "lstm.bias_ih_l0_reverse", "lstm.bias_hh_l0_reverse", "duration_proj.linear_layer.weight", "duration_proj.linear_layer.bias", "shared.weight_ih_l0", "shared.weight_hh_l0", "shared.bias_ih_l0", "shared.bias_hh_l0", "shared.weight_ih_l0_reverse", "shared.weight_hh_l0_reverse", "shared.bias_ih_l0_reverse", "shared.bias_hh_l0_reverse", "F0.0.conv1.bias", "F0.0.conv1.weight_g", "F0.0.conv1.weight_v", 
"F0.0.conv2.bias", "F0.0.conv2.weight_g", "F0.0.conv2.weight_v", "F0.0.norm1.fc.weight", "F0.0.norm1.fc.bias", "F0.0.norm2.fc.weight", "F0.0.norm2.fc.bias", "F0.1.conv1.bias", "F0.1.conv1.weight_g", "F0.1.conv1.weight_v", "F0.1.conv2.bias", "F0.1.conv2.weight_g", "F0.1.conv2.weight_v", "F0.1.norm1.fc.weight", "F0.1.norm1.fc.bias", "F0.1.norm2.fc.weight", "F0.1.norm2.fc.bias", "F0.1.conv1x1.weight_g", "F0.1.conv1x1.weight_v", "F0.1.pool.bias", "F0.1.pool.weight_g", "F0.1.pool.weight_v", "F0.2.conv1.bias", "F0.2.conv1.weight_g", "F0.2.conv1.weight_v", "F0.2.conv2.bias", "F0.2.conv2.weight_g", "F0.2.conv2.weight_v", "F0.2.norm1.fc.weight", "F0.2.norm1.fc.bias", "F0.2.norm2.fc.weight", "F0.2.norm2.fc.bias", "N.0.conv1.bias", "N.0.conv1.weight_g", "N.0.conv1.weight_v", "N.0.conv2.bias", "N.0.conv2.weight_g", "N.0.conv2.weight_v", "N.0.norm1.fc.weight", "N.0.norm1.fc.bias", "N.0.norm2.fc.weight", "N.0.norm2.fc.bias", "N.1.conv1.bias", "N.1.conv1.weight_g", "N.1.conv1.weight_v", "N.1.conv2.bias", "N.1.conv2.weight_g", "N.1.conv2.weight_v", "N.1.norm1.fc.weight", "N.1.norm1.fc.bias", "N.1.norm2.fc.weight", "N.1.norm2.fc.bias", "N.1.conv1x1.weight_g", "N.1.conv1x1.weight_v", "N.1.pool.bias", "N.1.pool.weight_g", "N.1.pool.weight_v", "N.2.conv1.bias", "N.2.conv1.weight_g", "N.2.conv1.weight_v", "N.2.conv2.bias", "N.2.conv2.weight_g", "N.2.conv2.weight_v", "N.2.norm1.fc.weight", "N.2.norm1.fc.bias", "N.2.norm2.fc.weight", "N.2.norm2.fc.bias", "F0_proj.weight", "F0_proj.bias", "N_proj.weight", "N_proj.bias".
If I want to train this repo on Chinese with a phoneme set other than the one from pypinyin, what do I need to do? Thanks!
I found this line of code in meldataset.py and I am curious about what it does. Why does the waveform need to be extended with zeros on both sides?
wave = torch.cat([torch.zeros([5000]), wave, torch.zeros([5000])], axis=0)
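Mechanically, the quoted line concatenates 5000 zero-valued samples before and after the waveform, so the signal gains a stretch of silence at each end. The sketch below reproduces that effect with plain Python lists so it runs without PyTorch. The 24 kHz sample rate is an assumption used only to convert the padding to seconds, and `pad_with_silence` is a hypothetical helper name. It shows what the line does, not why the author chose it.

```python
def pad_with_silence(wave, pad=5000):
    """Mimic torch.cat([zeros(pad), wave, zeros(pad)]) using plain lists."""
    return [0.0] * pad + list(wave) + [0.0] * pad


SAMPLE_RATE = 24000  # assumption: 24 kHz audio; adjust to the actual dataset

wave = [0.1, -0.2, 0.3]          # toy 3-sample waveform
padded = pad_with_silence(wave)  # 5000 zeros + 3 samples + 5000 zeros

pad_seconds = 5000 / SAMPLE_RATE  # duration of silence added on each side
print(len(padded), round(pad_seconds, 3))
```

Under the 24 kHz assumption, each side gains roughly 0.2 seconds of silence.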