moonintheriver / diffsinger

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code

License: MIT License

Python 100.00%
text-to-speech diffusion-speedup tts aaai2022 singing-synthesis diffusion-model speech-synthesis singing-voice-synthesis singing-voice singing-voice-database midi

diffsinger's Introduction

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism


This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose DiffSinger (for Singing-Voice-Synthesis) and DiffSpeech (for Text-to-Speech).

🎉🎉🎉 Updates:

  • Sep.11, 2022: 🔌 DiffSinger-PN. Added the plug-in PNDM (ICLR 2022, from our laboratory) to accelerate DiffSinger for free.
  • Jul.27, 2022: Updated the documents for SVS. Added easy inference A & B; added interactive SVS running on HuggingFace🤗 SVS.
  • Mar.2, 2022: MIDI-B version.
  • Mar.1, 2022: NeuralSVB, for singing voice beautifying, has been released.
  • Feb.13, 2022: NATSpeech, the improved code framework, which contains the implementations of DiffSpeech and our NeurIPS-2021 work PortaSpeech, has been released.
  • Jan.29, 2022: Support for the MIDI-A version of SVS.
  • Jan.13, 2022: Support for SVS; released the PopCS dataset.
  • Dec.19, 2021: Support for TTS. HuggingFace🤗 TTS.

🚀 News:

  • Feb.24, 2022: Our new work NeuralSVB was accepted by ACL-2022 (arXiv). Demo page.
  • Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
  • Sep.29, 2021: Our recent work PortaSpeech: Portable and High-Quality Generative Text-to-Speech was accepted by NeurIPS-2021 (arXiv).
  • May.06, 2021: We submitted DiffSinger to arXiv.

Environments

  1. If you want to use an anaconda environment:

    conda create -n your_env_name python=3.8
    source activate your_env_name
    pip install -r requirements_2080.txt   # GPU 2080Ti, CUDA 10.2
    # or:
    pip install -r requirements_3090.txt   # GPU 3090, CUDA 11.4
  2. Or, if you want to use a Python virtual environment:

    ## Install Python 3.8 first. 
    python -m venv venv
    source venv/bin/activate
    # install requirements.
    pip install -U pip
    pip install Cython numpy==1.19.1
    pip install torch==1.9.0
    pip install -r requirements.txt
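
A quick optional sanity check after either setup (assuming the torch==1.9.0 pin from the venv route; the conda requirements files may pin a different version):

    # optional: verify that the install sees the GPU
    import torch

    print(torch.__version__)           # e.g. 1.9.0 with the venv instructions above
    print(torch.cuda.is_available())   # should print True on the 2080Ti / 3090 setups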

Documents

Overview

| Mel Pipeline | Dataset | Pitch Input | F0 Prediction | Acceleration Method | Vocoder |
|---|---|---|---|---|---|
| DiffSpeech (Text->F0, Text+F0->Mel, Mel->Wav) | Ljspeech | None | Explicit | Shallow Diffusion | HiFiGAN |
| DiffSinger (Lyric+F0->Mel, Mel->Wav) | PopCS | Ground-Truth F0 | None | Shallow Diffusion | NSF-HiFiGAN |
| DiffSinger (Lyric+MIDI->F0, Lyric+F0->Mel, Mel->Wav) | OpenCpop | MIDI | Explicit | Shallow Diffusion | NSF-HiFiGAN |
| FFT-Singer (Lyric+MIDI->F0, Lyric+F0->Mel, Mel->Wav) | OpenCpop | MIDI | Explicit | Invalid | NSF-HiFiGAN |
| DiffSinger (Lyric+MIDI->Mel, Mel->Wav) | OpenCpop | MIDI | Implicit | None | Pitch-Extractor + NSF-HiFiGAN |
| DiffSinger+PNDM (Lyric+MIDI->Mel, Mel->Wav) | OpenCpop | MIDI | Implicit | PLMS | Pitch-Extractor + NSF-HiFiGAN |
| DiffSpeech+PNDM (Text->Mel, Mel->Wav) | Ljspeech | None | Implicit | PLMS | HiFiGAN |

Tensorboard

tensorboard --logdir_spec exp_name

Citation

@article{liu2021diffsinger,
  title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
  author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Liu, Peng and Zhao, Zhou},
  journal={arXiv preprint arXiv:2105.02446},
  volume={2},
  year={2021}}

Acknowledgements

Special thanks to:

  • Team OpenVPI's maintenance: DiffSinger.
  • Your re-creation and sharing.

diffsinger's People

Contributors

luping-liu, moonintheriver, mrzixi


diffsinger's Issues

Why does it need to run fs2 inference before training ds?

# first run fs2 infer;
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer 
# second run ds train;
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_exp2 --reset

Audio generated with the README's pretrained models and steps does not match expectations

The audio generated by following the README's steps with the pretrained models is completely garbled. No errors were raised during the process. May I ask what the cause could be?
(screenshot attached: 企业微信截图_3d02a6ad-1c2a-4a49-8a8c-047597aa3698)
(audio attached: [popcs-说散就散-0001][P]算了吧@我付出过甚么没关系@我忽略自己@就因为遇见你-popcs_exp2.wav.zip)

size mismatch for model.encoder_embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).

Having successfully run step 1, data preparation, I am now trying to run inference. I am using the given dataset preview.
Running CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer according to the readme.md, I end up with this error:

| model Trainable Parameters: 24.253M
Traceback (most recent call last):
  File "tasks/run.py", line 15, in <module>
    run_task()
  File "tasks/run.py", line 10, in run_task
    task_cls.start()
  File "/.../DiffSinger/tasks/base_task.py", line 258, in start
    trainer.test(task)
  File "/.../DiffSinger/utils/pl_utils.py", line 586, in test
    self.fit(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 541, in run_pretrain_routine
    self.restore_weights(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 617, in restore_weights
    self.restore_state_if_checkpoint_exists(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 655, in restore_state_if_checkpoint_exists
    self.restore(last_ckpt_path, self.on_gpu)
  File "/.../DiffSinger/utils/pl_utils.py", line 668, in restore
    model.load_state_dict(checkpoint['state_dict'], strict=False)
  File "/.../envs/DiffSinger/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for FastSpeech2Task:
	size mismatch for model.encoder_embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).
	size mismatch for model.encoder.embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).

Do you have any ideas on what could be wrong here and how to resolve it?
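
A hedged diagnostic sketch for this kind of mismatch (the .ckpt filename below is a placeholder; use whichever checkpoint file the release zip actually contains): compare the phone set your binarized data produced against the embedding rows in the checkpoint.

    import json
    import torch

    # Paths/filenames are illustrative; the state-dict key comes from the traceback above.
    phone_set = json.load(open('data/binary/popcs-pmf0/phone_set.json'))
    ckpt = torch.load('checkpoints/popcs_fs2_pmf0_1230/model_ckpt.ckpt', map_location='cpu')
    emb = ckpt['state_dict']['model.encoder_embed_tokens.weight']
    print(len(phone_set), 'phones in data vs', emb.shape[0], 'embedding rows in checkpoint')
    # 57 vs 62 suggests the data was binarized with a smaller phoneme dict than
    # the one the checkpoint was trained with (e.g., from the dataset preview only).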

why set `uv = f0 = 0` in norm_interp_f0?

    def norm_interp_f0(f0, hparams):
        is_torch = isinstance(f0, torch.Tensor)
        if is_torch:
            device = f0.device
            f0 = f0.data.cpu().numpy()
        uv = f0 == 0
        f0 = norm_f0(f0, uv, hparams)

Hey guys, I noticed you set uv = f0 == 0 at line 50. What's the intention behind this? Is it a typo, or am I missing something important?
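
For what it's worth, a hedged note rather than an official answer: that line is an elementwise comparison, not a chained assignment, so uv ends up as a boolean mask marking the unvoiced (f0 = 0) frames:

    import numpy as np

    f0 = np.array([0.0, 220.5, 221.0, 0.0, 230.2])
    uv = f0 == 0       # elementwise comparison, not assignment
    print(uv)          # [ True False False  True False] -> mask of unvoiced frames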

Help! Data generation errors occurred

When I used binarize.py to generate the data, the following errors occurred:

Traceback (most recent call last):
  File "/home/saxsax/svs/diff/utils/multiprocess_utils.py", line 13, in chunked_worker
    res = map_func(*arg)
  File "/home/saxsax/svs/diff/data_gen/singing/binarize.py", line 167, in process_item
    cls.get_pitch(wav, mel, res)
  File "/home/saxsax/svs/diff/data_gen/tts/base_binarizer.py", line 201, in get_pitch
    f0, pitch_coarse = get_pitch(wav, mel, hparams)
  File "/home/saxsax/svs/diff/data_gen/tts/data_gen_utils.py", line 174, in get_pitch
    f0 = np.pad(f0, [[lpad, rpad]], mode='constant')
  File "<array_function internals>", line 5, in pad
  File "/home/saxsax/miniconda3/envs/pytorch/lib/python3.9/site-packages/numpy/lib/arraypad.py", line 743, in pad
    pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
  File "/home/saxsax/miniconda3/envs/pytorch/lib/python3.9/site-packages/numpy/lib/arraypad.py", line 510, in _as_pairs
    raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values

Boundary Prediction

Hi, great job! I have a question: my calculated K_step is always equal to timesteps. Can you provide the calculation code related to Boundary Prediction?

Training on Opencpop data failed

I followed the document in usr/configs/midi/readme-e2e.md.
After running the train command, I got the error [Errno 2] No such file or directory: 'data/binary/opencpop-midi-dp/phone_set.json'.
I tried using usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml to run data_gen/tts/bin/binarize.py, but still got an error:
FileNotFoundError: [Errno 2] No such file or directory: 'data/raw/popcs/transcriptions.txt'.
Does this mean that even when using Opencpop, I still need the PopCS data?

Does the model take the phoneme duration as input at inference?

Thanks for your wonderful work!
I was running inference with 0128_opencpop_ds58_midi, but there's a problem that bothers me.

https://github.com/MoonInTheRiver/DiffSinger/blob/master/tasks/tts/fs2.py#L348

    ############
    # infer
    ############
    def test_step(self, sample, batch_idx):
        spk_embed = sample.get('spk_embed') if not hparams['use_spk_id'] else sample.get('spk_ids')
        txt_tokens = sample['txt_tokens']
        mel2ph, uv, f0 = None, None, None
        ref_mels = None
        if hparams['profile_infer']:
            pass
        else:
            if hparams['use_gt_dur']:
                mel2ph = sample['mel2ph']
            if hparams['use_gt_f0']:
                f0 = sample['f0']
                uv = sample['uv']
                print('Here using gt f0!!')
            if hparams.get('use_midi') is not None and hparams['use_midi']:
                outputs = self.model(
                    txt_tokens, spk_embed=spk_embed, mel2ph=mel2ph, f0=f0, uv=uv, ref_mels=ref_mels, infer=True,
                    pitch_midi=sample['pitch_midi'], midi_dur=sample.get('midi_dur'), is_slur=sample.get('is_slur'))
            else:
                outputs = self.model(
                    txt_tokens, spk_embed=spk_embed, mel2ph=mel2ph, f0=f0, uv=uv, ref_mels=ref_mels, infer=True)

The param use_gt_dur is True; that is, the model takes the ground-truth phoneme duration as input at inference.
Is that correct?

Data Preparation not working?

Thank you for sharing this project!
I am trying to run inference on a pre-trained model (DiffSinger), following the directions in the README. I have downloaded your dataset preview and trained models. I have tried both using a symlink to the dataset as instructed and also placing everything in data/processed/popcs/ directly.

At step 1, packing the dataset, I seem to run into a problem:

(DiffSinger) user@ubuntu:~/.../DiffSinger$ CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/popcs_ds_beta6.yaml

| Hparams chains:  ['configs/config_base.yaml', 'configs/tts/base.yaml', 'configs/tts/fs2.yaml', 'configs/tts/base_zh.yaml', 'configs/singing/base.yaml', 'usr/configs/base.yaml', 'usr/configs/popcs_ds_beta6.yaml']
| Hparams: 
K_step: 51, accumulate_grad_batches: 1, audio_num_mel_bins: 80, audio_sample_rate: 24000, base_config: ['configs/tts/fs2.yaml', 'configs/singing/base.yaml', './base.yaml'], 
binarization_args: {'shuffle': False, 'with_txt': True, 'with_wav': True, 'with_align': True, 'with_spk_embed': False, 'with_f0': True, 'with_f0cwt': True}, binarizer_cls: data_gen.singing.binarize.SingingBinarizer, binary_data_dir: data/binary/popcs-pmf0, check_val_every_n_epoch: 10, clip_grad_norm: 1, 
content_cond_steps: [], cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1, 
cwt_std_scale: 0.8, datasets: ['popcs'], debug: False, dec_ffn_kernel_size: 9, dec_layers: 4, 
decay_steps: 50000, decoder_type: fft, dict_dir: , diff_decoder_type: wavenet, diff_loss_type: l1, 
dilation_cycle_length: 1, dropout: 0.1, ds_workers: 4, dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, 
dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_ffn_kernel_size: 9, enc_layers: 4, encoder_K: 8, 
encoder_type: fft, endless_ds: True, ffn_act: gelu, ffn_padding: SAME, fft_size: 512, 
fmax: 12000, fmin: 30, fs2_ckpt: , gen_dir_name: , gen_tgt_spk_id: -1, 
hidden_size: 256, hop_size: 128, infer: False, keep_bins: 80, lambda_commit: 0.25, 
lambda_energy: 0.0, lambda_f0: 0.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, 
lambda_word_dur: 0.0, load_ckpt: , log_interval: 100, loud_norm: False, lr: 0.001, 
max_beta: 0.06, max_epochs: 1000, max_eval_sentences: 1, max_eval_tokens: 60000, max_frames: 5000, 
max_input_tokens: 1550, max_sentences: 48, max_tokens: 20000, max_updates: 160000, mel_loss: ssim:0.5|l1:0.5, 
mel_vmax: 1.5, mel_vmin: -6, min_level_db: -120, norm_type: gn, num_ckpt_keep: 3, 
num_heads: 2, num_sanity_val_steps: 1, num_spk: 1, num_test_samples: 0, num_valid_plots: 10, 
optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, out_wav_norm: False, pitch_ar: False, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'], 
pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: log, pitch_type: frame, pre_align_args: {'use_tone': False, 'forced_align': 'mfa', 'use_sox': True, 'txt_processor': 'zh_g2pM', 'allow_no_txt': False, 'denoise': False}, 
pre_align_cls: data_gen.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.0, predictor_hidden: -1, predictor_kernel: 5, 
predictor_layers: 2, prenet_dropout: 0.5, prenet_hidden_size: 256, pretrain_fs_ckpt: , processed_data_dir: data/processed/popcs, 
profile_infer: False, raw_data_dir: data/raw/popcs, ref_norm_layer: bn, reset_phone_dict: True, residual_channels: 256, 
residual_layers: 20, save_best: False, save_ckpt: True, save_codes: ['configs', 'modules', 'tasks', 'utils', 'usr'], save_f0: True, 
save_gt: False, schedule_type: linear, seed: 1234, sort_by_len: True, spec_max: [0.2645, 0.0583, -0.2344, -0.0184, 0.1227, 0.1533, 0.1103, 0.1212, 0.2421, 0.1809, 0.2134, 0.3161, 0.3301, 0.3289, 0.2667, 0.2421, 0.2581, 0.26, 0.1394, 0.1907, 0.1082, 0.1474, 0.168, 0.255, 0.1057, 0.0826, 0.0423, 0.1203, -0.0701, -0.0056, 0.0477, -0.0639, -0.0272, -0.0728, -0.1648, -0.0855, -0.2652, -0.1998, -0.1547, -0.2167, -0.4181, -0.5463, -0.4161, -0.4733, -0.6518, -0.5387, -0.429, -0.4191, -0.4151, -0.3042, -0.381, -0.416, -0.4496, -0.2847, -0.4676, -0.4658, -0.4931, -0.4885, -0.5547, -0.5481, -0.6948, -0.7968, -0.8455, -0.8392, -0.877, -0.952, -0.8749, -0.7297, -0.8374, -0.8667, -0.7157, -0.9035, -0.9219, -0.8801, -0.9298, -0.9009, -0.9604, -1.0537, -1.0781, -1.3766], 
spec_min: [-6.8276, -7.027, -6.8142, -7.1429, -7.6669, -7.6, -7.1148, -6.964, -6.8414, -6.6596, -6.688, -6.7439, -6.7986, -7.494, -7.7845, -7.6586, -6.9288, -6.7639, -6.9118, -6.8246, -6.7183, -7.1769, -6.9794, -7.4513, -7.3422, -7.5623, -6.961, -6.8158, -6.9595, -6.8403, -6.5688, -6.6356, -7.0209, -6.5002, -6.7819, -6.5232, -6.6927, -6.5701, -6.5531, -6.7069, -6.6462, -6.4523, -6.5954, -6.4264, -6.4487, -6.707, -6.4025, -6.3042, -6.4008, -6.3857, -6.3903, -6.3094, -6.2491, -6.3518, -6.3566, -6.4168, -6.2481, -6.3624, -6.2858, -6.2575, -6.3638, -6.452, -6.1835, -6.2754, -6.1253, -6.1645, -6.0638, -6.1262, -6.071, -6.1039, -6.4428, -6.1363, -6.1054, -6.1252, -6.1797, -6.0235, -6.0758, -5.9453, -6.0213, -6.0446], spk_cond_steps: [], stop_token_weight: 5.0, task_cls: usr.diffsinger_task.DiffSingerTask, test_ids: [], 
test_input_dir: , test_num: 0, test_prefixes: ['popcs-说散就散', 'popcs-隐形的翅膀'], test_set_name: test, timesteps: 100, 
train_set_name: train, use_denoise: False, use_energy_embed: False, use_gt_dur: True, use_gt_f0: True, 
use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False, 
use_split_spk_id: False, use_uv: True, use_var_enc: False, val_check_interval: 2000, valid_num: 0, 
valid_set_name: valid, validate: False, vocoder: vocoders.hifigan.HifiGAN, vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128, warmup_updates: 2000, 
weight_decay: 0, win_size: 512, work_dir: , 
| Binarizer:  <class 'data_gen.singing.binarize.SingingBinarizer'>
| spk_map:  {}
| Build phone set:  []
0it [00:00, ?it/s]
| valid total duration: 0.000s
0it [00:00, ?it/s]
| test total duration: 0.000s
0it [00:00, ?it/s]
| train total duration: 0.000s

It creates the folder data/binary/popcs-pmf0 with 11 files, but they seem to be essentially empty.
Can you please tell me what I am missing? Why does it not find or use the dataset?
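
A hedged sanity check for readers hitting the same empty build (the binarizer found zero items in every split, which suggests it never saw the processed data; the per-song folder layout assumed below is a guess based on processed_data_dir: data/processed/popcs from the hparams dump):

    import glob
    import os

    root = 'data/processed/popcs'   # processed_data_dir from the hparams above
    print(os.path.isdir(root))                              # does the dir (or symlink) resolve?
    print(glob.glob(os.path.join(root, '*'))[:5])           # song folders, if any
    print(glob.glob(os.path.join(root, '*', '*.wav'))[:5])  # any audio actually visible?
    # Empty lists here usually mean a broken symlink or wrong placement, which
    # would explain the empty spk_map and phone set in the output above.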

Question about Denoiser Residual Block

Hello, I have a question about the structure of the denoiser's residual blocks.
The paper states that the model was inspired by DiffWave, but there is a slight difference between their architectures.
DiffSinger directly adds the step embedding to the input in every block, whereas in DiffWave every block has an extra FC layer that processes the step embedding before adding it.
What is the reason behind this change? Is there a performance difference?
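
A minimal sketch of the structural difference being asked about (illustrative shapes only, not the exact modules of either repo):

    import torch
    import torch.nn as nn

    class AddDirectBlock(nn.Module):
        # DiffSinger-style, as described above: the shared step embedding is
        # added directly to the block input, with no per-block projection.
        def __init__(self, channels):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x, step_emb):        # x: [B, C, T], step_emb: [B, C]
            return self.conv(x + step_emb.unsqueeze(-1))

    class PerBlockFCBlock(AddDirectBlock):
        # DiffWave-style: each block owns an extra FC that processes the step
        # embedding before it is added.
        def __init__(self, channels):
            super().__init__(channels)
            self.step_proj = nn.Linear(channels, channels)

        def forward(self, x, step_emb):
            return self.conv(x + self.step_proj(step_emb).unsqueeze(-1))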

Possibly missing file between Step 1 and 2 for SVS?

Hi, thank you very much for your valuable SVS corpus and code.

I strictly followed your instructions up to step "2. Training Example" for SVS in https://github.com/MoonInTheRiver/DiffSinger. Then I got somewhat stuck. The error message is:


Validation sanity check: 0%| | 0/1 [00:00<?, ?batch/s]
Traceback (most recent call last):
  File "tasks/run.py", line 19, in <module>
    run_task()
  File "tasks/run.py", line 14, in run_task
    task_cls.start()
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/tasks/base_task.py", line 256, in start
    trainer.fit(task)
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 565, in run_pretrain_routine
    self.evaluate(model, self.get_val_dataloaders(), self.num_sanity_val_steps, self.testing)
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 1173, in evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/usr/diffsinger_task.py", line 93, in __getitem__
    fs2_mel = torch.Tensor(np.load(f'{fs2_ckpt}/P_mels_npy/{item_name}.npy'))  # ~M generated by FFT-singer.
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/popcs_fs2_pmf0_1230/P_mels_npy/popcs-说散就散-0000.npy'


It seems that the required file is not properly produced in the "1. Data Preparation" step, though that step completed successfully with the following output:


test_input_dir: , test_num: 0, test_prefixes: ['popcs-说散就散', 'popcs-隐形的翅膀'], test_set_name: test, timesteps: 100,
train_set_name: train, use_denoise: False, use_energy_embed: False, use_gt_dur: True, use_gt_f0: True,
use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False,
use_split_spk_id: False, use_uv: True, use_var_enc: False, val_check_interval: 2000, valid_num: 0,
valid_set_name: valid, validate: False, vocoder: vocoders.hifigan.HifiGAN, vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128, warmup_updates: 2000,
weight_decay: 0, win_size: 512, work_dir: ,
| Binarizer: <class 'data_gen.singing.binarize.SingingBinarizer'>
| spk_map: {'SPK1': 0}
| Build phone set: ['', '', '', 'a', 'ai', 'an', 'ang', 'ao', 'b', 'c', 'ch', 'd', 'e', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'iou', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'uei', 'uen', 'uo', 'v', 'van', 've', 'vn', 'x', 'z', 'zh', '|']
100%|██████████████████████████████| 27/27 [00:13<00:00, 2.01it/s]
| valid total duration: 330.677s
100%|██████████████████████████████| 27/27 [00:13<00:00, 2.04it/s]
| test total duration: 330.677s
100%|██████████████████████████████| 1624/1624 [11:55<00:00, 2.27it/s]
| train total duration: 20878.560s


I guess the output of Step 1 and the input of Step 2 are not chained perfectly. Any help or hints would be welcome. Thank you in advance.

Multi-speaker singing

By enabling the "with_spk_embed" option and then retraining the model, can it support multi-speaker singing?

How to run inference on other <midi, text> files using the opencpop phone_set.json?

Hi, thanks for the great work!
I want to run inference on my own <midi, text> files. I generated the corresponding meta.json and tried to binarize it, but the binarizer can only generate phone_set.json from the input files, which is incompatible with the pretrained checkpoint (since it was pretrained on opencpop with the opencpop phone_set.json). A workaround is to preprocess the customized test set together with the original opencpop meta.json to obtain the full phone_set.json (i.e., the same phoneme dict), but that is quite inconvenient.
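
One workaround sketch (hedged; the file location is an assumption based on the paths mentioned in this thread): load the phone_set.json that ships with the pretrained checkpoint's binary data and build the phoneme-to-id map from it, instead of letting the binarizer rebuild the dict from your own test set:

    import json

    # Path is an assumption taken from the error messages in this thread.
    with open('data/binary/opencpop-midi-dp/phone_set.json') as f:
        phone_set = json.load(f)   # the dict the checkpoint was trained with

    ph2id = {ph: i for i, ph in enumerate(phone_set)}
    print(len(phone_set), 'phones;', 'id of "a" =', ph2id.get('a'))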

How to convert the notes in 'opencpop/segments/transcriptions'?

Thank you very much for your great work. I notice that in the original transcription file of opencpop, the notes look like this: G#4/Ab4 G#4/Ab4 G#4/Ab4 G#4/Ab4 F#4/Gb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 E4 E4 E4 E4 D#4/Eb4 D#4/Eb4 D#4/Eb4 D#4/Eb4 rest E4 E4 E4 E4 rest. Could you please tell me how you convert the notes into numbers like "68 68 68 68 66 66 66..." as mentioned in https://github.com/MoonInTheRiver/DiffSinger/blob/master/usr/configs/midi/cascade/opencs/eg_opencpop.png? (I'm new to music and MIDI.)
Thanks :)
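
For readers with the same question: those numbers look like standard MIDI note numbers (A4 = 69, so G#4 = 68, F#4 = 66, E4 = 64, matching the quoted sequence). A small conversion sketch (my own illustration, not necessarily the repo's code; mapping 'rest' to 0 is an assumption):

    # Convert note names like 'G#4/Ab4' into MIDI numbers.
    SEMITONE = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'D#': 3, 'Eb': 3, 'E': 4,
                'F': 5, 'F#': 6, 'Gb': 6, 'G': 7, 'G#': 8, 'Ab': 8, 'A': 9,
                'A#': 10, 'Bb': 10, 'B': 11}

    def note_to_midi(name):
        if name == 'rest':
            return 0                      # assumption: rests encoded as 0
        name = name.split('/')[0]         # 'G#4/Ab4' -> 'G#4' (same pitch, two spellings)
        pitch, octave = name[:-1], int(name[-1])
        return 12 * (octave + 1) + SEMITONE[pitch]

    print([note_to_midi(n) for n in ['G#4/Ab4', 'F#4/Gb4', 'E4', 'D#4/Eb4', 'rest']])
    # -> [68, 66, 64, 63, 0]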

CUDA: out of memory

Hi, team:

I am trying to train with the command CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset, but it fails with RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 8.00 GiB total capacity; 6.08 GiB already allocated; 0 bytes free; 6.81 GiB reserved in total by PyTorch).

It's a 3060 GPU with 8 GB of memory. What can I configure to make it work?

How to train on another language?

I think this DiffSinger model is based on Chinese.
Please give me advice on how to train it on another language.
Thank you for sharing!

Determining the durations of segmentation operators (|)

The MFA outputs don't really provide the durations/frames between words, and I found that this project uses the duration of the SEG token (word separator). It is often 0 and other times not, so I wanted to ask how you obtain it in the preprocessing step?

Some details about re-training MFA?

Thank you very much for providing PopCS for free!
When reading your paper, I noticed you re-trained a Montreal Forced Aligner model to build the PopCS dataset. Could you please provide some training details, such as: 1) what data was used as the training data, and 2) the amount of training data?

Error: different shape of model parameters when generating an example using DiffSinger

05/14 04:33:48 AM gpu available: True, used: True
| model Arch:  OfflineGaussianDiffusion
.... 


Traceback (most recent call last):
  File "tasks/run.py", line 15, in <module>
    run_task()
  File "tasks/run.py", line 10, in run_task
    task_cls.start()
  File "/mnt/DiffSinger/tasks/base_task.py", line 258, in start
    trainer.test(task)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 586, in test
    self.fit(model)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 541, in run_pretrain_routine
    self.restore_weights(model)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 617, in restore_weights
    self.restore_state_if_checkpoint_exists(model)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 655, in restore_state_if_checkpoint_exists
    self.restore(last_ckpt_path, self.on_gpu)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 668, in restore
    model.load_state_dict(checkpoint['state_dict'], strict=False)
  File "/root/miniconda3/envs/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DiffSingerOfflineTask:
        size mismatch for model.fs2.encoder_embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).
        size mismatch for model.fs2.encoder.embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).

I got this runtime error when I tried

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset --infer

exactly as in the README.
I changed the versions of scipy and torch slightly; I don't know if that is the problem.

Basically I just followed the instructions. I have put

the pretrained model of [DiffSinger](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_ds_beta6_offline_pmf0_1230.zip);
the pretrained model of [FFT-Singer](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_fs2_pmf0_1230.zip), which is needed for the shallow diffusion mechanism in DiffSinger;

in checkpoints,
and put the data example in /DiffSinger/data/processed/popcs/popcs-说散就散, then preprocessed it, producing mnt/DiffSinger/data/binary/popcs-pmf0.

I suspect I got something wrong, or did the author unintentionally release an ill-shaped pretrained model?

about 'An Easier Trick for Boundary Prediction'

In your paper, we can get the predicted boundary as follows:
(screenshot of the boundary-prediction equation from the paper)

then I implemented 'An Easier Trick for Boundary Prediction' in my repo following the trick:
https://github.com/keonlee9420/DiffSinger/blob/f849f8def5abb38ad272a384e8bec838ea1957a4/boundary_predictor.py#L14-L45

and there are some helper functions for that (please focus on the expected_kld_t and expected_kld_T functions):
https://github.com/keonlee9420/DiffSinger/blob/f849f8def5abb38ad272a384e8bec838ea1957a4/model/diffusion.py#L351-L389

But as I noted in my README.md (item 2 of the note section), the predicted boundary for LJSpeech is 100, which is the same as the total timesteps in the Naive version.

So I'd like to ask you to briefly check my implementation. Could you please take a look at it and let me know if I missed something? Why do you think my boundary predictor shows an unexpected K_step?

FYI, here is the sample output log of running boundary_predictor.py:

==================================== Prediction Configuration ====================================
 ---> Total Batch Size: 48
 ---> Path of ckpt: ./output/ckpt/LJSpeech_shallow_el_4
================================================================================================
100%|██████████████████████████████| 11/11 [00:08<00:00,  1.34it/s]
[tensor(6959.2134, device='cuda:0'), tensor(933.3702, device='cuda:0'), tensor(403.9860, device='cuda:0'), tensor(249.2317, device='cuda:0'), tensor(183.4001, device='cuda:0'), tensor(149.2621, device='cuda:0'), tensor(129.2204, device='cuda:0'), tensor(116.2622, device='cuda:0'), tensor(107.4923, device='cuda:0'), tensor(101.0867, device='cuda:0'), tensor(96.2093, device='cuda:0'), tensor(92.4524, device='cuda:0'), tensor(89.3728, device='cuda:0'), tensor(86.7645, device='cuda:0'), tensor(84.4990, device='cuda:0'), tensor(82.5240, device='cuda:0'), tensor(80.7848, device='cuda:0'), tensor(79.1111, device='cuda:0'), tensor(77.5320, device='cuda:0'), tensor(76.0396, device='cuda:0'), tensor(74.6199, device='cuda:0'), tensor(73.2726, device='cuda:0'), tensor(71.9328, device='cuda:0'), tensor(70.6272, device='cuda:0'), tensor(69.2854, device='cuda:0'), tensor(68.0120, device='cuda:0'), tensor(66.7351, device='cuda:0'), tensor(65.4260, device='cuda:0'), tensor(64.1837, device='cuda:0'), tensor(62.9117, device='cuda:0'), tensor(61.6452, device='cuda:0'), tensor(60.3592, device='cuda:0'), tensor(59.0823, device='cuda:0'), tensor(57.8210, device='cuda:0'), tensor(56.5481, device='cuda:0'), tensor(55.2716, device='cuda:0'), tensor(54.0141, device='cuda:0'), tensor(52.7686, device='cuda:0'), tensor(51.4833, device='cuda:0'), tensor(50.2068, device='cuda:0'), tensor(48.9261, device='cuda:0'), tensor(47.6881, device='cuda:0'), tensor(46.4407, device='cuda:0'), tensor(45.2071, device='cuda:0'), tensor(43.9496, device='cuda:0'), tensor(42.7181, device='cuda:0'), tensor(41.5266, device='cuda:0'), tensor(40.2994, device='cuda:0'), tensor(39.1266, device='cuda:0'), tensor(37.9398, device='cuda:0'), tensor(36.7822, device='cuda:0'), tensor(35.6130, device='cuda:0'), tensor(34.5006, device='cuda:0'), tensor(33.3484, device='cuda:0'), tensor(32.2580, device='cuda:0'), tensor(31.1593, device='cuda:0'), tensor(30.1051, device='cuda:0'), tensor(29.0614, device='cuda:0'), tensor(28.0244, device='cuda:0'), tensor(27.0115, device='cuda:0'), tensor(26.0248, device='cuda:0'), tensor(25.0589, device='cuda:0'), tensor(24.1051, device='cuda:0'), tensor(23.1736, device='cuda:0'), tensor(22.2743, device='cuda:0'), tensor(21.3856, device='cuda:0'), tensor(20.5282, device='cuda:0'), tensor(19.6825, device='cuda:0'), tensor(18.8733, device='cuda:0'), tensor(18.0839, device='cuda:0'), tensor(17.3134, device='cuda:0'), tensor(16.5815, device='cuda:0'), tensor(15.8417, device='cuda:0'), tensor(15.1426, device='cuda:0'), tensor(14.4522, device='cuda:0'), tensor(13.8025, device='cuda:0'), tensor(13.1645, device='cuda:0'), tensor(12.5432, device='cuda:0'), tensor(11.9491, device='cuda:0'), tensor(11.3789, device='cuda:0'), tensor(10.8328, device='cuda:0'), tensor(10.2960, device='cuda:0'), tensor(9.7815, device='cuda:0'), tensor(9.2841, device='cuda:0'), tensor(8.8136, device='cuda:0'), tensor(8.3660, device='cuda:0'), tensor(7.9211, device='cuda:0'), tensor(7.5027, device='cuda:0'), tensor(7.1040, device='cuda:0'), tensor(6.7245, device='cuda:0'), tensor(6.3511, device='cuda:0'), tensor(6.0048, device='cuda:0'), tensor(5.6679, device='cuda:0'), tensor(5.3475, device='cuda:0'), tensor(5.0427, device='cuda:0'), tensor(4.7507, device='cuda:0'), tensor(4.4784, device='cuda:0'), tensor(4.2143, device='cuda:0'), tensor(3.9639, device='cuda:0'), tensor(3.7258, device='cuda:0')]
tensor(0.2382, device='cuda:0')

Predicted Boundary K is 100

Thanks in advance!
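
Not the authors, but for readers following along: the log above is consistent with a rule of the form "K is the earliest step whose expected KLD has dropped to the terminal reference value". Since the smallest per-step value (3.73) never reaches kld_T (0.2382), such a predictor would fall through to K = 100. A hedged sketch of that reading, not a verified reimplementation of the paper's trick:

    def predict_boundary(klds, kld_T):
        # klds[t-1]: expected KLD at diffusion step t (the descending list in the log);
        # kld_T: the terminal reference value (0.2382 above).
        for t, kld in enumerate(klds, start=1):
            if kld <= kld_T:
                return t          # earliest step already "close enough" to the prior
        return len(klds)          # never crosses, so K collapses to T, as in this log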

Inference with unseen songs

Hi. Since DiffSinger(PopCS) needs ground-truth f0 information at inference, is it possible to synthesize an unseen song (with phoneme labels, phoneme durations and notes provided) using the DiffSinger(PopCS) model?

DiffSinger infer problem

I want to test the Opencpop pretrained model on an unseen song, but I don't know how to generate the wav file.

  1. What data should I prepare for the model?
  2. How do I do it? I saw test_step in FastSpeech2Task, but it seems to be for the TTS task. Do I need to override test_step in DiffSingerMIDITask? Or is there another way to solve this, without packing data into a dataloader: just load the model and run inference?

Proposing a fix for inconsistent f0 length caused by different versions of parselmouth

Hi,
I noticed that different versions of parselmouth result in different lengths of the computed f0. This is also mentioned in a comment in your code.

# Attention: we find that new version of some libraries could cause ``rpad'' to be a negetive value...

I think the reason for this is that, in parselmouth v0.4.0, the default value of the very_accurate parameter of Sound.to_pitch_ac was changed to False, matching Praat's default (see https://github.com/YannickJadoul/Parselmouth/releases). Though I haven't tested rigorously, I think adding very_accurate=True in get_pitch would fix the inconsistent-length issue and make your code more compatible.
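
For concreteness, the proposed change would look roughly like this (a sketch assuming get_pitch calls Sound.to_pitch_ac; all parameter values other than very_accurate are illustrative):

    import parselmouth

    def extract_f0(wav, sr, hop_size, f0_min=80, f0_max=750):
        # very_accurate=True pins the pre-v0.4.0 parselmouth behaviour, keeping
        # the number of f0 frames consistent across library versions.
        return parselmouth.Sound(wav, sampling_frequency=sr).to_pitch_ac(
            time_step=hop_size / sr,
            voicing_threshold=0.6,
            pitch_floor=f0_min,
            pitch_ceiling=f0_max,
            very_accurate=True,
        ).selected_array['frequency']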

question about fs2 infer

Hi, thank you very much for your valuable SVS corpus and code.
(The body of this issue quotes the report from "Possibly missing file between Step 1 and 2 for SVS?" above, ending in FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/popcs_fs2_pmf0_1230/P_mels_npy/popcs-说散就散-0000.npy'.)

Yes, you are right, there is a problem. You need to run "CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer" beforehand to produce the "P_mels_npy" files. I have fixed the readme file. Thanks for your report!

I have a question about this: if we are training from scratch and don't have any saved models for inference, how are the P_mels_npy files (predicted mels, I guess) generated?

Originally posted by @Cescfangs in #11 (comment)

AttributeError: 'HifiGAN' object has no attribute 'model'

I'm following the instructions to train a model for DiffSinger, but there seem to be some issues with vocoders/hifigan.py:

"# first run fs2 infer;"

$ CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer 
...
| model Trainable Parameters: 24.256M
  0%|                                                                                                                                                       | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):                                                                                                                          | 0/1 [00:00<?, ?it/s]
  File "tasks/run.py", line 15, in <module>
    run_task()
  File "tasks/run.py", line 10, in run_task
    task_cls.start()
  File "/.../DiffSinger/tasks/base_task.py", line 258, in start
    trainer.test(task)
  File "/.../DiffSinger/utils/pl_utils.py", line 586, in test
    self.fit(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 545, in run_pretrain_routine
    self.run_evaluation(test=True)
  File "/.../DiffSinger/utils/pl_utils.py", line 1245, in run_evaluation
    eval_results = self.evaluate(self.model,
  File "/.../DiffSinger/utils/pl_utils.py", line 1185, in evaluate
    output = self.evaluation_forward(model,
  File "/.../DiffSinger/utils/pl_utils.py", line 1307, in evaluation_forward
    output = model.test_step(*args)
  File "/.../DiffSinger/tasks/tts/fs2.py", line 363, in test_step
    return self.after_infer(sample)
  File "/.../DiffSinger/tasks/tts/fs2.py", line 410, in after_infer
    wav_pred = self.vocoder.spec2wav(mel_pred, f0=f0_pred)
  File "/.../DiffSinger/vocoders/hifigan.py", line 56, in spec2wav
    device = self.device
AttributeError: 'HifiGAN' object has no attribute 'device'
Testing:   0%|          | 0/14 [00:01<?, ?batch/s]

So I've added self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") to the __init__ of HifiGAN, but then I get this:

| model Trainable Parameters: 24.256M
  0%|                                                                                                                                                       | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):                                                                                                                          | 0/1 [00:00<?, ?it/s]
  File "tasks/run.py", line 15, in <module>
    run_task()
  File "tasks/run.py", line 10, in run_task
    task_cls.start()
  File "/.../DiffSinger/tasks/base_task.py", line 258, in start
    trainer.test(task)
  File "/.../DiffSinger/utils/pl_utils.py", line 586, in test
    self.fit(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 545, in run_pretrain_routine
    self.run_evaluation(test=True)
  File "/.../DiffSinger/utils/pl_utils.py", line 1245, in run_evaluation
    eval_results = self.evaluate(self.model,
  File "/.../DiffSinger/utils/pl_utils.py", line 1185, in evaluate
    output = self.evaluation_forward(model,
  File "/.../DiffSinger/utils/pl_utils.py", line 1307, in evaluation_forward
    output = model.test_step(*args)
  File "/.../DiffSinger/tasks/tts/fs2.py", line 363, in test_step
    return self.after_infer(sample)
  File "/.../DiffSinger/tasks/tts/fs2.py", line 410, in after_infer
    wav_pred = self.vocoder.spec2wav(mel_pred, f0=f0_pred)
  File "/.../DiffSinger/vocoders/hifigan.py", line 64, in spec2wav
    y = self.model(c, f0).view(-1)
AttributeError: 'HifiGAN' object has no attribute 'model'
Testing:   0%|          | 0/14 [00:01<?, ?batch/s]   
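
A hedged reading of the second traceback, inferred from the symptoms rather than the source: self.model on HifiGAN is presumably only set once the vocoder checkpoint is found and loaded, so adding the missing device attribute merely uncovers the real problem, namely that the vocoder checkpoint was never loaded. Worth checking before patching further:

    import os

    # vocoder_ckpt as it appears in the hparams dumps earlier on this page.
    ckpt_dir = 'checkpoints/0109_hifigan_bigpopcs_hop128'
    print(os.path.isdir(ckpt_dir))
    print(os.listdir(ckpt_dir) if os.path.isdir(ckpt_dir) else 'missing')
    # If this prints False/'missing', download the 0109_hifigan_bigpopcs_hop128
    # vocoder release and unzip it into checkpoints/ before running inference.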

about some missing parts

Hi, thanks for your contribution to DiffSinger! And also thanks for mentioning my implementation; I only realized it yesterday :)

With your detailed documentation in the README and the paper, I can reproduce the training & inference procedure and the results with this repo. But in doing so, I found some parts missing for full training of the shallow version: I think the current code only supports a forced K (which is 71) with the pre-trained FastSpeech2 (especially its decoder). If I understood correctly, we need the boundary-prediction process and FastSpeech2 pre-training before training DiffSpeech in the shallow mode. Maybe I missed it somewhere in the repo, but if it is not yet pushed, I wonder whether you plan to provide that part soon.

Thanks in advance!

Best,
keon
