moonintheriver / diffsinger

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code

License: MIT License

Python 100.00%
text-to-speech diffusion-speedup tts aaai2022 singing-synthesis diffusion-model speech-synthesis singing-voice-synthesis singing-voice singing-voice-database midi

diffsinger's Introduction

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism


This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose DiffSinger (for Singing-Voice-Synthesis) and DiffSpeech (for Text-to-Speech).

🎉🎉🎉 Updates:

  • Sep.11, 2022: 🔌 DiffSinger-PN. Added the plug-in PNDM (ICLR 2022, from our laboratory) to accelerate DiffSinger for free.
  • Jul.27, 2022: Updated the documents for SVS. Added easy inference A & B; added interactive SVS running on HuggingFace🤗 SVS.
  • Mar.2, 2022: MIDI-B version.
  • Mar.1, 2022: NeuralSVB, for singing voice beautifying, has been released.
  • Feb.13, 2022: NATSpeech, the improved code framework, which contains the implementations of DiffSpeech and our NeurIPS-2021 work PortaSpeech, has been released.
  • Jan.29, 2022: Support for the MIDI-A version of SVS.
  • Jan.13, 2022: Support for SVS; released the PopCS dataset.
  • Dec.19, 2021: Support for TTS. HuggingFace🤗 TTS.

🚀 News:

  • Feb.24, 2022: Our new work NeuralSVB was accepted by ACL-2022 (arXiv). Demo page.
  • Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
  • Sep.29, 2021: Our recent work PortaSpeech: Portable and High-Quality Generative Text-to-Speech was accepted by NeurIPS-2021 (arXiv).
  • May.06, 2021: We submitted DiffSinger to arXiv.

Environments

  1. If you want to use an anaconda environment:

    conda create -n your_env_name python=3.8
    source activate your_env_name
    pip install -r requirements_2080.txt   # GPU 2080Ti, CUDA 10.2
    # or:
    pip install -r requirements_3090.txt   # GPU 3090, CUDA 11.4
  2. Or, if you want to use a Python virtual environment:

    ## Install Python 3.8 first. 
    python -m venv venv
    source venv/bin/activate
    # install requirements.
    pip install -U pip
    pip install Cython numpy==1.19.1
    pip install torch==1.9.0
    pip install -r requirements.txt
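
A quick optional sanity check after either setup (assuming the torch==1.9.0 pin from the venv route; the conda requirements files may pin a different version):

    # optional: verify that the install sees the GPU
    import torch

    print(torch.__version__)           # e.g. 1.9.0 with the venv instructions above
    print(torch.cuda.is_available())   # should print True on the 2080Ti / 3090 setups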

Documents

Overview

| Mel Pipeline | Dataset | Pitch Input | F0 Prediction | Acceleration Method | Vocoder |
|---|---|---|---|---|---|
| DiffSpeech (Text->F0, Text+F0->Mel, Mel->Wav) | Ljspeech | None | Explicit | Shallow Diffusion | HiFiGAN |
| DiffSinger (Lyric+F0->Mel, Mel->Wav) | PopCS | Ground-Truth F0 | None | Shallow Diffusion | NSF-HiFiGAN |
| DiffSinger (Lyric+MIDI->F0, Lyric+F0->Mel, Mel->Wav) | OpenCpop | MIDI | Explicit | Shallow Diffusion | NSF-HiFiGAN |
| FFT-Singer (Lyric+MIDI->F0, Lyric+F0->Mel, Mel->Wav) | OpenCpop | MIDI | Explicit | Invalid | NSF-HiFiGAN |
| DiffSinger (Lyric+MIDI->Mel, Mel->Wav) | OpenCpop | MIDI | Implicit | None | Pitch-Extractor + NSF-HiFiGAN |
| DiffSinger+PNDM (Lyric+MIDI->Mel, Mel->Wav) | OpenCpop | MIDI | Implicit | PLMS | Pitch-Extractor + NSF-HiFiGAN |
| DiffSpeech+PNDM (Text->Mel, Mel->Wav) | Ljspeech | None | Implicit | PLMS | HiFiGAN |

Tensorboard

tensorboard --logdir_spec exp_name

Citation

@article{liu2021diffsinger,
  title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
  author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Liu, Peng and Zhao, Zhou},
  journal={arXiv preprint arXiv:2105.02446},
  volume={2},
  year={2021}}

Acknowledgements

Special thanks to:

  • Team OpenVPI's maintenance: DiffSinger.
  • Your re-creation and sharing.

diffsinger's People

Contributors

luping-liu, moonintheriver, mrzixi


diffsinger's Issues

Why does it need to run fs2 inference before training ds?

# first run fs2 infer;
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer 
# second run ds train;
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_exp2 --reset

Audio generated with the README's pretrained models and steps does not match expectations

The audio generated by following the README's steps with the pretrained models is completely garbled. No errors were raised during the process. May I ask what the cause could be?
(screenshot attached: 企业微信截图_3d02a6ad-1c2a-4a49-8a8c-047597aa3698)
(audio attached: [popcs-说散就散-0001][P]算了吧@我付出过甚么没关系@我忽略自己@就因为遇见你-popcs_exp2.wav.zip)

size mismatch for model.encoder_embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).

Having successfully run step 1, data preparation, I am now trying to run inference. I am using the given dataset preview.
Running CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer according to the readme.md, I end up with this error:

| model Trainable Parameters: 24.253M
Traceback (most recent call last):
  File "tasks/run.py", line 15, in <module>
    run_task()
  File "tasks/run.py", line 10, in run_task
    task_cls.start()
  File "/.../DiffSinger/tasks/base_task.py", line 258, in start
    trainer.test(task)
  File "/.../DiffSinger/utils/pl_utils.py", line 586, in test
    self.fit(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 541, in run_pretrain_routine
    self.restore_weights(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 617, in restore_weights
    self.restore_state_if_checkpoint_exists(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 655, in restore_state_if_checkpoint_exists
    self.restore(last_ckpt_path, self.on_gpu)
  File "/.../DiffSinger/utils/pl_utils.py", line 668, in restore
    model.load_state_dict(checkpoint['state_dict'], strict=False)
  File "/.../envs/DiffSinger/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for FastSpeech2Task:
	size mismatch for model.encoder_embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).
	size mismatch for model.encoder.embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).

Do you have any ideas on what could be wrong here and how to resolve it?
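
A hedged diagnostic sketch for this kind of mismatch (the .ckpt filename below is a placeholder; use whichever checkpoint file the release zip actually contains): compare the phone set your binarized data produced against the embedding rows in the checkpoint.

    import json
    import torch

    # Paths/filenames are illustrative; the state-dict key comes from the traceback above.
    phone_set = json.load(open('data/binary/popcs-pmf0/phone_set.json'))
    ckpt = torch.load('checkpoints/popcs_fs2_pmf0_1230/model_ckpt.ckpt', map_location='cpu')
    emb = ckpt['state_dict']['model.encoder_embed_tokens.weight']
    print(len(phone_set), 'phones in data vs', emb.shape[0], 'embedding rows in checkpoint')
    # 57 vs 62 suggests the data was binarized with a smaller phoneme dict than
    # the one the checkpoint was trained with (e.g., from the dataset preview only).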

why set `uv = f0 = 0` in norm_interp_f0?

    def norm_interp_f0(f0, hparams):
        is_torch = isinstance(f0, torch.Tensor)
        if is_torch:
            device = f0.device
            f0 = f0.data.cpu().numpy()
        uv = f0 == 0
        f0 = norm_f0(f0, uv, hparams)

Hey guys, I noticed you set uv = f0 == 0 at line 50. What's the intention behind this? Is it a typo, or am I missing something important?
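
For what it's worth, a hedged note rather than an official answer: that line is an elementwise comparison, not a chained assignment, so uv ends up as a boolean mask marking the unvoiced (f0 = 0) frames:

    import numpy as np

    f0 = np.array([0.0, 220.5, 221.0, 0.0, 230.2])
    uv = f0 == 0       # elementwise comparison, not assignment
    print(uv)          # [ True False False  True False] -> mask of unvoiced frames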

Help! Data generation errors occurred

When I used binarize.py to generate the data, the following errors occurred:

Traceback (most recent call last):
  File "/home/saxsax/svs/diff/utils/multiprocess_utils.py", line 13, in chunked_worker
    res = map_func(*arg)
  File "/home/saxsax/svs/diff/data_gen/singing/binarize.py", line 167, in process_item
    cls.get_pitch(wav, mel, res)
  File "/home/saxsax/svs/diff/data_gen/tts/base_binarizer.py", line 201, in get_pitch
    f0, pitch_coarse = get_pitch(wav, mel, hparams)
  File "/home/saxsax/svs/diff/data_gen/tts/data_gen_utils.py", line 174, in get_pitch
    f0 = np.pad(f0, [[lpad, rpad]], mode='constant')
  File "<array_function internals>", line 5, in pad
  File "/home/saxsax/miniconda3/envs/pytorch/lib/python3.9/site-packages/numpy/lib/arraypad.py", line 743, in pad
    pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
  File "/home/saxsax/miniconda3/envs/pytorch/lib/python3.9/site-packages/numpy/lib/arraypad.py", line 510, in _as_pairs
    raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values

Boundary Prediction

Hi, great job! I have a question: my calculated K_step is always equal to timesteps. Can you provide the calculation code related to Boundary Prediction?

Training on Opencpop data failed

I followed the document in usr/configs/midi/readme-e2e.md.
After running the train command, I got the error [Errno 2] No such file or directory: 'data/binary/opencpop-midi-dp/phone_set.json'.
I tried using usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml to run data_gen/tts/bin/binarize.py, but still got an error:
FileNotFoundError: [Errno 2] No such file or directory: 'data/raw/popcs/transcriptions.txt'.
Does this mean that even when using Opencpop, I still need the PopCS data?

Does the model take the phoneme duration as input at inference?

Thanks for your wonderful work!
I was running inference with 0128_opencpop_ds58_midi, but there's a problem that bothers me.

https://github.com/MoonInTheRiver/DiffSinger/blob/master/tasks/tts/fs2.py#L348

    ############
    # infer
    ############
    def test_step(self, sample, batch_idx):
        spk_embed = sample.get('spk_embed') if not hparams['use_spk_id'] else sample.get('spk_ids')
        txt_tokens = sample['txt_tokens']
        mel2ph, uv, f0 = None, None, None
        ref_mels = None
        if hparams['profile_infer']:
            pass
        else:
            if hparams['use_gt_dur']:
                mel2ph = sample['mel2ph']
            if hparams['use_gt_f0']:
                f0 = sample['f0']
                uv = sample['uv']
                print('Here using gt f0!!')
            if hparams.get('use_midi') is not None and hparams['use_midi']:
                outputs = self.model(
                    txt_tokens, spk_embed=spk_embed, mel2ph=mel2ph, f0=f0, uv=uv, ref_mels=ref_mels, infer=True,
                    pitch_midi=sample['pitch_midi'], midi_dur=sample.get('midi_dur'), is_slur=sample.get('is_slur'))
            else:
                outputs = self.model(
                    txt_tokens, spk_embed=spk_embed, mel2ph=mel2ph, f0=f0, uv=uv, ref_mels=ref_mels, infer=True)

The param use_gt_dur is True; that is, the model takes the ground-truth phoneme duration as input at inference.
Is that correct?

Data Preparation not working?

Thank you for sharing this project!
I am trying to run inference on a pre-trained model (DiffSinger), following the directions in the README. I have downloaded your dataset preview and trained models. I have tried both using a symlink to the dataset as instructed and also placing everything in data/processed/popcs/ directly.

At step 1, packing the dataset, I seem to run into a problem:

(DiffSinger) user@ubuntu:~/.../DiffSinger$ CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/popcs_ds_beta6.yaml

| Hparams chains:  ['configs/config_base.yaml', 'configs/tts/base.yaml', 'configs/tts/fs2.yaml', 'configs/tts/base_zh.yaml', 'configs/singing/base.yaml', 'usr/configs/base.yaml', 'usr/configs/popcs_ds_beta6.yaml']
| Hparams: 
K_step: 51, accumulate_grad_batches: 1, audio_num_mel_bins: 80, audio_sample_rate: 24000, base_config: ['configs/tts/fs2.yaml', 'configs/singing/base.yaml', './base.yaml'], 
binarization_args: {'shuffle': False, 'with_txt': True, 'with_wav': True, 'with_align': True, 'with_spk_embed': False, 'with_f0': True, 'with_f0cwt': True}, binarizer_cls: data_gen.singing.binarize.SingingBinarizer, binary_data_dir: data/binary/popcs-pmf0, check_val_every_n_epoch: 10, clip_grad_norm: 1, 
content_cond_steps: [], cwt_add_f0_loss: False, cwt_hidden_size: 128, cwt_layers: 2, cwt_loss: l1, 
cwt_std_scale: 0.8, datasets: ['popcs'], debug: False, dec_ffn_kernel_size: 9, dec_layers: 4, 
decay_steps: 50000, decoder_type: fft, dict_dir: , diff_decoder_type: wavenet, diff_loss_type: l1, 
dilation_cycle_length: 1, dropout: 0.1, ds_workers: 4, dur_enc_hidden_stride_kernel: ['0,2,3', '0,2,3', '0,1,3'], dur_loss: mse, 
dur_predictor_kernel: 3, dur_predictor_layers: 2, enc_ffn_kernel_size: 9, enc_layers: 4, encoder_K: 8, 
encoder_type: fft, endless_ds: True, ffn_act: gelu, ffn_padding: SAME, fft_size: 512, 
fmax: 12000, fmin: 30, fs2_ckpt: , gen_dir_name: , gen_tgt_spk_id: -1, 
hidden_size: 256, hop_size: 128, infer: False, keep_bins: 80, lambda_commit: 0.25, 
lambda_energy: 0.0, lambda_f0: 0.0, lambda_ph_dur: 0.0, lambda_sent_dur: 0.0, lambda_uv: 0.0, 
lambda_word_dur: 0.0, load_ckpt: , log_interval: 100, loud_norm: False, lr: 0.001, 
max_beta: 0.06, max_epochs: 1000, max_eval_sentences: 1, max_eval_tokens: 60000, max_frames: 5000, 
max_input_tokens: 1550, max_sentences: 48, max_tokens: 20000, max_updates: 160000, mel_loss: ssim:0.5|l1:0.5, 
mel_vmax: 1.5, mel_vmin: -6, min_level_db: -120, norm_type: gn, num_ckpt_keep: 3, 
num_heads: 2, num_sanity_val_steps: 1, num_spk: 1, num_test_samples: 0, num_valid_plots: 10, 
optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, out_wav_norm: False, pitch_ar: False, pitch_enc_hidden_stride_kernel: ['0,2,5', '0,2,5', '0,2,5'], 
pitch_extractor: parselmouth, pitch_loss: l1, pitch_norm: log, pitch_type: frame, pre_align_args: {'use_tone': False, 'forced_align': 'mfa', 'use_sox': True, 'txt_processor': 'zh_g2pM', 'allow_no_txt': False, 'denoise': False}, 
pre_align_cls: data_gen.singing.pre_align.SingingPreAlign, predictor_dropout: 0.5, predictor_grad: 0.0, predictor_hidden: -1, predictor_kernel: 5, 
predictor_layers: 2, prenet_dropout: 0.5, prenet_hidden_size: 256, pretrain_fs_ckpt: , processed_data_dir: data/processed/popcs, 
profile_infer: False, raw_data_dir: data/raw/popcs, ref_norm_layer: bn, reset_phone_dict: True, residual_channels: 256, 
residual_layers: 20, save_best: False, save_ckpt: True, save_codes: ['configs', 'modules', 'tasks', 'utils', 'usr'], save_f0: True, 
save_gt: False, schedule_type: linear, seed: 1234, sort_by_len: True, spec_max: [0.2645, 0.0583, -0.2344, -0.0184, 0.1227, 0.1533, 0.1103, 0.1212, 0.2421, 0.1809, 0.2134, 0.3161, 0.3301, 0.3289, 0.2667, 0.2421, 0.2581, 0.26, 0.1394, 0.1907, 0.1082, 0.1474, 0.168, 0.255, 0.1057, 0.0826, 0.0423, 0.1203, -0.0701, -0.0056, 0.0477, -0.0639, -0.0272, -0.0728, -0.1648, -0.0855, -0.2652, -0.1998, -0.1547, -0.2167, -0.4181, -0.5463, -0.4161, -0.4733, -0.6518, -0.5387, -0.429, -0.4191, -0.4151, -0.3042, -0.381, -0.416, -0.4496, -0.2847, -0.4676, -0.4658, -0.4931, -0.4885, -0.5547, -0.5481, -0.6948, -0.7968, -0.8455, -0.8392, -0.877, -0.952, -0.8749, -0.7297, -0.8374, -0.8667, -0.7157, -0.9035, -0.9219, -0.8801, -0.9298, -0.9009, -0.9604, -1.0537, -1.0781, -1.3766], 
spec_min: [-6.8276, -7.027, -6.8142, -7.1429, -7.6669, -7.6, -7.1148, -6.964, -6.8414, -6.6596, -6.688, -6.7439, -6.7986, -7.494, -7.7845, -7.6586, -6.9288, -6.7639, -6.9118, -6.8246, -6.7183, -7.1769, -6.9794, -7.4513, -7.3422, -7.5623, -6.961, -6.8158, -6.9595, -6.8403, -6.5688, -6.6356, -7.0209, -6.5002, -6.7819, -6.5232, -6.6927, -6.5701, -6.5531, -6.7069, -6.6462, -6.4523, -6.5954, -6.4264, -6.4487, -6.707, -6.4025, -6.3042, -6.4008, -6.3857, -6.3903, -6.3094, -6.2491, -6.3518, -6.3566, -6.4168, -6.2481, -6.3624, -6.2858, -6.2575, -6.3638, -6.452, -6.1835, -6.2754, -6.1253, -6.1645, -6.0638, -6.1262, -6.071, -6.1039, -6.4428, -6.1363, -6.1054, -6.1252, -6.1797, -6.0235, -6.0758, -5.9453, -6.0213, -6.0446], spk_cond_steps: [], stop_token_weight: 5.0, task_cls: usr.diffsinger_task.DiffSingerTask, test_ids: [], 
test_input_dir: , test_num: 0, test_prefixes: ['popcs-说散就散', 'popcs-隐形的翅膀'], test_set_name: test, timesteps: 100, 
train_set_name: train, use_denoise: False, use_energy_embed: False, use_gt_dur: True, use_gt_f0: True, 
use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False, 
use_split_spk_id: False, use_uv: True, use_var_enc: False, val_check_interval: 2000, valid_num: 0, 
valid_set_name: valid, validate: False, vocoder: vocoders.hifigan.HifiGAN, vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128, warmup_updates: 2000, 
weight_decay: 0, win_size: 512, work_dir: , 
| Binarizer:  <class 'data_gen.singing.binarize.SingingBinarizer'>
| spk_map:  {}
| Build phone set:  []
0it [00:00, ?it/s]
| valid total duration: 0.000s
0it [00:00, ?it/s]
| test total duration: 0.000s
0it [00:00, ?it/s]
| train total duration: 0.000s

It creates the folder data/binary/popcs-pmf0 with 11 files, but they seem to be essentially empty.
Can you please tell me what I am missing? Why does it not find or use the dataset?
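
A hedged sanity check for readers hitting the same empty build (the binarizer found zero items in every split, which suggests it never saw the processed data; the per-song folder layout assumed below is a guess based on processed_data_dir: data/processed/popcs from the hparams dump):

    import glob
    import os

    root = 'data/processed/popcs'   # processed_data_dir from the hparams above
    print(os.path.isdir(root))                              # does the dir (or symlink) resolve?
    print(glob.glob(os.path.join(root, '*'))[:5])           # song folders, if any
    print(glob.glob(os.path.join(root, '*', '*.wav'))[:5])  # any audio actually visible?
    # Empty lists here usually mean a broken symlink or wrong placement, which
    # would explain the empty spk_map and phone set in the output above.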

Question about Denoiser Residual Block

Hello, I have a question about the structure of the denoiser's residual blocks.
The paper states that the model was inspired by DiffWave, but there is a slight difference between their architectures.
DiffSinger directly adds the step embedding to the input in every block, whereas in DiffWave every block has an extra FC layer that processes the step embedding before adding it.
What is the reason behind this change? Is there a performance difference?
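
A minimal sketch of the structural difference being asked about (illustrative shapes only, not the exact modules of either repo):

    import torch
    import torch.nn as nn

    class AddDirectBlock(nn.Module):
        # DiffSinger-style, as described above: the shared step embedding is
        # added directly to the block input, with no per-block projection.
        def __init__(self, channels):
            super().__init__()
            self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x, step_emb):        # x: [B, C, T], step_emb: [B, C]
            return self.conv(x + step_emb.unsqueeze(-1))

    class PerBlockFCBlock(AddDirectBlock):
        # DiffWave-style: each block owns an extra FC that processes the step
        # embedding before it is added.
        def __init__(self, channels):
            super().__init__(channels)
            self.step_proj = nn.Linear(channels, channels)

        def forward(self, x, step_emb):
            return self.conv(x + self.step_proj(step_emb).unsqueeze(-1))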

Possibly missing file between Step 1 and 2 for SVS?

Hi, thank you very much for your valuable SVS corpus and code.

I strictly followed your instructions up to step "2. Training Example" for SVS in https://github.com/MoonInTheRiver/DiffSinger. Then I got somewhat stuck. The error message is:


Validation sanity check: 0%| | 0/1 [00:00<?, ?batch/s]
Traceback (most recent call last):
  File "tasks/run.py", line 19, in <module>
    run_task()
  File "tasks/run.py", line 14, in run_task
    task_cls.start()
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/tasks/base_task.py", line 256, in start
    trainer.fit(task)
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 565, in run_pretrain_routine
    self.evaluate(model, self.get_val_dataloaders(), self.num_sanity_val_steps, self.testing)
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/utils/pl_utils.py", line 1173, in evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/juicefs_speech_tts_v2/public_data/tts_public_data/11090357/singing/diffsinger/DiffSinger/usr/diffsinger_task.py", line 93, in __getitem__
    fs2_mel = torch.Tensor(np.load(f'{fs2_ckpt}/P_mels_npy/{item_name}.npy'))  # ~M generated by FFT-singer.
  File "/root/miniconda3/envs/diffsinger/lib/python3.8/site-packages/numpy/lib/npyio.py", line 416, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/popcs_fs2_pmf0_1230/P_mels_npy/popcs-说散就散-0000.npy'


It seems that the required file is not properly produced in the "1. Data Preparation" step, though that step completed successfully with the following output:


test_input_dir: , test_num: 0, test_prefixes: ['popcs-说散就散', 'popcs-隐形的翅膀'], test_set_name: test, timesteps: 100,
train_set_name: train, use_denoise: False, use_energy_embed: False, use_gt_dur: True, use_gt_f0: True,
use_nsf: True, use_pitch_embed: True, use_pos_embed: True, use_spk_embed: False, use_spk_id: False,
use_split_spk_id: False, use_uv: True, use_var_enc: False, val_check_interval: 2000, valid_num: 0,
valid_set_name: valid, validate: False, vocoder: vocoders.hifigan.HifiGAN, vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128, warmup_updates: 2000,
weight_decay: 0, win_size: 512, work_dir: ,
| Binarizer: <class 'data_gen.singing.binarize.SingingBinarizer'>
| spk_map: {'SPK1': 0}
| Build phone set: ['', '', '', 'a', 'ai', 'an', 'ang', 'ao', 'b', 'c', 'ch', 'd', 'e', 'ei', 'en', 'eng', 'er', 'f', 'g', 'h', 'i', 'ia', 'ian', 'iang', 'iao', 'ie', 'in', 'ing', 'iong', 'iou', 'j', 'k', 'l', 'm', 'n', 'o', 'ong', 'ou', 'p', 'q', 'r', 's', 'sh', 't', 'u', 'ua', 'uai', 'uan', 'uang', 'uei', 'uen', 'uo', 'v', 'van', 've', 'vn', 'x', 'z', 'zh', '|']
100%|██████████████████████████████| 27/27 [00:13<00:00, 2.01it/s]
| valid total duration: 330.677s
100%|██████████████████████████████| 27/27 [00:13<00:00, 2.04it/s]
| test total duration: 330.677s
100%|██████████████████████████████| 1624/1624 [11:55<00:00, 2.27it/s]
| train total duration: 20878.560s


I guess the output of Step 1 and the input of Step 2 are not chained perfectly. Any help or hints would be welcome. Thank you in advance.

Multi-speaker singing

By enabling the "with_spk_embed" option and then retraining the model, can it support multi-speaker singing?

How to run inference on other <midi, text> files using the opencpop phone_set.json?

Hi, thanks for the great work!
I want to run inference on my own <midi, text> files. I generated the corresponding meta.json and tried to binarize it, but the binarizer can only generate phone_set.json from the input files, which is incompatible with the pretrained checkpoint (since it was pretrained on opencpop with the opencpop phone_set.json). A workaround is to preprocess the customized test set together with the original opencpop meta.json to obtain the full phone_set.json (i.e., the same phoneme dict), but that is quite inconvenient.
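
One workaround sketch (hedged; the file location is an assumption based on the paths mentioned in this thread): load the phone_set.json that ships with the pretrained checkpoint's binary data and build the phoneme-to-id map from it, instead of letting the binarizer rebuild the dict from your own test set:

    import json

    # Path is an assumption taken from the error messages in this thread.
    with open('data/binary/opencpop-midi-dp/phone_set.json') as f:
        phone_set = json.load(f)   # the dict the checkpoint was trained with

    ph2id = {ph: i for i, ph in enumerate(phone_set)}
    print(len(phone_set), 'phones;', 'id of "a" =', ph2id.get('a'))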

How to convert the notes in 'opencpop/segments/transcriptions'?

Thank you very much for your great work. I notice that in the original transcription file of opencpop, the notes look like this: G#4/Ab4 G#4/Ab4 G#4/Ab4 G#4/Ab4 F#4/Gb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 E4 E4 E4 E4 D#4/Eb4 D#4/Eb4 D#4/Eb4 D#4/Eb4 rest E4 E4 E4 E4 rest. Could you please tell me how you convert the notes into numbers like "68 68 68 68 66 66 66..." as mentioned in https://github.com/MoonInTheRiver/DiffSinger/blob/master/usr/configs/midi/cascade/opencs/eg_opencpop.png? (I'm new to music and MIDI.)
Thanks :)
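
For readers with the same question: those numbers look like standard MIDI note numbers (A4 = 69, so G#4 = 68, F#4 = 66, E4 = 64, matching the quoted sequence). A small conversion sketch (my own illustration, not necessarily the repo's code; mapping 'rest' to 0 is an assumption):

    # Convert note names like 'G#4/Ab4' into MIDI numbers.
    SEMITONE = {'C': 0, 'C#': 1, 'Db': 1, 'D': 2, 'D#': 3, 'Eb': 3, 'E': 4,
                'F': 5, 'F#': 6, 'Gb': 6, 'G': 7, 'G#': 8, 'Ab': 8, 'A': 9,
                'A#': 10, 'Bb': 10, 'B': 11}

    def note_to_midi(name):
        if name == 'rest':
            return 0                      # assumption: rests encoded as 0
        name = name.split('/')[0]         # 'G#4/Ab4' -> 'G#4' (same pitch, two spellings)
        pitch, octave = name[:-1], int(name[-1])
        return 12 * (octave + 1) + SEMITONE[pitch]

    print([note_to_midi(n) for n in ['G#4/Ab4', 'F#4/Gb4', 'E4', 'D#4/Eb4', 'rest']])
    # -> [68, 66, 64, 63, 0]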

CUDA: out of memory

Hi, team:

I am trying to train with the command CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds100_adj_rel.yaml --exp_name $MY_DS_EXP_NAME --reset, but it fails with RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 8.00 GiB total capacity; 6.08 GiB already allocated; 0 bytes free; 6.81 GiB reserved in total by PyTorch).

It's a 3060 GPU with 8 GB of memory. What can I configure to make it work?

How to train on another language?

I think this DiffSinger model is based on Chinese.
Please give me advice on how to train it on another language.
Thank you for sharing!

Determining the durations of segmentation operators (|)

The MFA outputs don't really provide the durations/frames between words, and I found that this project uses the duration of the SEG token (word separator). It is often 0 and other times not, so I wanted to ask how you obtain it in the preprocessing step?

Some details about re-training MFA?

Thank you very much for providing PopCS for free!
When reading your paper, I noticed you re-trained a Montreal Forced Aligner model to build the PopCS dataset. Could you please provide some training details, such as: 1) what data was used as the training data, and 2) the amount of training data?

Error: different shape of model parameters when generating an example using DiffSinger

05/14 04:33:48 AM gpu available: True, used: True
| model Arch:  OfflineGaussianDiffusion
.... 


Traceback (most recent call last):
  File "tasks/run.py", line 15, in <module>
    run_task()
  File "tasks/run.py", line 10, in run_task
    task_cls.start()
  File "/mnt/DiffSinger/tasks/base_task.py", line 258, in start
    trainer.test(task)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 586, in test
    self.fit(model)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 541, in run_pretrain_routine
    self.restore_weights(model)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 617, in restore_weights
    self.restore_state_if_checkpoint_exists(model)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 655, in restore_state_if_checkpoint_exists
    self.restore(last_ckpt_path, self.on_gpu)
  File "/mnt/DiffSinger/utils/pl_utils.py", line 668, in restore
    model.load_state_dict(checkpoint['state_dict'], strict=False)
  File "/root/miniconda3/envs/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DiffSingerOfflineTask:
        size mismatch for model.fs2.encoder_embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).
        size mismatch for model.fs2.encoder.embed_tokens.weight: copying a param with shape torch.Size([62, 256]) from checkpoint, the shape in current model is torch.Size([57, 256]).

I got this runtime error when I tried

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset --infer

exactly as in the README.
I changed the versions of scipy and torch slightly; I don't know if that is the problem.

Basically I just followed the instructions. I have put

the pretrained model of [DiffSinger](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_ds_beta6_offline_pmf0_1230.zip);
the pretrained model of [FFT-Singer](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/popcs_fs2_pmf0_1230.zip), which is needed for the shallow diffusion mechanism in DiffSinger;

in checkpoints,
and put the data example in /DiffSinger/data/processed/popcs/popcs-说散就散, then preprocessed it, producing mnt/DiffSinger/data/binary/popcs-pmf0.

I suspect I got something wrong, or did the author unintentionally release an ill-shaped pretrained model?

about 'An Easier Trick for Boundary Prediction'

In your paper, we can get the predicted boundary as follows:
(screenshot of the boundary-prediction equation from the paper)

then I implemented 'An Easier Trick for Boundary Prediction' in my repo following the trick:
https://github.com/keonlee9420/DiffSinger/blob/f849f8def5abb38ad272a384e8bec838ea1957a4/boundary_predictor.py#L14-L45

and there are some helper functions for that (please focus on the expected_kld_t and expected_kld_T functions):
https://github.com/keonlee9420/DiffSinger/blob/f849f8def5abb38ad272a384e8bec838ea1957a4/model/diffusion.py#L351-L389

But as I noted in my README.md (item 2 of the note section), the predicted boundary for LJSpeech is 100, which is the same as the total timesteps in the Naive version.

So I'd like to ask you to briefly check my implementation. Could you please take a look at it and let me know if I missed something? Why do you think my boundary predictor shows an unexpected K_step?

FYI, here is the sample output log of running boundary_predictor.py:

==================================== Prediction Configuration ====================================
 ---> Total Batch Size: 48
 ---> Path of ckpt: ./output/ckpt/LJSpeech_shallow_el_4
================================================================================================
100%|██████████████████████████████| 11/11 [00:08<00:00,  1.34it/s]
[tensor(6959.2134, device='cuda:0'), tensor(933.3702, device='cuda:0'), tensor(403.9860, device='cuda:0'), tensor(249.2317, device='cuda:0'), tensor(183.4001, device='cuda:0'), tensor(149.2621, device='cuda:0'), tensor(129.2204, device='cuda:0'), tensor(116.2622, device='cuda:0'), tensor(107.4923, device='cuda:0'), tensor(101.0867, device='cuda:0'), tensor(96.2093, device='cuda:0'), tensor(92.4524, device='cuda:0'), tensor(89.3728, device='cuda:0'), tensor(86.7645, device='cuda:0'), tensor(84.4990, device='cuda:0'), tensor(82.5240, device='cuda:0'), tensor(80.7848, device='cuda:0'), tensor(79.1111, device='cuda:0'), tensor(77.5320, device='cuda:0'), tensor(76.0396, device='cuda:0'), tensor(74.6199, device='cuda:0'), tensor(73.2726, device='cuda:0'), tensor(71.9328, device='cuda:0'), tensor(70.6272, device='cuda:0'), tensor(69.2854, device='cuda:0'), tensor(68.0120, device='cuda:0'), tensor(66.7351, device='cuda:0'), tensor(65.4260, device='cuda:0'), tensor(64.1837, device='cuda:0'), tensor(62.9117, device='cuda:0'), tensor(61.6452, device='cuda:0'), tensor(60.3592, device='cuda:0'), tensor(59.0823, device='cuda:0'), tensor(57.8210, device='cuda:0'), tensor(56.5481, device='cuda:0'), tensor(55.2716, device='cuda:0'), tensor(54.0141, device='cuda:0'), tensor(52.7686, device='cuda:0'), tensor(51.4833, device='cuda:0'), tensor(50.2068, device='cuda:0'), tensor(48.9261, device='cuda:0'), tensor(47.6881, device='cuda:0'), tensor(46.4407, device='cuda:0'), tensor(45.2071, device='cuda:0'), tensor(43.9496, device='cuda:0'), tensor(42.7181, device='cuda:0'), tensor(41.5266, device='cuda:0'), tensor(40.2994, device='cuda:0'), tensor(39.1266, device='cuda:0'), tensor(37.9398, device='cuda:0'), tensor(36.7822, device='cuda:0'), tensor(35.6130, device='cuda:0'), tensor(34.5006, device='cuda:0'), tensor(33.3484, device='cuda:0'), tensor(32.2580, device='cuda:0'), tensor(31.1593, device='cuda:0'), tensor(30.1051, device='cuda:0'), tensor(29.0614, device='cuda:0'), tensor(28.0244, device='cuda:0'), tensor(27.0115, device='cuda:0'), tensor(26.0248, device='cuda:0'), tensor(25.0589, device='cuda:0'), tensor(24.1051, device='cuda:0'), tensor(23.1736, device='cuda:0'), tensor(22.2743, device='cuda:0'), tensor(21.3856, device='cuda:0'), tensor(20.5282, device='cuda:0'), tensor(19.6825, device='cuda:0'), tensor(18.8733, device='cuda:0'), tensor(18.0839, device='cuda:0'), tensor(17.3134, device='cuda:0'), tensor(16.5815, device='cuda:0'), tensor(15.8417, device='cuda:0'), tensor(15.1426, device='cuda:0'), tensor(14.4522, device='cuda:0'), tensor(13.8025, device='cuda:0'), tensor(13.1645, device='cuda:0'), tensor(12.5432, device='cuda:0'), tensor(11.9491, device='cuda:0'), tensor(11.3789, device='cuda:0'), tensor(10.8328, device='cuda:0'), tensor(10.2960, device='cuda:0'), tensor(9.7815, device='cuda:0'), tensor(9.2841, device='cuda:0'), tensor(8.8136, device='cuda:0'), tensor(8.3660, device='cuda:0'), tensor(7.9211, device='cuda:0'), tensor(7.5027, device='cuda:0'), tensor(7.1040, device='cuda:0'), tensor(6.7245, device='cuda:0'), tensor(6.3511, device='cuda:0'), tensor(6.0048, device='cuda:0'), tensor(5.6679, device='cuda:0'), tensor(5.3475, device='cuda:0'), tensor(5.0427, device='cuda:0'), tensor(4.7507, device='cuda:0'), tensor(4.4784, device='cuda:0'), tensor(4.2143, device='cuda:0'), tensor(3.9639, device='cuda:0'), tensor(3.7258, device='cuda:0')]
tensor(0.2382, device='cuda:0')

Predicted Boundary K is 100

Thanks in advance!
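
Not the authors, but for readers following along: the log above is consistent with a rule of the form "K is the earliest step whose expected KLD has dropped to the terminal reference value". Since the smallest per-step value (3.73) never reaches kld_T (0.2382), such a predictor would fall through to K = 100. A hedged sketch of that reading, not a verified reimplementation of the paper's trick:

    def predict_boundary(klds, kld_T):
        # klds[t-1]: expected KLD at diffusion step t (the descending list in the log);
        # kld_T: the terminal reference value (0.2382 above).
        for t, kld in enumerate(klds, start=1):
            if kld <= kld_T:
                return t          # earliest step already "close enough" to the prior
        return len(klds)          # never crosses, so K collapses to T, as in this log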

Inference with unseen songs

Hi. Since DiffSinger(PopCS) needs ground-truth f0 information at inference, is it possible to synthesize an unseen song (with phoneme labels, phoneme durations and notes provided) using the DiffSinger(PopCS) model?

DiffSinger infer problem

I want to test the Opencpop pretrained model on an unseen song, but I don't know how to generate the wav file.

  1. What data should I prepare for the model?
  2. How do I do it? I saw test_step in FastSpeech2Task, but it seems to be for the TTS task. Do I need to override test_step in DiffSingerMIDITask? Or is there another way to solve this, without packing data into a dataloader: just load the model and run inference?

Proposing a fix for inconsistent f0 length caused by different versions of parselmouth

Hi,
I noticed that different versions of parselmouth result in different lengths of the computed f0. This is also mentioned in a comment in your code.

# Attention: we find that new version of some libraries could cause ``rpad'' to be a negetive value...

I think the reason for this is that, in parselmouth v0.4.0, the default value of the very_accurate parameter of Sound.to_pitch_ac was changed to False, matching Praat's default (see https://github.com/YannickJadoul/Parselmouth/releases). Though I haven't tested rigorously, I think adding very_accurate=True in get_pitch would fix the inconsistent-length issue and make your code more compatible.
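
For concreteness, the proposed change would look roughly like this (a sketch assuming get_pitch calls Sound.to_pitch_ac; all parameter values other than very_accurate are illustrative):

    import parselmouth

    def extract_f0(wav, sr, hop_size, f0_min=80, f0_max=750):
        # very_accurate=True pins the pre-v0.4.0 parselmouth behaviour, keeping
        # the number of f0 frames consistent across library versions.
        return parselmouth.Sound(wav, sampling_frequency=sr).to_pitch_ac(
            time_step=hop_size / sr,
            voicing_threshold=0.6,
            pitch_floor=f0_min,
            pitch_ceiling=f0_max,
            very_accurate=True,
        ).selected_array['frequency']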

question about fs2 infer

Hi, thank you very much for your valuable SVS corpus and code.
(The body of this issue quotes the report from "Possibly missing file between Step 1 and 2 for SVS?" above, ending in FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/popcs_fs2_pmf0_1230/P_mels_npy/popcs-说散就散-0000.npy'.)

Yes, you are right, there is a problem. You need to run "CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer" beforehand to produce the "P_mels_npy" files. I have fixed the readme file. Thanks for your report!

I have a question about this: if we are training from scratch and don't have any saved models for inference, how are the P_mels_npy files (predicted mels, I guess) generated?

Originally posted by @Cescfangs in #11 (comment)

AttributeError: 'HifiGAN' object has no attribute 'model'

I'm following the instructions to train a model for DiffSinger, but there seem to be some issues with vocoders/hifigan.py:

"# first run fs2 infer;"

$ CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer 
...
| model Trainable Parameters: 24.256M
  0%|                                                                                                                                                       | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):                                                                                                                          | 0/1 [00:00<?, ?it/s]
  File "tasks/run.py", line 15, in <module>
    run_task()
  File "tasks/run.py", line 10, in run_task
    task_cls.start()
  File "/.../DiffSinger/tasks/base_task.py", line 258, in start
    trainer.test(task)
  File "/.../DiffSinger/utils/pl_utils.py", line 586, in test
    self.fit(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 545, in run_pretrain_routine
    self.run_evaluation(test=True)
  File "/.../DiffSinger/utils/pl_utils.py", line 1245, in run_evaluation
    eval_results = self.evaluate(self.model,
  File "/.../DiffSinger/utils/pl_utils.py", line 1185, in evaluate
    output = self.evaluation_forward(model,
  File "/.../DiffSinger/utils/pl_utils.py", line 1307, in evaluation_forward
    output = model.test_step(*args)
  File "/.../DiffSinger/tasks/tts/fs2.py", line 363, in test_step
    return self.after_infer(sample)
  File "/.../DiffSinger/tasks/tts/fs2.py", line 410, in after_infer
    wav_pred = self.vocoder.spec2wav(mel_pred, f0=f0_pred)
  File "/.../DiffSinger/vocoders/hifigan.py", line 56, in spec2wav
    device = self.device
AttributeError: 'HifiGAN' object has no attribute 'device'
Testing:   0%|          | 0/14 [00:01<?, ?batch/s]

So I've added self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") to the __init__ of HifiGAN, but then I get this:

| model Trainable Parameters: 24.256M
  0%|                                                                                                                                                       | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):                                                                                                                          | 0/1 [00:00<?, ?it/s]
  File "tasks/run.py", line 15, in <module>
    run_task()
  File "tasks/run.py", line 10, in run_task
    task_cls.start()
  File "/.../DiffSinger/tasks/base_task.py", line 258, in start
    trainer.test(task)
  File "/.../DiffSinger/utils/pl_utils.py", line 586, in test
    self.fit(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 489, in fit
    self.run_pretrain_routine(model)
  File "/.../DiffSinger/utils/pl_utils.py", line 545, in run_pretrain_routine
    self.run_evaluation(test=True)
  File "/.../DiffSinger/utils/pl_utils.py", line 1245, in run_evaluation
    eval_results = self.evaluate(self.model,
  File "/.../DiffSinger/utils/pl_utils.py", line 1185, in evaluate
    output = self.evaluation_forward(model,
  File "/.../DiffSinger/utils/pl_utils.py", line 1307, in evaluation_forward
    output = model.test_step(*args)
  File "/.../DiffSinger/tasks/tts/fs2.py", line 363, in test_step
    return self.after_infer(sample)
  File "/.../DiffSinger/tasks/tts/fs2.py", line 410, in after_infer
    wav_pred = self.vocoder.spec2wav(mel_pred, f0=f0_pred)
  File "/.../DiffSinger/vocoders/hifigan.py", line 64, in spec2wav
    y = self.model(c, f0).view(-1)
AttributeError: 'HifiGAN' object has no attribute 'model'
Testing:   0%|          | 0/14 [00:01<?, ?batch/s]   
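
A hedged reading of the second traceback, inferred from the symptoms rather than the source: self.model on HifiGAN is presumably only set once the vocoder checkpoint is found and loaded, so adding the missing device attribute merely uncovers the real problem, namely that the vocoder checkpoint was never loaded. Worth checking before patching further:

    import os

    # vocoder_ckpt as it appears in the hparams dumps earlier on this page.
    ckpt_dir = 'checkpoints/0109_hifigan_bigpopcs_hop128'
    print(os.path.isdir(ckpt_dir))
    print(os.listdir(ckpt_dir) if os.path.isdir(ckpt_dir) else 'missing')
    # If this prints False/'missing', download the 0109_hifigan_bigpopcs_hop128
    # vocoder release and unzip it into checkpoints/ before running inference.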

about some missing parts

Hi, thanks for your contribution to DiffSinger! And also thanks for mentioning my implementation; I only realized it yesterday :)

With your detailed documentation in the README and the paper, I can reproduce the training & inference procedure and the results with this repo. But in doing so, I found some parts missing for full training of the shallow version: I think the current code only supports a forced K (which is 71) with the pre-trained FastSpeech2 (especially its decoder). If I understood correctly, we need the boundary-prediction process and FastSpeech2 pre-training before training DiffSpeech in the shallow mode. Maybe I missed it somewhere in the repo, but if it is not yet pushed, I wonder whether you plan to provide that part soon.

Thanks in advance!

Best,
keon
