
ddsp-svc's Introduction

Language: English 简体中文 한국어(outdated)

DDSP-SVC

(6.0 - Experimental) New rectified-flow based model

(1) Preprocessing:

python preprocess.py -c configs/reflow.yaml

(2) Training:

python train_reflow.py -c configs/reflow.yaml

(3) Non-real-time inference:

python main_reflow.py -i <input.wav> -m <model_ckpt.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -step <infer_step> -method <method> -ts <t_start>

'infer_step' is the number of sampling steps for the rectified-flow ODE, 'method' is 'euler' or 'rk4', and 't_start' is the start time point of the ODE. It must be greater than or equal to the t_start in the configuration file; keeping the two equal is recommended (the default is 0.7).
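For example, a hypothetical invocation (all paths and the checkpoint name are placeholders) converting an input with no key change, speaker 1, 50 sampling steps, the 'euler' method and the default t_start of 0.7 could look like this:

# illustrative only: adjust paths, step count and method to your setup
python main_reflow.py -i input.wav -m exp/reflow-test/model_100000.pt -o output.wav -k 0 -id 1 -step 50 -method euler -ts 0.7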

(5.0 - Update) Improved DDSP cascade diffusion model

Installing dependencies, preparing data, and configuring the pre-trained encoder (HuBERT or ContentVec), pitch extractor (RMVPE) and vocoder (NSF-HiFiGAN) are the same as for training a pure DDSP model (see the sections below).

We provide a pre-trained model on the release page.

Move the model_0.pt to the model export folder specified by the 'expdir' parameter in diffusion-fast.yaml, and the program will automatically load the pre-trained model in that folder.
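For example, if 'expdir' in diffusion-fast.yaml is exp/diffusion-fast-test (the folder name here is only an illustration), the expected layout before starting training would be:

exp/diffusion-fast-test/model_0.pt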

(1) Preprocessing:

python preprocess.py -c configs/diffusion-fast.yaml

(2) Train the cascade model (only one model needs to be trained):

python train_diff.py -c configs/diffusion-fast.yaml

(3) Non-real-time inference:

python main_diff.py -i <input.wav> -diff <diff_ckpt.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -speedup <speedup> -method <method> -kstep <kstep>

The 5.0 version model has a built-in DDSP model, so specifying an external DDSP model with -ddsp is unnecessary. The other options have the same meaning as for the 3.0 version model, but 'kstep' must be less than or equal to k_step_max in the configuration file; keeping them equal is recommended (the default is 100).
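As an illustration (the checkpoint path is a placeholder), a typical 5.0 inference run with a speedup of 10, the 'dpm-solver' method and 'kstep' equal to the default k_step_max of 100 could look like this:

python main_diff.py -i input.wav -diff exp/diffusion-fast-test/model_100000.pt -o output.wav -k 0 -id 1 -speedup 10 -method dpm-solver -kstep 100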

(4) Real-time GUI:

python gui_diff.py

Note: You need to load the version 5.0 model on the right-hand side of the GUI.

(4.0 - Update) New DDSP cascade diffusion model

Installing dependencies, preparing data, and configuring the pre-trained encoder (HuBERT or ContentVec), pitch extractor (RMVPE) and vocoder (NSF-HiFiGAN) are the same as for training a pure DDSP model (see the sections below).

We provide a pre-trained model here: https://huggingface.co/datasets/ms903/DDSP-SVC-4.0/resolve/main/pre-trained-model/model_0.pt (using 'contentvec768l12' encoder)

Move the model_0.pt to the model export folder specified by the 'expdir' parameter in diffusion-new.yaml, and the program will automatically load the pre-trained model in that folder.

(1) Preprocessing:

python preprocess.py -c configs/diffusion-new.yaml

(2) Train the cascade model (only one model needs to be trained):

python train_diff.py -c configs/diffusion-new.yaml

Note: There is a temporary problem with fp16 training, but fp32 and bf16 are working normally.

(3) Non-real-time inference:

python main_diff.py -i <input.wav> -diff <diff_ckpt.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -speedup <speedup> -method <method> -kstep <kstep>

The 4.0 version model has a built-in DDSP model, so specifying an external DDSP model with -ddsp is unnecessary. The other options have the same meaning as for the 3.0 version model, but 'kstep' must be less than or equal to k_step_max in the configuration file; keeping them equal is recommended (the default is 100).
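As an illustration (the checkpoint path is a placeholder), a typical 4.0 inference run with the 'unipc' method could look like this:

python main_diff.py -i input.wav -diff exp/diffusion-new-test/model_100000.pt -o output.wav -k 0 -id 1 -speedup 10 -method unipc -kstep 100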

(4) Real-time GUI:

python gui_diff.py

Note: You need to load the version 4.0 model on the right-hand side of the GUI.

(3.0 - Update) Shallow diffusion model (DDSP + Diff-SVC refactor version)

Diagram

Installing dependencies, preparing data, and configuring the pre-trained encoder (HuBERT or ContentVec), pitch extractor (RMVPE) and vocoder (NSF-HiFiGAN) are the same as for training a pure DDSP model (see sections 1 to 3 below).

Because the diffusion model is more difficult to train, we provide some pre-trained models here:

https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/blob/main/hubertsoft_fix_pitch_add_vctk_500k/model_0.pt (using 'hubertsoft' encoder)

https://huggingface.co/datasets/ms903/Diff-SVC-refactor-pre-trained-model/blob/main/fix_pitch_add_vctk_600k/model_0.pt (using 'contentvec768l12' encoder)

Move the model_0.pt to the model export folder specified by the 'expdir' parameter in diffusion.yaml, and the program will automatically load the pre-trained model in that folder.

(1) Preprocessing:

python preprocess.py -c configs/diffusion.yaml

The output of this preprocessing can also be used to train the DDSP model, so there is no need to preprocess twice, but you need to ensure that the parameters under the 'data' tag are consistent across the yaml files.

(2) Train a diffusion model:

python train_diff.py -c configs/diffusion.yaml

(3) Train a DDSP model:

python train.py -c configs/combsub.yaml

As mentioned above, re-preprocessing is not required, but please check whether the parameters of combsub.yaml and diffusion.yaml match. The number of speakers 'n_spk' can be inconsistent, but try to use the same id to represent the same speaker (this makes inference easier).
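To illustrate what needs to stay consistent, the relevant keys under the 'data' tag look roughly like the following (a sketch only; the values shown are illustrative and must match whatever you actually used for preprocessing):

data:
  sampling_rate: 44100
  block_size: 512 # equal to hop_length
  encoder: 'contentvec768l12'
  encoder_sample_rate: 16000
  encoder_hop_size: 320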

(4) Non-real-time inference:

python main_diff.py -i <input.wav> -ddsp <ddsp_ckpt.pt> -diff <diff_ckpt.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -diffid <diffusion_speaker_id> -speedup <speedup> -method <method> -kstep <kstep>

'speedup' is the acceleration factor, 'method' is 'ddim', 'pndm', 'dpm-solver' or 'unipc', 'kstep' is the number of shallow diffusion steps, 'diffid' is the speaker id of the diffusion model, and the other parameters have the same meaning as in main.py.

A reasonable 'kstep' is about 100~300. There may be a perceptible loss of sound quality when 'speedup' exceeds 20.

If the same id has been used to represent the same speaker during training, '-diffid' can be omitted; otherwise, the '-diffid' option needs to be specified.

If '-ddsp' is omitted, the pure diffusion model is used; in this case, shallow diffusion is performed on the mel spectrogram of the input source, and if '-kstep' is also omitted, full-depth Gaussian diffusion is performed.

The program automatically checks whether the parameters of the DDSP model and the diffusion model match (sampling rate, hop size and encoder); if they do not match, it skips loading the DDSP model and enters Gaussian diffusion mode.
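For example (all checkpoint paths below are placeholders), a shallow-diffusion run with a DDSP model, and a pure-diffusion variant without '-ddsp', might look like this:

# shallow diffusion guided by a DDSP model (placeholder paths)
python main_diff.py -i input.wav -ddsp exp/combsub-test/model_100000.pt -diff exp/diffusion-test/model_100000.pt -o output.wav -k 0 -id 1 -speedup 10 -method dpm-solver -kstep 100
# pure diffusion (no -ddsp): shallow diffusion uses the mel of the input source
python main_diff.py -i input.wav -diff exp/diffusion-test/model_100000.pt -o output.wav -k 0 -id 1 -speedup 10 -method dpm-solver -kstep 200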

(5) Real-time GUI:

python gui_diff.py

0. Introduction

DDSP-SVC is a new open source singing voice conversion project dedicated to the development of free AI voice changer software that can be popularized on personal computers.

Compared with the well-known SO-VITS-SVC, its training and synthesis place much lower demands on computer hardware, and the training time can be shorter by orders of magnitude, approaching the training speed of RVC.

In addition, when performing real-time voice changing, the hardware resource consumption of this project is significantly lower than that of SO-VITS-SVC, but probably slightly higher than the latest version of RVC.

Although the raw synthesis quality of DDSP is not ideal (the raw output can be heard in TensorBoard during training), after enhancing the sound quality with a pre-trained vocoder-based enhancer (old version) or with a shallow diffusion model (new version), for some datasets it can achieve synthesis quality no lower than that of SO-VITS-SVC and RVC.

Old-version models are still compatible; the following chapters are the instructions for the old version. Some operations in the new version are the same; see the previous chapters.

Disclaimer: Please make sure to only train DDSP-SVC models with legally obtained authorized data, and do not use these models and any audio they synthesize for illegal purposes. The author of this repository is not responsible for any infringement, fraud and other illegal acts caused by the use of these model checkpoints and audio.

Update log: I am too lazy to translate, please see the Chinese version readme.

1. Installing the dependencies

We recommend first installing PyTorch from the official website, then run:

pip install -r requirements.txt

NOTE: I have only tested the code with python 3.8 (windows) + torch 1.9.1 + torchaudio 0.6.0; dependencies that are too new or too old may not work.

UPDATE: python 3.8 (windows) + cuda 11.8 + torch 2.0.0 + torchaudio 2.0.1 works, and training is faster.
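For example, an installation matching the tested CUDA 11.8 setup above could look like the following (the exact index URL and versions for your platform should be taken from the official PyTorch website; the lines below are only an illustration):

pip install torch==2.0.0 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt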

2. Configuring the pretrained model

  • Feature Encoder (choose only one):

(1) Download the pre-trained ContentVec encoder and put it under pretrain/contentvec folder.

(2) Download the pre-trained HubertSoft encoder, put it under the pretrain/hubert folder, and modify the configuration file accordingly.

  • Vocoder or enhancer:

Download and unzip the pre-trained NSF-HiFiGAN vocoder

or use the https://github.com/openvpi/SingingVocoders project to fine-tune the vocoder for higher sound quality.

Then rename the checkpoint file and place it at the location specified by the 'vocoder.ckpt' parameter in the configuration file. The default value is pretrain/nsf_hifigan/model.

The 'config.json' of the vocoder needs to be in the same directory, for example, pretrain/nsf_hifigan/config.json.
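In other words, with the default settings the vocoder files are expected to be laid out like this:

pretrain/nsf_hifigan/model
pretrain/nsf_hifigan/config.json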

  • Pitch extractor:

Download the pre-trained RMVPE extractor and unzip it into pretrain/ folder.

3. Preprocessing

Put all the training data (.wav format audio clips) in the following directory: data/train/audio. Put all the validation data (.wav format audio clips) in the following directory: data/val/audio. You can also run

python draw.py

to help you select validation data (you can adjust the parameters in draw.py to change the number of extracted files, among other settings).

Then run

python preprocess.py -c configs/combsub.yaml

for a combtooth subtractive synthesiser model (recommended), or run

python preprocess.py -c configs/sins.yaml

for a sinusoids additive synthesiser model.

For training the diffusion model, see section 3.0, 4.0 or 5.0 above.

You can modify the configuration file configs/<model_name>.yaml before preprocessing. The default configuration is suitable for training a 44.1 kHz high-sampling-rate synthesiser on a GTX-1660 graphics card.

NOTE 1: Please keep the sampling rate of all audio clips consistent with the sampling rate in the yaml configuration file! If they are inconsistent, the program can still run, but the resampling during training will be very slow.

NOTE 2: About 1000 audio clips are recommended for the training dataset. Especially long clips can be cut into short segments, which speeds up training, but no clip should be shorter than 2 seconds. If there are too many audio clips, you need a large amount of RAM, or you can set the 'cache_all_data' option to false in the configuration file.

NOTE 3: About 10 audio clips are recommended for the validation dataset; please don't use too many, or validation will be very slow.

NOTE 4: If your dataset is not very high quality, set 'f0_extractor' to 'rmvpe' in the config file.
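For reference, the relevant lines in the configuration file look roughly like this (a sketch only; the surrounding keys in your config may differ):

data:
  f0_extractor: 'rmvpe' # more robust pitch extraction for lower-quality datasets
train:
  cache_all_data: false # set to false if the dataset is too large to cache in memory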

NOTE 5: Multi-speaker training is now supported. The 'n_spk' parameter in the configuration file controls whether it is a multi-speaker model. If you want to train a multi-speaker model, audio folders need to be named with positive integers no greater than 'n_spk' to represent speaker ids; the directory structure is shown below:

# training dataset
# the 1st speaker
data/train/audio/1/aaa.wav
data/train/audio/1/bbb.wav
...
# the 2nd speaker
data/train/audio/2/ccc.wav
data/train/audio/2/ddd.wav
...

# validation dataset
# the 1st speaker
data/val/audio/1/eee.wav
data/val/audio/1/fff.wav
...
# the 2nd speaker
data/val/audio/2/ggg.wav
data/val/audio/2/hhh.wav
...

If 'n_spk' = 1, the directory structure of the single-speaker model is still supported, as shown below:

# training dataset
data/train/audio/aaa.wav
data/train/audio/bbb.wav
...
# validation dataset
data/val/audio/ccc.wav
data/val/audio/ddd.wav
...

4. Training

# train a combsub model as an example
python train.py -c configs/combsub.yaml

The command line for training other models is similar.

You can safely interrupt training; running the same command line again will resume it.

You can also fine-tune the model: interrupt training first, then re-preprocess with the new dataset or change the training parameters (batch size, learning rate, etc.), and run the same command line again.
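For instance, to resume or fine-tune the combsub model from the example above, simply rerun:

python train.py -c configs/combsub.yaml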

5. Visualization

# check the training status using tensorboard
tensorboard --logdir=exp

Test audio samples will be visible in TensorBoard after the first validation.

NOTE: The test audio samples in TensorBoard are the raw outputs of your DDSP-SVC model, not enhanced by an enhancer. If you want to test the synthesis quality after using the enhancer (which may be higher), please use the method described in the following chapter.

6. Non-real-time VC

(Recommended) Enhance the output using the pre-trained vocoder-based enhancer:

# high audio quality in the normal vocal range if enhancer_adaptive_key = 0 (default)
# set enhancer_adaptive_key > 0 to adapt the enhancer to a higher vocal range
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -eak <enhancer_adaptive_key (semitones)>

Raw output of DDSP:

# fast, but relatively low audio quality (like you hear in tensorboard)
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -e false

For other options about the f0 extractor and response threshold, see:

python main.py -h

(UPDATE) Speaker mixing is now supported. You can use the "-mix" option to design your own vocal timbre; below is an example:

# Mix the timbre of 1st and 2nd speaker in a 0.5 to 0.5 ratio
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -mix "{1:0.5, 2:0.5}" -eak 0

7. Real-time VC

Start a simple GUI with the following command:

python gui.py

The front-end uses technologies such as sliding windows, cross-fading, SOLA-based splicing and contextual semantic reference, which can achieve sound quality close to non-real-time synthesis with low latency and low resource usage.

Update: A splicing algorithm based on a phase vocoder has been added, but in most cases the SOLA algorithm already has high enough splicing quality, so it is turned off by default. If you are pursuing extremely low-latency real-time sound quality, you can consider turning it on and tuning the parameters carefully, and there is a possibility that the sound quality will be higher. However, extensive testing has found that if the cross-fade time is longer than 0.1 seconds, the phase vocoder will cause a significant degradation in sound quality.

8. Acknowledgement

ddsp-svc's People

Contributors

aiczk, cardroid, cnchtu, ddpn08, dillfrescott, entropyriser, fatinghenji, huanlinoto, kakaruhayate, l4ph, magic-akari, ms903x1, narusemioshirakana, parkilwoo, petyin, therealkamisama, tylorshine, ylzz1997, yqzhishen, yxlllc


ddsp-svc's Issues

Questions about inference

When I run inference with the model I have trained, it seems that a lot of the original source voice remains. Can I increase the ratio of the trained voice? (Same function as add_noise_step in the diff-svc model)

Problem encountered during the training stage

Traceback (most recent call last):
File "train_diff.py", line 86, in
train(args, initial_global_step, model, optimizer, scheduler, vocoder, loader_train, loader_valid)
File "/home/csmxj/DDSP-SVC/diffusion/solver_new.py", line 188, in train
test_ddsp_loss, test_diff_loss = test(args, model, vocoder, loader_test, saver)
File "/home/csmxj/DDSP-SVC/diffusion/solver_new.py", line 25, in test
for bidx, data in enumerate(loader_test):
File "/root/.miniconda/envs/ddsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 633, in next
data = self._next_data()
File "/root/.miniconda/envs/ddsp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/root/.miniconda/envs/ddsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/.miniconda/envs/ddsp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 51, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/csmxj/DDSP-SVC/diffusion/data_loaders.py", line 203, in getitem
return self.getitem( (file_idx + 1) % len(self.paths))
File "/home/csmxj/DDSP-SVC/diffusion/data_loaders.py", line 203, in getitem
return self.getitem( (file_idx + 1) % len(self.paths))
File "/home/csmxj/DDSP-SVC/diffusion/data_loaders.py", line 203, in getitem
return self.getitem( (file_idx + 1) % len(self.paths))
[Previous line repeated 989 more times]
File "/home/csmxj/DDSP-SVC/diffusion/data_loaders.py", line 202, in getitem
if data_buffer['duration'] < (self.waveform_sec + 0.1):
RecursionError: maximum recursion depth exceeded in comparison

After training for one interval, a validation is performed, and that is where it runs into the RecursionError.

Error during inference

Command used: python main.py -i G:\DDSP-SVC\samples\source.wav -m G:\DDSP-SVC\exp\combsub-test\model_68000.pt -o G:\DDSP-SVC\test.wav --enhancer_adaptive_key 0 -id 1 -k 0 -e true -pe crepe
Log:

 [DDSP Model] Combtooth Subtractive Synthesiser
 [Loading] G:\DDSP-SVC\exp\combsub-test\model_68000.pt
Pitch extractor type: crepe
Extracting the pitch curve of the input audio...
Extracting the volume envelope of the input audio...
 [Encoder Model] Content Vec
 [Loading] pretrain/hubert/checkpoint_best_legacy_500.pt
2023-04-01 22:15:12 | INFO | fairseq.tasks.hubert_pretraining | current directory is G:\DDSP-SVC
2023-04-01 22:15:12 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-04-01 22:15:12 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
Enhancer type: nsf-hifigan
| Load HifiGAN:  pretrain/nsf_hifigan/model
Traceback (most recent call last):
  File "main.py", line 197, in <module>
    enhancer = Enhancer(args.enhancer.type, args.enhancer.ckpt, device=device)
  File "G:\DDSP-SVC\enhancer.py", line 15, in __init__
    self.enhancer = NsfHifiGAN(enhancer_ckpt, device=self.device)
  File "G:\DDSP-SVC\enhancer.py", line 85, in __init__
    self.model, self.h = load_model(model_path, device=self.device)
  File "G:\DDSP-SVC\nsf_hifigan\models.py", line 17, in load_model
    with open(config_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'pretrain/nsf_hifigan\\config.json'

The error log mentions pretrain/nsf_hifigan\\config.json, but I don't know how this file should be generated or downloaded. How can I resolve this?

Samples

Hello author, thank you for the work. If you don't mind, I would like to listen to some of the samples, if there are any; I would like to hear the best-quality ones produced by this algorithm.

Thanks in advance.

Question

I'd like to ask: why does the crepe f0 algorithm keep throwing an error when I run preprocessing with DDSP? RuntimeError: cuFFT error: CUFFT_INVALID_SIZE

I am using the integrated package from the Bilibili uploader 羽毛布球.

I have 4 GB of VRAM.

(Screenshots attached: 2023-08-28 231643, 2023-08-28 224809, 2023-08-28 224757.)

gui_diff.py -> ValueError: [x] Unknown Model: DiffusionNew

ran on python 3.8 (windows11) + cuda 11.8 + torch 2.0.0 + torchaudio 2.0.1
gui_diff.py would close itself on "start conversion", so I ran it in PyCharm to see what happened:

Traceback (most recent call last):
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 580, in
gui = GUI()
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 229, in init
self.launcher() # start
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 311, in launcher
self.event_handler()
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 335, in event_handler
self.start_vc()
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 447, in start_vc
self.svc_model.update_model(self.config.checkpoint_path)
File "[...file name...]\DDSP-SVC-4.0\gui_diff.py", line 52, in update_model
self.model, self.args = load_model(model_path, device=self.device)
File "[...file name...]\DDSP-SVC-4.0\ddsp\vocoder.py", line 486, in load_model
raise ValueError(f" [x] Unknown Model: {args.model.type}")
ValueError: [x] Unknown Model: DiffusionNew

The model name "DiffusionNew" comes from https://github.com/yxlllc/DDSP-SVC/blob/master/train_diff.py, while it certainly shouldn't be read from
https://github.com/yxlllc/DDSP-SVC/blob/master/ddsp/vocoder.py

Does this mean I loaded the wrong model?
The model was downloaded from https://github.com/yxlllc/DDSP-SVC/releases/download/4.0/opencpop+kiritan.zip
and I had .../DDSP-SVC-4.0/exp/diffusion-new-demo/model_200000.pt loaded (I suppose that's done by choosing it).
Also, the demo itself works fine (command line), but the GUI closes itself.

Hardware Reqs

Hey, can you provide the hardware requirements to run this, please?

Training

How long does it usually take to train a voice model? I have a 1-hour wav file and a GTX 1660 Super.

DDSP Gui meaning

What do kstep and phase vocoder mean? Is it accent control?

Question

Is there anywhere people share their trained models?

Training has no effect

I trained with 4.0 for 10k and 100k steps and compared them: the converted audio shows no difference at all, and it is also very far from the target timbre.

The training used the default configuration and the default pre-trained model throughout, with no changes.


Language

When I downloaded and installed, it was in Chinese (I think) and I couldn't change it to English

ModuleNotFoundError: jax requires jaxlib to be installed. Windows 11

Hello. Recently, I've downgraded my CUDA to version 11.8(V11.8.89) and cuDNN to 8.6.0 for running another program. I'm not sure if this has caused the issue, but when I try to run DDSP, I encounter an error stating that 'jaxlib' is not installed, and thus I'm unable to use it.

Here are the commands I ran:

PS C:\Users\Pawn\DDSP-SVC> python main_diff.py -i input\audio.wav -ddsp exp\test\conbsub\model_90000.pt -diff exp\test\diff\model_15000.pt -o output\audio.wav -k 4 -id 1 -diffid 1 -speedup 1 -method dpm-solver -kstep 30
Traceback (most recent call last):
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax_src\lib_init_.py", line 24, in
import jaxlib as jaxlib
ModuleNotFoundError: No module named 'jaxlib'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\Users\Pawn\DDSP-SVC\main_diff.py", line 12, in
from ddsp.vocoder import load_model, F0_Extractor, Volume_Extractor, Units_Encoder
File "C:\Users\Pawn\DDSP-SVC\ddsp\vocoder.py", line 10, in
from transformers import HubertModel, Wav2Vec2FeatureExtractor
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers_init_.py", line 26, in
from . import dependency_versions_check
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\dependency_versions_check.py", line 17, in
from .utils.versions import require_version, require_version_core
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils_init_.py", line 30, in
from .generic import (
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\generic.py", line 33, in
import jax.numpy as jnp
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax_init_.py", line 35, in
from jax import config as _config_module
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax\config.py", line 17, in
from jax._src.config import config # noqa: F401
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax_src\config.py", line 24, in
from jax.src import lib
File "C:\Users\Pawn\AppData\Local\Programs\Python\Python310\lib\site-packages\jax_src\lib_init
.py", line 26, in
raise ModuleNotFoundError(
ModuleNotFoundError: jax requires jaxlib to be installed. See https://github.com/google/jax#installation for installation instructions.


and I have already tried these:
C:\Users\Pawn>pip install --upgrade "jax"
Requirement already satisfied: jax in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (0.4.12)
Requirement already satisfied: ml-dtypes>=0.1.0 in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (from jax) (0.2.0)
Requirement already satisfied: numpy>=1.21 in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (from jax) (1.23.5)
Requirement already satisfied: opt-einsum in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (from jax) (3.3.0)
Requirement already satisfied: scipy>=1.7 in c:\users\pawn\appdata\local\programs\python\python310\lib\site-packages (from jax) (1.9.3)

C:\Users\Pawn>pip install --upgrade "jaxlib"
ERROR: Could not find a version that satisfies the requirement jaxlib (from versions: none)
ERROR: No matching distribution found for jaxlib


How can I resolve this issue?

ImportError: Not enough memory resources are available to process this command

❯ python train_diff.py -c configs/diffusion-new.yaml
Traceback (most recent call last):
  File "train_diff.py", line 7, in <module>
    from diffusion.vocoder import Vocoder, Unit2Mel, Unit2Wav
  File "D:\AI\DDSP-SVC\diffusion\vocoder.py", line 11, in <module>
    from ddsp.vocoder import CombSubFast
  File "D:\AI\DDSP-SVC\ddsp\vocoder.py", line 7, in <module>
    import parselmouth
ImportError: DLL load failed while importing parselmouth: Not enough memory resources are available to process this command.

But I still have 20 GB of free memory?

Failure at "audio_callback" in gui_diff.py preventing usage

Of my sound devices, it works fine with my USB headset, but attempting to use pipewire, default (which is a Pulse backend), or Jack results in different errors. I'm not convinced one (or all) of these aren't a sounddevice issue.

Still, the result is no audio with any device selections besides directly to my USB headset.

event: start_vc
input device:21:default (ALSA)
output device:21:default (ALSA)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | current directory is /Sabrent/gpt/DDSP-SVC
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-10-31 17:04:17 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}

Starting callback
Infering...
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
| Load HifiGAN:  pretrain/nsf_hifigan/model
...
sola_shift: 0
Exception ignored from cffi callback <function _StreamBase.__init__.<locals>.callback_ptr at 0x7fa5f96b6f70>:
Traceback (most recent call last):
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 886, in callback_ptr
    return _wrap_callback(
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 2687, in _wrap_callback
    callback(*args)
  File "/Sabrent/gpt/DDSP-SVC/gui_diff.py", line 489, in audio_callback
    outdata[:] = temp_wav[: - self.crossfade_frame, None].repeat(1, 2).cpu().numpy()
ValueError: could not broadcast input array from shape (35280,2) into shape (35280,64)
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
event: stop_vc
Audio block passed.
ENDing VC

When using "pipewire":

event: start_vc
input device:21:default (ALSA)
output device:21:default (ALSA)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | current directory is /Sabrent/gpt/DDSP-SVC
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-10-31 17:04:17 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}

Starting callback
Infering...
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
| Load HifiGAN:  pretrain/nsf_hifigan/model
...
Audio block passed.
Removing weight norm...
sola_shift: 0
Exception ignored from cffi callback <function _StreamBase.__init__.<locals>.callback_ptr at 0x7fa5801e1700>:
Traceback (most recent call last):
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 886, in callback_ptr
    return _wrap_callback(
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 2687, in _wrap_callback
    callback(*args)
  File "/Sabrent/gpt/DDSP-SVC/gui_diff.py", line 489, in audio_callback
    outdata[:] = temp_wav[: - self.crossfade_frame, None].repeat(1, 2).cpu().numpy()
ValueError: could not broadcast input array from shape (35280,2) into shape (35280,64)
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
event: stop_vc
Audio block passed.
ENDing VC

The last one, JACK, is the most baffling. It dies with SIGKILL, which I'm not issuing myself. I see no messages in the journalctl about it whatsoever, either, so I'm not sure what's causing it:

event: start_vc
input device:22:G733 Gaming Headset Mono (JACK Audio Connection Kit)
output device:25:G733 Gaming Headset Analog Stereo (JACK Audio Connection Kit)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt

Starting callback
Infering...
Audio block passed.
Killed
(venv) [doneill@galena DDSP-SVC]$ 

The webUI suddenly won't open; it keeps showing loading

I had been running it normally and successfully for about a week (successful training + successful inference), but an hour ago the webUI suddenly wouldn't open and kept showing loading. In between, I switched browsers many times, turned off all browser plugins, closed all other software including the antivirus, and restarted the computer many times, to no avail.

I wonder what enhancer_adaptive_key does

I'm not sure what the function of enhancer_adaptive_key is and how it differs from a simple key change. After using it, the key of the original music stays the same, but the singer's timbre seems to be applied as if the pitch were a little higher. Is this correct?

Solution to serious memory leaks in preprocessing under Linux

Please use the following command to force pytorch to update to the nightly version:

cu118: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 --force-reinstall
cu121: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --force-reinstall

Which Units_Encoder is preferable?

Hi, @yxlllc Great work. Thank you

'hubertsoft', 'hubertbase', 'hubertbase768', 'contentvec', 'contentvec768' or 'contentvec768l12'

To balance the problem of content information loss and timbre leakage, which Units_Encoder is preferable?

Can I use a TPU?

If anyone has used one, I was wondering if it's faster than the Tesla T4?

Error when loading the pre-trained base model for training

PS G:\DDSP-SVC> python train.py -c configs/combsub.yaml
 > config: configs/combsub.yaml
 >    exp: exp/combsub-test
 [DDSP Model] Combtooth Subtractive Synthesiser
 [*] restoring model from exp/combsub-test\model_300000.pt
Traceback (most recent call last):
  File "train.py", line 68, in <module>
    initial_global_step, model, optimizer = utils.load_model(args.env.expdir, model, optimizer, device=args.device)
  File "G:\DDSP-SVC\logger\utils.py", line 119, in load_model
    model.load_state_dict(ckpt['model'])
  File "C:\Users\29099\.virtualenvs\DDSP-SVC-YOgpXN-h\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CombSubFast:
        Unexpected key(s) in state_dict: "unit2ctrl.spk_embed.weight".

Configuration file:

data:
  f0_extractor: 'parselmouth' # 'parselmouth', 'dio', 'harvest', or 'crepe'
  f0_min: 65 # about C2
  f0_max: 800 # about G5
  sampling_rate: 44100
  block_size: 512 # Equal to hop_length
  duration: 2 # Audio duration during training, must be less than the duration of the shortest audio clip
  encoder: 'hubertsoft' # 'hubertsoft', 'hubertbase' or 'contentvec'
  encoder_sample_rate: 16000
  encoder_hop_size: 320
  encoder_out_channels: 256
  encoder_ckpt: pretrain/hubert/hubert-soft-0d54a1f4.pt
  train_path: data/train # Create a folder named "audio" under this path and put the audio clip in it
  valid_path: data/val # Create a folder named "audio" under this path and put the audio clip in it
model:
  type: 'CombSubFast'
  n_spk: 1 # max number of different speakers
enhancer:
    type: 'nsf-hifigan'
    ckpt: 'pretrain/nsf_hifigan/model'
loss:
  fft_min: 256
  fft_max: 2048
  n_scale: 4 # rss kernel numbers
device: cuda
env:
  expdir: exp/combsub-test
  gpu_id: 0
train:
  num_workers: 0 # If your cpu and gpu are both very strong, set to 0 may be faster!
  batch_size: 24
  cache_all_data: true # Save Internal-Memory or Graphics-Memory if it is false, but may be slow
  cache_device: 'cuda' # Set to 'cuda' to cache the data into the Graphics-Memory, fastest speed for strong gpu
  cache_fp16: true
  epochs: 100000
  interval_log: 10
  interval_val: 2000
  lr: 0.0005
  weight_decay: 0

FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'

Hi,
First, I got the error ValueError: [x] Unknown Model: DiffusionNew. After reading your solution, I left the model address on the left side of the GUI empty, and I read and saved the config file in the exp/diffusion-test directory in the GUI. When I press start conversion, I see this error: FileNotFoundError: [Errno 2] No such file or directory: 'config.yaml'
I checked, config.yaml is in exp/diffusion-test directory.
Please let me know what to do. Thanks.

NaN loss every time I train

Traceback (most recent call last):
File "C:\Users\bencj\Desktop\DDSP\DDSP-SVC-master\train.py", line 92, in
train(args, initial_global_step, model, optimizer, loss_func, loader_train, loader_valid)
File "C:\Users\bencj\Desktop\DDSP\DDSP-SVC-master\solver.py", line 100, in train
raise ValueError(' [x] nan loss ')
ValueError: [x] nan loss

Do you know how to resolve this issue?

change voice fail by command

I used the command line to convert a wav file; the sound changes, but not to chino's voice. Did I miss anything? This is the command:

cd F:\sd-webui\DDSP\DDSP-SVC; ./runtime/Scripts/activate.bat; ./runtime/python.exe main.py -i F:\sd-webui\DDSP\test\test.wav -m exp/model_chino.pt -o F:\sd-webui\DDSP\test\chino.wav -k 0 -id 1 -e true -eak 0

output:
`
2023-04-17 10:05:36 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX

[DDSP Model] Combtooth Subtractive Synthesiser

[Loading] exp/model_chino.pt

Pitch extractor type: crepe

Extracting the pitch curve of the input audio...

Extracting the volume envelope of the input audio...

[Encoder Model] HuBERT Soft

[Loading] pretrain/hubert/hubert-soft-0d54a1f4.pt

Enhancer type: nsf-hifigan

| Load HifiGAN: pretrain/nsf_hifigan/model

Removing weight norm...

Speaker ID: 1

Cut the input audio into 2 slices

100%
`

This command fails too:
cd F:\sd-webui\DDSP\DDSP-SVC; ./runtime/Scripts/activate.bat; ./runtime/python.exe main.py -i F:\sd-webui\DDSP\test\test.wav -m exp/model_chino.pt -o F:\sd-webui\DDSP\test\chino.wav -k 0 -id 1 -e false

output:
`
2023-04-17 10:07:58 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX

[DDSP Model] Combtooth Subtractive Synthesiser

[Loading] exp/model_chino.pt

Pitch extractor type: crepe

Extracting the pitch curve of the input audio...

Extracting the volume envelope of the input audio...

[Encoder Model] HuBERT Soft

[Loading] pretrain/hubert/hubert-soft-0d54a1f4.pt

Enhancer type: none (using raw output of ddsp)

Speaker ID: 1

Cut the input audio into 2 slices

100%
`

Traceback

I saw the following traceback when I tried to upload the vocal-only target audio and infer with the two models I had just trained (i.e., ddsp model: 4000 steps, diffusion model: 6000 steps). How can I fix this?
Can anyone please help? Thanks!

Traceback (most recent call last):
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\routes.py", line 393, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\blocks.py", line 1111, in process_api
data = self.postprocess_data(fn_index, result["prediction"], state)
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\blocks.py", line 1036, in postprocess_data
prediction_value = postprocess_update_dict(
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\blocks.py", line 432, in postprocess_update_dict
prediction_value["value"] = block.postprocess(prediction_value["value"])
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\components.py", line 2427, in postprocess
file_path = self.make_temp_copy_if_needed(y)
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\components.py", line 245, in make_temp_copy_if_needed
temp_dir = self.hash_file(file_path)
File "C:\Users\user\Downloads\DDSP-SVC-3.0\workenv\lib\site-packages\gradio\components.py", line 217, in hash_file
with open(file_path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'output\1751_Vocal_WAV__1688874956.wav'

Question about diff model saving

Does this interval_force_save refer to keeping the models from the most recent 20k steps?
Right now only one diff model is kept; should it be changed to interval_force_save == 0?

Error during inference: OSError('Model file not found: pretrain/checkpoint_best_legacy_500.pt') even though checkpoint_best_legacy_500.pt is in the correct location

I have already placed checkpoint_best_legacy_500.pt in the correct location, but the error persists.
Error message:
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\routes.py", line 488, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1431, in process_api
result = await self.call_function(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1109, in call_function
prediction = await anyio.to_thread.run_sync(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio_backends_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio_backends_asyncio.py", line 807, in run
result = context.run(func, *args)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 706, in wrapper
response = f(*args, **kwargs)
File "E:\GIT\so-vits-svc\webUI.py", line 129, in modelAnalysis
raise gr.Error(e)
gradio.exceptions.Error: OSError('Model file not found: pretrain/checkpoint_best_legacy_500.pt')

Opening gui.py reports sounddevice.PortAudioError: Error opening Stream: Illegal combination of I/O devices [PaErrorCode -9993]

python 3.8
cuda11.8
torch2.0.0
torchaudio 2.0.1
windows10
All the models are configured. The error is as follows; how should I resolve it?
PS D:\AI\Audio\DDSP-SVC\DDSP-SVC> D:\AI\Audio\DDSP-SVC\Python38\python.exe gui.py --help
2023-05-06 07:16:43 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
event: stop_vc
event: start_vc
input device:1:阵列麦克风 (AMD Audio Device) (MME)
output device:7:扬声器 (Realtek(R) Audio) (Windows DirectSound)
crossfade_time:0.04
buffer_num:4
samplerate:44100
block_time:0.3
prefix_pad_length:1.13
mix_mode:None
enhancer:True
using_cuda:True
[DDSP Model] Combtooth Subtractive Synthesiser
[Loading] exp\multi_speaker\model_300000.pt
[Encoder Model] HuBERT Soft
[Loading] pretrain/hubert/hubert-soft-0d54a1f4.pt
| Load HifiGAN: pretrain/nsf_hifigan/model
Removing weight norm...
Exception in thread Thread-1:
Traceback (most recent call last):
File "D:\AI\Audio\DDSP-SVC\Python38\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "D:\AI\Audio\DDSP-SVC\Python38\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "gui.py", line 370, in soundinput
with sd.Stream(callback=self.audio_callback, blocksize=self.block_frame, samplerate=self.config.samplerate,
File "D:\AI\Audio\DDSP-SVC\Python38\lib\site-packages\sounddevice.py", line 1800, in init
_StreamBase.init(self, kind='duplex', wrap_callback='array',
File "D:\AI\Audio\DDSP-SVC\Python38\lib\site-packages\sounddevice.py", line 898, in init
_check(_lib.Pa_OpenStream(self._ptr, iparameters, oparameters,
File "D:\AI\Audio\DDSP-SVC\Python38\lib\site-packages\sounddevice.py", line 2747, in _check
raise PortAudioError(errormsg, err)
sounddevice.PortAudioError: Error opening Stream: Illegal combination of I/O devices [PaErrorCode -9993]

Questions about preprocessing methods

Is there a big difference between using the Combtooth subtractive synthesizer method and the Sinusoids additive synthesizer method in the preprocessing process? If so, which one produces better results?

Is there a pre-trained model for ContentVec encoder?

Thanks for releasing the pretrained model for DDSP training, but the model seems to only be applicable to the Hubertsoft encoder. I would like to ask if there are any pre-trained models based on ContentVec(768layer12). If not, are there any plans to release such models in the future?

Vocal range question

If the vocal range of the training speaker is not consistent with that of the target speaker, is there a good way to handle it?

CUDA out of Memory Error

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.66 GiB (GPU 0; 8.00 GiB total capacity; 809.64 MiB already allocated; 5.02 GiB free; 1.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thank you for the wonderful program! But I have this problem. Can you tell me how to fix it?

I can't get it to work.

I tried to do it, but I get this.
Traceback (most recent call last):
File "D:\DDSP-SVC\train.py", line 94, in
train(args, initial_global_step, model, optimizer, loss_func, loader_train, loader_valid)
File "D:\DDSP-SVC\solver.py", line 83, in train
for batch_idx, data in enumerate(loader_train):
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\dataloader.py", line 633, in next
data = self._next_data()
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1348, in _next_data
return self._process_data(data)
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1374, in _process_data
data.reraise()
File "D:\DDSP-SVC\venv\lib\site-packages\torch_utils.py", line 665, in reraise
raise exception
RecursionError: Caught RecursionError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data_utils\worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data_utils\fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\DDSP-SVC\venv\lib\site-packages\torch\utils\data_utils\fetch.py", line 51, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\DDSP-SVC\data_loaders.py", line 187, in getitem
return self.getitem( (file_idx + 1) % len(self.paths))
File "D:\DDSP-SVC\data_loaders.py", line 187, in getitem
return self.getitem( (file_idx + 1) % len(self.paths))
File "D:\DDSP-SVC\data_loaders.py", line 187, in getitem
return self.getitem( (file_idx + 1) % len(self.paths))
[Previous line repeated 1988 more times]
File "D:\DDSP-SVC\data_loaders.py", line 186, in getitem
if data_buffer['duration'] < (self.waveform_sec + 0.1):
RecursionError: maximum recursion depth exceeded in comparison

error while starting training

File "D:\New folder (4)\DDSP-SVC\diffusion\data_loaders.py", line 202, in getitem
if data_buffer['duration'] < (self.waveform_sec + 0.1):
RecursionError: maximum recursion depth exceeded in comparison

How low does the loss need to go for good results?

When I train, I only get about 1 batch/s, so training for 100000 epochs takes too long; also, when using the pre-trained model, the training loss basically stops decreasing...

Error in the inference process

Traceback (most recent call last):
File "D:\mnt\0)DDSP-SVC\main.py", line 261, in
seg_output, _, (s_h, s_n) = model(seg_units, seg_f0, seg_volume, spk_id = spk_id, spk_mix_dict = spk_mix_dict)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "D:\mnt\0)DDSP-SVC\ddsp\vocoder.py", line 628, in forward
ctrls, hidden = self.unit2ctrl(units_frames, f0_frames, phase_frames, volume_frames, spk_id=spk_id, spk_mix_dict=spk_mix_dict, aug_shift=aug_shift)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "D:\mnt\0)DDSP-SVC\ddsp\unit2control.py", line 78, in forward
x = self.stack(units.transpose(1,2)).transpose(1,2)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\container.py", line 215, in forward
input = module(input)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\conv.py", line 310, in forward
return self._conv_forward(input, self.weight, self.bias)
File "C:\Users\wasan\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\conv.py", line 306, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [256, 768, 3], expected input[1, 256, 450] to have 768 channels, but got 256 channels instead

Is there a way to fix this?
