zfturbo / music-source-separation-training Goto Github PK

View Code? Open in Web Editor NEW

339.0 339.0 45.0 298 KB

Repository for training models for music source separation.

Python 100.00%

music-source-separation-training's People

Contributors

Stargazers

Watchers

music-source-separation-training's Issues

bs roformer stem model

I am able to download the weights but there is no config file link it just takes me back to here

Increasing the speed of training

@ZFTurbo, Привет еще раз, заметил что Вы из России/Казахстана, приятно видеть это)
У меня появилась идея, я портировал Ваш скрипт на Google Kaggle (выложу его на GitHub чуть позже), как думаете, возможно ли ускорить тренировку модели путем запуска алгоритма на нескольких серверах?

Представим, что у нас есть 4 бесплатных аккаунта Google Collab с GPU T4. Было бы здорово если можно на 4 аккаунтах запустить скрипт тренировки, и Google Коллабы обменивались через DDP (PyTorch Distributed Data Parallel) или Horovod (фреймворк, разработанный Uber, для распределенного обучения моделей глубокого обучения).

Это сильно ускорит тренировку моделей, и выйдет бесплатно, чем аренда сервера с 8xTESLA T4.
Я готов написать код который будет запускать на разных Google Collab аккаунтах скрипт. Нужно лишь связать сервера.

Я не знаю тонкостей обучения моделей, хочу послушать Ваше мнение на счет этого, еще раз спасибо за большой вклад в сферу нейросетей и музыки.

Также подскажите пожалуйста кол-во треков в Вашем датасете для mdx23c. Я нашел отличный ресурс где можно получить оригинальные стемсы популярных треков студийного качества. https://songstems.net/ Возможно это будет полезно для вас) Хотелось бы собрать датасет но возможно будет проблема с загрузкой его куда-либо, авторские права и DMCA как никак(
Есть возможность, получить ваш Телеграм?

Training outcome issues

Hi author, I used htdemucs to train musdb18hq dataset, and trained more than 700 epochs why the vocals result is still only 4.56db

Ensemble Demixing using multiple models at once?

Hello, I have the seperation training setup properly and am able to utilize which model I choose. This is great, but I see that the leader boards have first place using MDX23C and BS Roformer (https://mvsep.com/quality_checker/entry/6240).

I've tried but can't get multiple models to be used at once - is there a way for us to use two models in an ensemble mode at once like the top of the leaderboards did using this separation training script?

Thanks so much!

Potential optimizing training?

Hi @ZFTurbo. First of all, thank you very much for this repo, well beyond my expectations (bs-roformer training) with the other models.

Training settings:

I use the config_musdb18_mel_band_roformer.yaml with only two parameters changes: batch_size: 2 and target_instrument: drums.

Partial training report:

------------------------------------------
  Epoch  |   Train loss   |   SDR drums
------------------------------------------
    0    |    2.016814    |    2.3636
    1    |    1.836035    |    2.7670
    2    |    1.768249    |    3.6295
    3    |    1.754987    |    3.6963
    4    |    1.698568    |    3.4976
    5    |    1.665295    |    4.1241
    6    |    1.657550    |    4.4150
    7    |    1.617158    |    4.4690
    8    |    1.600137    |    4.5702
    9    |    1.594573    |    4.6221
   10    |    1.522084    |    4.8008
   11    |    1.524944    |    4.9101
   12    |    1.504670    |    5.0437
   13    |    1.478117    |    5.1029
   14    |    ........    |    ......
------------------------------------------

Training timing:

Mean Train Time (1000 steps / epoch) : 4 min 20 sec
Mean Valid Time (  50 steps / epoch) : 8 min 00 sec

EDIT: with a batch_size of 1, I have a better training time : ~ 2 min 30. 
Probably a question of low VRAM size. The valid time stay the same.

Computer settings:

CPU : 13th Gen Intel® Core™ i9-13900K
GPU : NVIDIA GeForce RTX 4080 (16GB)
RAM : 64GB
SSD : Kingston FURY Renegade (2TB)
HDD : Seagate SkyHawk AI (16TB)

Questions:

What do you think about the training report values (loss/SDR)? Seem good no?
Are the timing correct for my configuration?
Is the validation duration normal? (2/3 of the total duration of an epoch)

About the provided pre-trained checkpoints:

The training configuration is set to 1000 epochs but I would like to know how many epoch you have to do to get the Vocal SDR 8.42 on the pre-trained (same model)? Currently, for a maximum of 1000 epochs training, the whole duration on my settings will be : 1000 * 12 min 20 sec, so 740000 sec / ~ 12333 min / ~ 205 hours / ~ 8-9 days.

Potential training time optimization

The current training dataset model is based on direct-from-disk loading for audio chunk. I heard the 'digging' of the reading at each step during training. The MUSDB18HQ is (only) 29GB of size for the whole dataset, so can fit on the RAM on most of modern computer? Don't you think that this could improve the performance of the training steps and that it could be an additional option?

Thank a lot for your answer!

UVR ONNX SUPPORT?

Why not release the weights in .onnx so we can use it in UVR directly? or is there any way to do this easily?

inferencing

I tried to use a pretrained vocal model for bs roformer and this is the error i get, no idea how to fix it

PS C:\Users\matt\Downloads\Music-Source-Separation-Training-v.1.0.3\Music-Source-Separation-Training-v.1.0.3> python inference.py --model_type bs_roformer --config_path "C:\Users\matt\Downloads\model_bs_roformer_ep_317_sdr_12.9755.yaml.txt" --start_check_point
"C:\Users\matt\Downloads\model_bs_roformer_ep_317_sdr_12.9755 (1).ckpt" --input_folder
"C:\Users\matt\Downloads\input" --store_dir "C:\Users\matt\Downloads\Music" --device_ids 0

Instruments: ['vocals', 'other']
Traceback (most recent call last):
File "inference.py", line 99, in
proc_folder(None)
File "inference.py", line 75, in proc_folder
model = get_model_from_config(args.model_type, config)
File "C:\Users\matt\Downloads\Music-Source-Separation-Training-v.1.0.3\Music-Source-Separation-Training-v.1.0.3\utils.py", line 28, in get_model_from_config
model = BSRoformer(
File "<@beartype(models.bs_roformer.bs_roformer.BSRoformer.init) at 0x24e8d24f160>", line 98, in init
TypeError: init() got an unexpected keyword argument 'linear_transformer_depth'
PS C:\Users\matt\Downloads\Music-Source-Separation-Training-v.1.0.3\Music-Source-Separation-Training-v.1.0.3>

Training colab for custom dataset

Hi,
This is a really great repo. Can you share a sample colab notebook, that we can link with drive and train on different custom datasets? Also, I am a newbie in this domain. What should be the folder structure?

I have two folders

Tabla wav files (Indian musical instrument)
Dhaak wav files(Indian musical instrument)

SDR not changing between epoches

This may not be an actual issue, but my training method. Would like thoughts on my application.

I'm using the starting weight from mdx23c and the vocals config file.
I have ~300 tracks that I'm training with and 5 for testing. Which takes about 30 minutes per epoch on my 12 GB card.
I used to have the datasets split 80% (215 tracks) training and 20% (85 tracks) testing which would take over an hour just for the testing part.

SDR for the 80-20 dataset were Avg 13.8
SDR for the reduced testing are Avg 14.9

However, after the first training epoch any subsequent epoch the SDR stagnates in both training methods. And since my card has 12GB I can only afford to batch train on a single song, not the 6 for the project configured 48 GB memory cards. This is the only change I made to the configure file.

`audio:
chunk_size: 261120
dim_f: 4096
dim_t: 256
hop_length: 1024
n_fft: 8192
num_channels: 2
sample_rate: 44100
min_mean_abs: 0.001

model:
act: gelu
bottleneck_factor: 4
growth: 128
norm: InstanceNorm
num_blocks_per_scale: 2
num_channels: 128
num_scales: 5
num_subbands: 4
scale:

training:
batch_size: 1
gradient_accumulation_steps: 1
grad_clip: 0
instruments:

vocals
other
lr: 9.0e-05
patience: 2
reduce_factor: 0.95
target_instrument: null
num_epochs: 1000
num_steps: 1000
augmentation: false # enable augmentations by audiomentations and pedalboard
augmentation_type: simple1
use_mp3_compress: false # Deprecated
augmentation_mix: true # Mix several stems of the same type with some probability
augmentation_loudness: true # randomly change loudness of each stem
augmentation_loudness_type: 1 # Type 1 or 2
augmentation_loudness_min: 0.5
augmentation_loudness_max: 1.5
q: 0.95
coarse_loss_clip: true
ema_momentum: 0.999
optimizer: adam
other_fix: true # it's needed for checking on multisong dataset if other is actually instrumental

inference:
batch_size: 1
dim_t: 256
num_overlap: 4`

What might explain the stagnating SDR? Could the batch size configuration be to blame or testing size? Would restarting the training after epoch 0 be advisable since it does improve the SDR after the first epoch, but perhaps it's overtraining then?

I need to get rid of the error.

I am still a beginner in programming.
But I would love to use this wonderful program of yours.
However, after downloading it, I have all the necessary files and when I run the code, I get an error.
I don't know how to solve it for me.
Could you please tell me what is wrong?
This programming of yours is going to be a big dream come true for me.
Thanks.

(base) C:\Users\User\Desktop\Music-Source-Separation-Training-main>python train.py
Traceback (most recent call last):
File "C:\Users\User\Desktop\Music-Source-Separation-Training-main\train.py", line 515, in
train_model(None)
File "C:\Users\User\Desktop\Music-Source-Separation-Training-main\train.py", line 332, in train_model
model, config = get_model_from_config(args.model_type, args.config_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\Desktop\Music-Source-Separation-Training-main\utils.py", line 14, in get_model_from_config
with open(config_path) as f:
^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType

Support for single channel output

There's a lot of similarity between enhancement and deverberation models with source separation. They can use the same backbones, etc.

Really the only difference is that you are outputting only a single channel. The noise or reverberation you remove isn't good to minimize loss over, only the clean signal.

It would be great if this repo supported single channel output datasets!

(And, you could consider adding speech enhancement backbones like sepformer from speechbrain. This new diffusion based model, SGMSE has good pretrained checkpoints that would probably be good after finetuning for separating vocals from non vocals, assuming the new finetuning included reverberated vocals.)

Training loss: 0.000000；SDR value remains unchanged.

Hello, author, why is the SDR the same for every epoch after 60 epochs of training using the musdb18hq dataset, and the training loss remains 0 all the time? I used the following command:

inference with swin_upernet fails

py -m inference.py --model swin_upernet \
  --config_path config_vocals_swin_upernet.yaml \
  --start_check_point /models/MSST/model_swin_upernet_ep_56_sdr_10.6703.ckpt ...

config downloaded from releases.

Failed to import transformers.models.upernet.modeling_upernet because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub'

split_torch_state_dict_into_shards is included starting from huggingface-hub version 0.23.0.
tokenizers 0.14.0 and 0.14.1 requires huggingface-hub<18.0
and transformers==4.35.0 requires tokenizers=0.14.*

Installed by changing requirements.txt:

huggingface-hub>=0.23
transformers~=4.35.0

But:

ERROR: Could not find a version that satisfies the requirement pedalboard==0.8.1 (from versions: 0.8.2, 0.8.3, 0.8.4, 0.8.5, 0.8.6, 0.8.7, 0.8.8, 0.8.9, 0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.9.4, 0.9.5, 0.9.6, 0.9.7, 0.9.8, 0.9.9, 0.9.10, 0.9.11, 0.9.12)
ERROR: No matching distribution found for pedalboard==0.8.1

So:
pedalboard~=0.8.1

Then:

--> 110 with torch.amp.autocast(enabled=config.training.use_amp):
    111     with torch.inference_mode():
    112         if config.training.target_instrument is not None:

File c:\apps\MSST\Lib\site-packages\ml_collections\config_dict\config_dict.py:829, in ConfigDict.__getattr__(self, attribute)
    827   return self[attribute]
    828 except KeyError as e:
--> 829   raise AttributeError(e)

AttributeError: "'use_amp'"

So, a few errors later, utils.py:

- with torch.cuda.amp.autocast(enabled=config.training.use_amp):
+ with torch.amp.autocast(device):

Then:
ValueError: Make sure that the channel dimension of the pixel values match with the one set in the configuration.

So, I looked at:
if len(batch_data) >= batch_size or (i >= mix.shape[1]):
arr = torch.stack(batch_data, dim=0)
arr.shape = torch.Size([1, 2, 261632])

I was confused, so mix.shape = torch.Size([2, 3038448]). I didn't think it would work, but I tried concatenate:
arr = torch.cat(batch_data, dim=0)
arr.shape = torch.Size([2, 261632])

The value error went away, but:

    [177](file:///C:/apps/MSST/models/upernet_swin_transformers.py:177) def cac2cws(self, x):
    [178](file:///C:/apps/MSST/models/upernet_swin_transformers.py:178)     k = self.num_subbands
--> [179](file:///C:/apps/MSST/models/upernet_swin_transformers.py:179)     b, c, f, t = x.shape
    [180](file:///C:/apps/MSST/models/upernet_swin_transformers.py:180)     x = x.reshape(b, c, k, f // k, t)
    [181](file:///C:/apps/MSST/models/upernet_swin_transformers.py:181)     x = x.reshape(b, c * k, f // k, t)

ValueError: not enough values to unpack (expected 4, got 3)

I'm out of ideas now.

extract_instrumental

python inference.py \
--model_type mdx23c
--config_path configs/config_mdx23c_musdb18.yaml
--start_check_point results/last_mdx23c.ckpt
--input_folder input/wavs/
--store_dir separation_results/

--extract_instrumental ??

How should I write the extract_instrumental field?

torchseg model gpu usage on and off

it really slows down training because it just stops using the gpu randomly and then starts again...

Config:

audio:
  chunk_size: 261632
  dim_f: 4096
  dim_t: 512
  hop_length: 512
  n_fft: 8192
  num_channels: 2
  sample_rate: 44100
  min_mean_abs: 0.001

model:
  encoder_name: resnet152 # look with torchseg.list_encoders(). Currently 858 available
  decoder_type: fpn # unet, fpn
  act: gelu
  num_channels: 128
  num_subbands: 8

training:
  batch_size: 12
  gradient_accumulation_steps: 1
  grad_clip: 0
  instruments:
  - vocals
  - bass
  - drums
  - other
  lr: 5.0e-05
  patience: 3
  reduce_factor: 0.95
  target_instrument: null
  num_epochs: 1000
  num_steps: 2000
  q: 0.95
  coarse_loss_clip: true
  ema_momentum: 0.999
  optimizer: adamw
  other_fix: false # it's needed for checking on multisong dataset if other is actually instrumental
  use_amp: true # enable or disable usage of mixed precision (float16) - usually it must be true

augmentations:
  enable: true # enable or disable all augmentations (to fast disable if needed)
  loudness: true # randomly change loudness of each stem on the range (loudness_min; loudness_max)
  loudness_min: 0.5
  loudness_max: 1.5
  mixup: true # mix several stems of same type with some probability (only works for dataset types: 1, 2, 3)
  mixup_probs: !!python/tuple # 2 additional stems of the same type (1st with prob 0.2, 2nd with prob 0.02)
    - 0.2
    - 0.02
  mixup_loudness_min: 0.5
  mixup_loudness_max: 1.5

  # apply mp3 compression to mixture only (emulate downloading mp3 from internet)
  mp3_compression_on_mixture: 0.01
  mp3_compression_on_mixture_bitrate_min: 32
  mp3_compression_on_mixture_bitrate_max: 320
  mp3_compression_on_mixture_backend: "lameenc"

  all:
    channel_shuffle: 0.5 # Set 0 or lower to disable
    random_inverse: 0.1 # inverse track (better lower probability)
    random_polarity: 0.5 # polarity change (multiply waveform to -1)
    mp3_compression: 0.01
    mp3_compression_min_bitrate: 32
    mp3_compression_max_bitrate: 320
    mp3_compression_backend: "lameenc"

  vocals:
    pitch_shift: 0.1
    pitch_shift_min_semitones: -5
    pitch_shift_max_semitones: 5
    seven_band_parametric_eq: 0.25
    seven_band_parametric_eq_min_gain_db: -9
    seven_band_parametric_eq_max_gain_db: 9
    tanh_distortion: 0.1
    tanh_distortion_min: 0.1
    tanh_distortion_max: 0.7
  other:
    pitch_shift: 0.1
    pitch_shift_min_semitones: -4
    pitch_shift_max_semitones: 4
    gaussian_noise: 0.1
    gaussian_noise_min_amplitude: 0.001
    gaussian_noise_max_amplitude: 0.015
    time_stretch: 0.01
    time_stretch_min_rate: 0.8
    time_stretch_max_rate: 1.25


inference:
  batch_size: 1
  dim_t: 512
  num_overlap: 1

Image of gpu use:

Issue with inference on swin_upernet models: ValueError: Make sure that the channel dimension of the pixel values match with the one set in the configuration.

I was trying to test out the pre trained swin_upernet model you provided but uncountered the following:

Traceback (most recent call last):
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\inference.py", line 99, in <module>
    proc_folder(None)
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\inference.py", line 95, in proc_folder
    run_folder(model, args, config, device, verbose=False)
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\inference.py", line 44, in run_folder
    res = demix_track(config, model, mixture, device)
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\utils.py", line 62, in demix_track
    x = model(part.unsqueeze(0))[0]
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\models\upernet_swin_transformers.py", line 201, in forward
    x = self.swin_upernet_model(x).logits
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\models\upernet\modeling_upernet.py", line 406, in forward
    outputs = self.backbone.forward_with_filtered_kwargs(
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\utils\backbone_utils.py", line 210, in forward_with_filtered_kwargs
    return self(*args, **filtered_kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\models\swin\modeling_swin.py", line 1313, in forward
    embedding_output, input_dimensions = self.embeddings(pixel_values)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\models\swin\modeling_swin.py", line 263, in forward
    embeddings, output_dimensions = self.patch_embeddings(pixel_values)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\models\swin\modeling_swin.py", line 315, in forward
    raise ValueError(
ValueError: Make sure that the channel dimension of the pixel values match with the one set in the configuration.

I've made no changes to the configs and have tried updating my packages but no luck.

Any plans to support exporting onnx and TensorRT engine?

for bs roformer model

validation dataset folder structure for dataset type 2

Hi, I'm training a vocals_mdx23c so am using dataset 2 from my understanding

The training folder and config file is all set up
I've confirmed this is working with a quick run
The validation folder isn't correct and is bugging

I'm a little confused, how do I set up the validation folder? The same as the training folder (training -> (vocals and other folders))

Ensemble Script

Regarding the ensemble script, an explanation of the types could be added, and also, when I use the “fft” types, the following error occurs

Traceback (most recent call last):
  File "G:\ensemble\ensemble.py", line 162, in <module>
    ensemble_files(None)
  File "G:\ensemble\ensemble.py", line 156, in ensemble_files
    res = average_waveforms(data, weights, args.type)
  File "G:\ensemble\ensemble.py", line 89, in average_waveforms
    spec = stft(pred_track[i], 2048, 1024)
  File "G:\ensemble\ensemble.py", line 14, in stft
    spec_left = librosa.stft(wave_left, nfft, hop_length=hl)
TypeError: stft() takes 1 positional argument but 2 positional arguments (and 1 keyword-only argument) were given

With the following command
python ensemble.py --files "1_01. Wings of Reality_(Vocals).flac" "1_01. Wings of Reality_MDX23C-8KFFT-InstVoc_HQ_(Vocals).flac" --type avg_fft --weights 2 1

error in demix_track function

i am training mel_band_roformer and changed train.py line 500 if 0 to if 1
when starting SDR evaluation

  File "train.py", line 517, in <module>
    train_model(None)
  File "train.py", line 501, in train_model
    sdr_avg = valid(model, args, config, device, verbose=False)
  File "train.py", line 136, in valid
    res = demix_track(config, model, mixture, device)
  File "utils.py", line 130, in demix_track
    result[..., start:start+l] += x[j][..., :l].cpu() * window[..., :l]
RuntimeError: The size of tensor a (131418) must match the size of tensor b (131584) at non-singleton dimension 1

arr.size() is torch.Size([4, 2, 131584])
x.size() is torch.Size([4, 2, 131418])

MelBand RoFormer Vocal pre-trained model: Missing key(s) in state_dict and size mismatch for mask_estimators

Errors when running inference using the pre-trained MelBand RoFormer Vocal model and the associated Config file.

RuntimeError: Error(s) in loading state_dict for MelBandRoformer:
	Missing key(s) in state_dict: "mask_estimators.0.to_freqs.0.0.4.weight", "mask_estimators.0.to_freqs.0.0.4.bias", ... 
	size mismatch for mask_estimators.0.to_freqs.0.0.2.weight: copying a param with shape torch.Size([56, 768]) from checkpoint, the shape in current model is torch.Size([768, 768]).
	size mismatch for mask_estimators.0.to_freqs.0.0.2.bias: copying a param with shape torch.Size([56]) from checkpoint, the shape in current model is torch.Size([768]). ...

There are some differences between the file linked in readme and the file in the main branch, but the errors happen with either file.

I don't have much useful information to add, but I did make a Colab notebook to eliminate variables. You can use it, too. It takes about four minutes to install and produce the error. https://colab.research.google.com/drive/1XTTWrs-FJKotFYtH8goTpaT3lIOTKUv1?usp=sharing

setup.py? Or at least top-level package? To work with this as a library

Do you have any interest into making a setup.py or at least a top-level package?

I have been working with your code in a pipeline of my own. However, it's annoying to import the packages here because I already have a train and utils to import.

So import music_source_separation_training.train would be great!

Post your model

To post your model, please, fill the form:

Description: 
Instruments:
Dataset (if known):
Metrics (if known):
Config link: 
Checkpoint link:

what dataset you use for vocals_mdx23c model？

Hi,
What dataset you use for vocals_mdx23c model training ? I want to reproduce your training and based on which to make improvement.

BandIt Plus weights

I think there is a mistake here
model_bandit_plus_dnr_sdr_11.47.chpt ,it must be .ckpt file not .chpt file
am i right ?

Creating a checkpoint fails + saving several model files

I use MDX23C for training and did a test run of 80 epochs but after running the training script again it started at epoch 0 instead of continue training. My folders are on C:.

It also doesn't show the loss while training on CMD.

Also, is it possible that it saves a model when the sdr is better than before?
(Example: epoch 50 had the last best sdr but now epoch 60 has a better sdr than epoch 50 etc.)

Post dataset

We can try to gather all public and private datasets in one place. To post dataset, please, fill the form:

Dataset name:
Description:
Instuments: 
Format: sample rate and compression
Volume: number of tracks
Size: in GB
Download link:

Training freezes when multiple GPUs are used

When I try training with a single GPU (RTX 6000 Ada), it runs fine, but if I try it with multiple GPUs (using the same config file), it gets stuck here:

Instruments: ['vocal', 'guitar', 'piano', 'drums', 'bass', 'synth']
Use augmentation for training
Loading songs data from cache: /workspace/results2/metadata_1.pkl. If you updated dataset remove metadata_1.pkl before training!
Found tracks in dataset: 15300
Use multi GPU: [0, 1, 2]
Patience: 2 Reduce factor: 0.95 Batch size: 12 Grad accum steps: 1 Effective batch size: 12
Train for: 1000
Train epoch: 0 Learning rate: 9e-05
0%| | 0/1000 [00:00<?, ?it/s]

I am calling the script like this:
!python Music-Source-Separation-Training/train.py
--model_type htdemucs
--config_path /workspace/config.yaml
--results_path /workspace/results/
--data_path /workspace/StemGMD/training_output
--valid_path /workspace/StemGMD/training_output_eval
--dataset_type 1
--num_workers 8
--device_ids 0 1 2

My config file:

audio:
chunk_size: 485100 # samplerate * segment
min_mean_abs: 0.000
hop_length: 1024

training:
batch_size: 4
gradient_accumulation_steps: 1
grad_clip: 0
segment: 11
shift: 1
samplerate: 44100
channels: 2
normalize: true
instruments: ['vocal', 'guitar', 'piano', 'drums', 'bass', 'synth']
target_instrument: null
num_epochs: 1000
num_steps: 1000
optimizer: adam
lr: 9.0e-05
patience: 2
reduce_factor: 0.95
q: 0.95
coarse_loss_clip: true
ema_momentum: 0.999
other_fix: false # it's needed for checking on multisong dataset if other is actually instrumental
use_amp: true # enable or disable usage of mixed precision (float16) - usually it must be true

augmentations:
enable: true # enable or disable all augmentations (to fast disable if needed)
loudness: true # randomly change loudness of each stem on the range (loudness_min; loudness_max)
loudness_min: 0.5
loudness_max: 1.5

inference:
num_overlap: 4
batch_size: 8

model: hdemucs

hdemucs: # see demucs/hdemucs.py for a detailed description
channels: 48
channels_time: null
growth: 2
nfft: 4096
wiener_iters: 0
end_iters: 0
wiener_residual: False
cac: True
depth: 6
rewrite: True
hybrid: True
hybrid_old: False
multi_freqs: []
multi_freqs_depth: 3
freq_emb: 0.2
emb_scale: 10
emb_smooth: True
kernel_size: 8
stride: 4
time_stride: 2
context: 1
context_enc: 0
norm_starts: 4
norm_groups: 4
dconv_mode: 1
dconv_depth: 2
dconv_comp: 4
dconv_attn: 4
dconv_lstm: 4
dconv_init: 0.001
rescale: 0.1

Inference killed by OOM reaper when using checkpoints

I'm running the following command on a server with 32GB of RAM, 8GB of swap, without a GPU and it's getting killed after a few seconds. It will fill both memory and swap completely before getting killed.

python3 inference.py --model_type htdemucs --config_path configs/config_musdb18_htdemucs.yaml --start_check_point checkpoints/htdemucs_ft_bass.th --input_folder /input --store_dir /output/

Is there a way to limit the amount of RAM being used ?

Dataset size

As I understand, you train vocal models on you internal dataset, but could you share the total length of audio duration? I want to understand what the dataset size is approximately needed to get such metrics.

Bandit v2 inference error: Unexpected key(s) in state_dict: "epoch", "global_step", "pytorch-lightning_version", "state_dict", "loops", "callbacks", "optimizer_states", "lr_schedulers".

Hi, I'm using windows + anaconda.
I've downloaded the bandit v2 model "checkpoint-eng.ckpt" from https://zenodo.org/records/12701995 (which was linked in the https://github.com/kwatcharasupat/bandit-v2 repository).
I've put the model in the models folder, but when I run the following command I'm getting an error:

(msst_2) C:\Users\dreiD\Documents\github\Music-Source-Separation-Training>python inference.py --model_type bandit_v2 --config_path configs\config_dnr_bandit_v2_mus64.yaml --start_check_point models\checkpoint-eng.ckpt --input_folder input --store_dir separation_results
Start from checkpoint: models\checkpoint-eng.ckpt
Traceback (most recent call last):
File "C:\Users\dreiD\Documents\github\Music-Source-Separation-Training\inference.py", line 129, in
proc_folder(None)
File "C:\Users\dreiD\Documents\github\Music-Source-Separation-Training\inference.py", line 109, in proc_folder
model.load_state_dict(state_dict)
File "C:\Users\dreiD\anaconda3\envs\msst_2\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Bandit:
Missing key(s) in state_dict: "stft.window", "istft.window", "band_split.norm_fc_modules.0.combined.0.weight", "band_split.norm_fc_modules.0.combined.0.bias", "band_split.norm_fc_modules.0.combined.1.weight", "band_split.norm_fc_modules.0.combined.1.bias", "band_split.norm_fc_modules.1.combined.0.weight", "band_split.norm_fc_modules.1.combined.0.bias", "band_split.norm_fc_modules.1.combined.1.weight", "band_split.norm_fc_modules.1.combined.1.bias", "band_split.norm_fc_modules.2.combined.0.weight", ....
...
"band_split.norm_fc_modules.63.combined.0.bias", "band_split.norm_fc_modules.63.combined.1.weight", "band_split.norm_fc_modules.63.combined.1.bias", "tf_model.seqband.0.norm.weight", "tf_model.seqband.0.norm.bias", "tf_model.seqband.0.rnn.weight_ih_l0", "tf_model.seqband.0.rnn.weight_hh_l0", "tf_model.seqband.0.rnn.bias_ih_l0", "tf_model.seqband.0.rnn.bias_hh_l0", "tf_model.seqband.0.rnn.weight_ih_l0_reverse", "tf_model.seqband.0.rnn.weight_hh_l0_reverse", "tf_model.seqband.0.rnn.bias_ih_l0_reverse", "tf_model.seqband.0.rnn.bias_hh_l0_reverse", "tf_model.seqband.0.fc.weight", "tf_model.seqband.0.fc.bias", "tf_model.seqband.2.norm.weight", "tf_model.seqband.2.norm.bias", "tf_model.seqband.2.rnn.weight_ih_l0", "tf_model.seqband.2.rnn.weight_hh_l0", "tf_model.seqband.2.rnn.bias_ih_l0", "tf_model.seqband.2.rnn.bias_hh_l0", "tf_model.seqband.2.rnn.weight_ih_l0_reverse", "tf_model.seqband.2.rnn.weight_hh_l0_reverse", "tf_model.seqband.2.rnn.bias_ih_l0_reverse", "tf_model.seqband.2.rnn.bias_hh_l0_reverse", "tf_model.seqband.2.fc.weight", "tf_model.seqband.2.fc.bias", "tf_model.seqband.4.norm.weight", "tf_model.seqband.4.norm.bias", "tf_model.seqband.4.rnn.weight_ih_l0", "tf_model.seqband.4.rnn.weight_hh_l0", "tf_model.seqband.4.rnn.bias_ih_l0", "tf_model.seqband.4.rnn.bias_hh_l0", "tf_model.seqband.4.rnn.weight_ih_l0_reverse", "tf_model.seqband.4.rnn.weight_hh_l0_reverse", "tf_model.seqband.4.rnn.bias_ih_l0_reverse", "tf_model.seqband.4.rnn.bias_hh_l0_reverse", "tf_model.seqband.4.fc.weight", "tf_model.seqband.4.fc.bias", "tf_model.seqband.6.norm.weight", "tf_model.seqband.6.norm.bias", "tf_model.seqband.6.rnn.weight_ih_l0", ...
"tf_model.seqband.30.rnn.weight_ih_l0", "tf_model.seqband.30.rnn.weight_hh_l0", "tf_model.seqband.30.rnn.bias_ih_l0", "tf_model.seqband.30.rnn.bias_hh_l0", "tf_model.seqband.30.rnn.weight_ih_l0_reverse", "tf_model.seqband.30.rnn.weight_hh_l0_reverse", "tf_model.seqband.30.rnn.bias_ih_l0_reverse", "tf_model.seqband.30.rnn.bias_hh_l0_reverse", "tf_model.seqband.30.fc.weight", "tf_model.seqband.30.fc.bias", "mask_estim.speech.freq_weights/0", "mask_estim.speech.freq_weights/1", "mask_estim.speech.freq_weights/2", "mask_estim.speech.freq_weights/3", "mask_estim.speech.freq_weights/4", "mask_estim.speech.freq_weights/5", "mask_estim.speech.freq_weights/6", "mask_estim.speech.freq_weights/7", "mask_estim.speech.freq_weights/8", "mask_estim.speech.freq_weights/9", "mask_estim.speech.freq_weights/10", "mask_estim.speech.freq_weights/11", "mask_estim.speech.freq_weights/12", "mask_estim.speech.freq_weights/13", "mask_estim.speech.freq_weights/14", "mask_estim.speech.freq_weights/15", "mask_estim.speech.freq_weights/16", "mask_estim.speech.freq_weights/17", "mask_estim.speech.freq_weights/18", "mask_estim.speech.freq_weights/19", "mask_estim.speech.freq_weights/20", "mask_estim.speech.freq_weights/21", "mask_estim.speech.freq_weights/22", "mask_estim.speech.freq_weights/23", "mask_estim.speech.freq_weights/24", "mask_estim.speech.freq_weights/25", "mask_estim.speech.freq_weights/26", "mask_estim.speech.freq_weights/27", "mask_estim.speech.freq_weights/28", "mask_estim.speech.freq_weights/29", "mask_estim.speech.freq_weights/30", "mask_estim.speech.freq_weights/31", "mask_estim.speech.freq_weights/32", "mask_estim.speech.freq_weights/33", "mask_estim.speech.freq_weights/34", "mask_estim.speech.freq_weights/35", "mask_estim.speech.freq_weights/36", "mask_estim.speech.freq_weights/37", "mask_estim.speech.freq_weights/38", "mask_estim.speech.freq_weights/39", "mask_estim.speech.freq_weights/40", "mask_estim.speech.freq_weights/41", "mask_estim.speech.freq_weights/42", "mask_estim.speech.freq_weights/43", "mask_estim.speech.freq_weights/44", "mask_estim.speech.freq_weights/45", "mask_estim.speech.freq_weights/46", "mask_estim.speech.freq_weights/47", "mask_estim.speech.freq_weights/48", "mask_estim.speech.freq_weights/49", "mask_estim.speech.freq_weights/50", "mask_estim.speech.freq_weights/51", "mask_estim.speech.freq_weights/52", "mask_estim.speech.freq_weights/53", "mask_estim.speech.freq_weights/54", "mask_estim.speech.freq_weights/55", "mask_estim.speech.freq_weights/56", "mask_estim.speech.freq_weights/57", "mask_estim.speech.freq_weights/58", "mask_estim.speech.freq_weights/59", "mask_estim.speech.freq_weights/60", "mask_estim.speech.freq_weights/61", "mask_estim.speech.freq_weights/62", "mask_estim.speech.freq_weights/63", "mask_estim.speech.norm_mlp.0.norm.weight", "mask_estim.speech.norm_mlp.0.norm.bias", "mask_estim.speech.norm_mlp.0.hidden.0.weight", "mask_estim.speech.norm_mlp.0.hidden.0.bias", "mask_estim.speech.norm_mlp.0.output.0.weight", "mask_estim.speech.norm_mlp.0.output.0.bias", "mask_estim.speech.norm_mlp.0.combined.0.weight", "mask_estim.speech.norm_mlp.0.combined.0.bias", "mask_estim.speech.norm_mlp.0.combined.1.0.weight", "mask_estim.speech.norm_mlp.0.combined.1.0.bias", "mask_estim.speech.norm_mlp.0.combined.2.0.weight", "mask_estim.speech.norm_mlp.0.combined.2.0.bias", "mask_estim.speech.norm_mlp.1.norm.weight", "mask_estim.speech.norm_mlp.1.norm.bias", "mask_estim.speech.norm_mlp.1.hidden.0.weight", "mask_estim.speech.norm_mlp.1.hidden.0.bias", "mask_estim.speech.norm_mlp.1.output.0.weight", "mask_estim.speech.norm_mlp.1.output.0.bias", "mask_estim.speech.norm_mlp.1.combined.0.weight", "mask_estim.speech.norm_mlp.1.combined.0.bias", "mask_estim.speech.norm_mlp.1.combined.1.0.weight", "mask_estim.speech.norm_mlp.1.combined.1.0.bias", "mask_estim.speech.norm_mlp.1.combined.2.0.weight", "mask_estim.speech.norm_mlp.1.combined.2.0.bias", "mask_estim.speech.norm_mlp.2.norm.weight", "mask_estim.speech.norm_mlp.2.norm.bias", "mask_estim.speech.norm_mlp.2.hidden.0.weight", "mask_estim.speech.norm_mlp.2.hidden.0.bias", "mask_estim.speech.norm_mlp.2.output.0.weight", "mask_estim.speech.norm_mlp.2.output.0.bias", "mask_estim.speech.norm_mlp.2.combined.0.weight", "mask_estim.speech.norm_mlp.2.combined.0.bias", "mask_estim.speech.norm_mlp.2.combined.1.0.weight", "mask_estim.speech.norm_mlp.2.combined.1.0.bias", "mask_estim.speech.norm_mlp.2.combined.2.0.weight", "mask_estim.speech.norm_mlp.2.combined.2.0.bias", "mask_estim.speech.norm_mlp.3.norm.weight", "mask_estim.speech.norm_mlp.3.norm.bias", "mask_estim.speech.norm_mlp.3.hidden.0.weight", ...
...
"mask_estim.speech.norm_mlp.63.norm.bias", "mask_estim.speech.norm_mlp.63.hidden.0.weight", "mask_estim.speech.norm_mlp.63.hidden.0.bias", "mask_estim.speech.norm_mlp.63.output.0.weight", "mask_estim.speech.norm_mlp.63.output.0.bias", "mask_estim.speech.norm_mlp.63.combined.0.weight", "mask_estim.speech.norm_mlp.63.combined.0.bias", "mask_estim.speech.norm_mlp.63.combined.1.0.weight", "mask_estim.speech.norm_mlp.63.combined.1.0.bias", "mask_estim.speech.norm_mlp.63.combined.2.0.weight", "mask_estim.speech.norm_mlp.63.combined.2.0.bias", "mask_estim.music.freq_weights/0", "mask_estim.music.freq_weights/1", "mask_estim.music.freq_weights/2", "mask_estim.music.freq_weights/3", "mask_estim.music.freq_weights/4", "mask_estim.music.freq_weights/5", "mask_estim.music.freq_weights/6", "mask_estim.music.freq_weights/7", "mask_estim.music.freq_weights/8", "mask_estim.music.freq_weights/9", "mask_estim.music.freq_weights/10", "mask_estim.music.freq_weights/11", "mask_estim.music.freq_weights/12", "mask_estim.music.freq_weights/13", "mask_estim.music.freq_weights/14", "mask_estim.music.freq_weights/15", "mask_estim.music.freq_weights/16", "mask_estim.music.freq_weights/17", "mask_estim.music.freq_weights/18", "mask_estim.music.freq_weights/19", "mask_estim.music.freq_weights/20", "mask_estim.music.freq_weights/21", "mask_estim.music.freq_weights/22", "mask_estim.music.freq_weights/23", "mask_estim.music.freq_weights/24", "mask_estim.music.freq_weights/25", "mask_estim.music.freq_weights/26", "mask_estim.music.freq_weights/27", "mask_estim.music.freq_weights/28", "mask_estim.music.freq_weights/29", "mask_estim.music.freq_weights/30", "mask_estim.music.freq_weights/31", "mask_estim.music.freq_weights/32", "mask_estim.music.freq_weights/33", "mask_estim.music.freq_weights/34", "mask_estim.music.freq_weights/35", "mask_estim.music.freq_weights/36", "mask_estim.music.freq_weights/37", "mask_estim.music.freq_weights/38", "mask_estim.music.freq_weights/39", "mask_estim.music.freq_weights/40", "mask_estim.music.freq_weights/41", "mask_estim.music.freq_weights/42", "mask_estim.music.freq_weights/43", "mask_estim.music.freq_weights/44", "mask_estim.music.freq_weights/45", "mask_estim.music.freq_weights/46", "mask_estim.music.freq_weights/47", "mask_estim.music.freq_weights/48", "mask_estim.music.freq_weights/49", "mask_estim.music.freq_weights/50", "mask_estim.music.freq_weights/51", "mask_estim.music.freq_weights/52", "mask_estim.music.freq_weights/53", "mask_estim.music.freq_weights/54", "mask_estim.music.freq_weights/55", "mask_estim.music.freq_weights/56", "mask_estim.music.freq_weights/57", "mask_estim.music.freq_weights/58", "mask_estim.music.freq_weights/59", "mask_estim.music.freq_weights/60", "mask_estim.music.freq_weights/61", "mask_estim.music.freq_weights/62", "mask_estim.music.freq_weights/63", "mask_estim.music.norm_mlp.0.norm.weight", "mask_estim.music.norm_mlp.0.norm.bias", "mask_estim.music.norm_mlp.0.hidden.0.weight", "mask_estim.music.norm_mlp.0.hidden.0.bias", "mask_estim.music.norm_mlp.0.output.0.weight", "mask_estim.music.norm_mlp.0.output.0.bias", "mask_estim.music.norm_mlp.0.combined.0.weight", "mask_estim.music.norm_mlp.0.combined.0.bias", "mask_estim.music.norm_mlp.0.combined.1.0.weight", "mask_estim.music.norm_mlp.0.combined.1.0.bias", "mask_estim.music.norm_mlp.0.combined.2.0.weight", "mask_estim.music.norm_mlp.0.combined.2.0.bias", "mask_estim.music.norm_mlp.1.norm.weight", "mask_estim.music.norm_mlp.1.norm.bias", "mask_estim.music.norm_mlp.1.hidden.0.weight", "mask_estim.music.norm_mlp.1.hidden.0.bias", "mask_estim.music.norm_mlp.1.output.0.weight", "mask_estim.music.norm_mlp.1.output.0.bias", "mask_estim.music.norm_mlp.1.combined.0.weight", "mask_estim.music.norm_mlp.1.combined.0.bias", "mask_estim.music.norm_mlp.1.combined.1.0.weight", "mask_estim.music.norm_mlp.1.combined.1.0.bias", "mask_estim.music.norm_mlp.1.combined.2.0.weight", "mask_estim.music.norm_mlp.1.combined.2.0.bias", "mask_estim.music.norm_mlp.2.norm.weight", "mask_estim.music.norm_mlp.2.norm.bias", "mask_estim.music.norm_mlp.2.hidden.0.weight", "mask_estim.music.norm_mlp.2.hidden.0.bias", "mask_estim.music.norm_mlp.2.output.0.weight", "mask_estim.music.norm_mlp.2.output.0.bias", "mask_estim.music.norm_mlp.2.combined.0.weight", "mask_estim.music.norm_mlp.2.combined.0.bias", "mask_estim.music.norm_mlp.2.combined.1.0.weight", "mask_estim.music.norm_mlp.2.combined.1.0.bias", "mask_estim.music.norm_mlp.2.combined.2.0.weight", "mask_estim.music.norm_mlp.2.combined.2.0.bias", "mask_estim.music.norm_mlp.3.norm.weight", "mask_estim.music.norm_mlp.3.norm.bias", "mask_estim.music.norm_mlp.3.hidden.0.weight", "mask_estim.music.norm_mlp.3.hidden.0.bias", "mask_estim.music.norm_mlp.3.output.0.weight", "mask_estim.music.norm_mlp.3.output.0.bias", "mask_estim.music.norm_mlp.3.combined.0.weight", "mask_estim.music.norm_mlp.3.combined.0.bias", "mask_estim.music.norm_mlp.3.combined.1.0.weight", "mask_estim.music.norm_mlp.3.combined.1.0.bias", "mask_estim.music.norm_mlp.3.combined.2.0.weight", "mask_estim.music.norm_mlp.3.combined.2.0.bias", "mask_estim.music.norm_mlp.4.norm.weight", ...
...
"mask_estim.music.norm_mlp.63.norm.bias", "mask_estim.music.norm_mlp.63.hidden.0.weight", "mask_estim.music.norm_mlp.63.hidden.0.bias", "mask_estim.music.norm_mlp.63.output.0.weight", "mask_estim.music.norm_mlp.63.output.0.bias", "mask_estim.music.norm_mlp.63.combined.0.weight", "mask_estim.music.norm_mlp.63.combined.0.bias", "mask_estim.music.norm_mlp.63.combined.1.0.weight", "mask_estim.music.norm_mlp.63.combined.1.0.bias", "mask_estim.music.norm_mlp.63.combined.2.0.weight", "mask_estim.music.norm_mlp.63.combined.2.0.bias", "mask_estim.sfx.freq_weights/0", "mask_estim.sfx.freq_weights/1", "mask_estim.sfx.freq_weights/2", "mask_estim.sfx.freq_weights/3", "mask_estim.sfx.freq_weights/4", "mask_estim.sfx.freq_weights/5", "mask_estim.sfx.freq_weights/6", "mask_estim.sfx.freq_weights/7", "mask_estim.sfx.freq_weights/8", "mask_estim.sfx.freq_weights/9", "mask_estim.sfx.freq_weights/10", "mask_estim.sfx.freq_weights/11", "mask_estim.sfx.freq_weights/12", "mask_estim.sfx.freq_weights/13", "mask_estim.sfx.freq_weights/14", "mask_estim.sfx.freq_weights/15", "mask_estim.sfx.freq_weights/16", "mask_estim.sfx.freq_weights/17", "mask_estim.sfx.freq_weights/18", "mask_estim.sfx.freq_weights/19", "mask_estim.sfx.freq_weights/20", "mask_estim.sfx.freq_weights/21", "mask_estim.sfx.freq_weights/22", "mask_estim.sfx.freq_weights/23", "mask_estim.sfx.freq_weights/24", "mask_estim.sfx.freq_weights/25", "mask_estim.sfx.freq_weights/26", "mask_estim.sfx.freq_weights/27", "mask_estim.sfx.freq_weights/28", "mask_estim.sfx.freq_weights/29", "mask_estim.sfx.freq_weights/30", "mask_estim.sfx.freq_weights/31", "mask_estim.sfx.freq_weights/32", "mask_estim.sfx.freq_weights/33", "mask_estim.sfx.freq_weights/34", "mask_estim.sfx.freq_weights/35", "mask_estim.sfx.freq_weights/36", "mask_estim.sfx.freq_weights/37", "mask_estim.sfx.freq_weights/38", "mask_estim.sfx.freq_weights/39", "mask_estim.sfx.freq_weights/40", "mask_estim.sfx.freq_weights/41", "mask_estim.sfx.freq_weights/42", "mask_estim.sfx.freq_weights/43", "mask_estim.sfx.freq_weights/44", "mask_estim.sfx.freq_weights/45", "mask_estim.sfx.freq_weights/46", "mask_estim.sfx.freq_weights/47", "mask_estim.sfx.freq_weights/48", "mask_estim.sfx.freq_weights/49", "mask_estim.sfx.freq_weights/50", "mask_estim.sfx.freq_weights/51", "mask_estim.sfx.freq_weights/52", "mask_estim.sfx.freq_weights/53", "mask_estim.sfx.freq_weights/54", "mask_estim.sfx.freq_weights/55", "mask_estim.sfx.freq_weights/56", "mask_estim.sfx.freq_weights/57", "mask_estim.sfx.freq_weights/58", "mask_estim.sfx.freq_weights/59", "mask_estim.sfx.freq_weights/60", "mask_estim.sfx.freq_weights/61", "mask_estim.sfx.freq_weights/62", "mask_estim.sfx.freq_weights/63", "mask_estim.sfx.norm_mlp.0.norm.weight", "mask_estim.sfx.norm_mlp.0.norm.bias", "mask_estim.sfx.norm_mlp.0.hidden.0.weight", "mask_estim.sfx.norm_mlp.0.hidden.0.bias", "mask_estim.sfx.norm_mlp.0.output.0.weight", "mask_estim.sfx.norm_mlp.0.output.0.bias", "mask_estim.sfx.norm_mlp.0.combined.0.weight", "mask_estim.sfx.norm_mlp.0.combined.0.bias", "mask_estim.sfx.norm_mlp.0.combined.1.0.weight", "mask_estim.sfx.norm_mlp.0.combined.1.0.bias", "mask_estim.sfx.norm_mlp.0.combined.2.0.weight", "mask_estim.sfx.norm_mlp.0.combined.2.0.bias", "mask_estim.sfx.norm_mlp.1.norm.weight", "mask_estim.sfx.norm_mlp.1.norm.bias", "mask_estim.sfx.norm_mlp.1.hidden.0.weight", "mask_estim.sfx.norm_mlp.1.hidden.0.bias", "mask_estim.sfx.norm_mlp.1.output.0.weight", "mask_estim.sfx.norm_mlp.1.output.0.bias", "mask_estim.sfx.norm_mlp.1.combined.0.weight", "mask_estim.sfx.norm_mlp.1.combined.0.bias", "mask_estim.sfx.norm_mlp.1.combined.1.0.weight", "mask_estim.sfx.norm_mlp.1.combined.1.0.bias", "mask_estim.sfx.norm_mlp.1.combined.2.0.weight", "mask_estim.sfx.norm_mlp.1.combined.2.0.bias", "mask_estim.sfx.norm_mlp.2.norm.weight", "mask_estim.sfx.norm_mlp.2.norm.bias", "mask_estim.sfx.norm_mlp.2.hidden.0.weight", "mask_estim.sfx.norm_mlp.2.hidden.0.bias", "mask_estim.sfx.norm_mlp.2.output.0.weight", "mask_estim.sfx.norm_mlp.2.output.0.bias", "mask_estim.sfx.norm_mlp.2.combined.0.weight", "mask_estim.sfx.norm_mlp.2.combined.0.bias", "mask_estim.sfx.norm_mlp.2.combined.1.0.weight", "mask_estim.sfx.norm_mlp.2.combined.1.0.bias", "mask_estim.sfx.norm_mlp.2.combined.2.0.weight", "mask_estim.sfx.norm_mlp.2.combined.2.0.bias", "mask_estim.sfx.norm_mlp.3.norm.weight", "mask_estim.sfx.norm_mlp.3.norm.bias", "mask_estim.sfx.norm_mlp.3.hidden.0.weight", "mask_estim.sfx.norm_mlp.3.hidden.0.bias", "mask_estim.sfx.norm_mlp.3.output.0.weight", "mask_estim.sfx.norm_mlp.3.output.0.bias", "mask_estim.sfx.norm_mlp.3.combined.0.weight", "mask_estim.sfx.norm_mlp.3.combined.0.bias", "mask_estim.sfx.norm_mlp.3.combined.1.0.weight", "mask_estim.sfx.norm_mlp.3.combined.1.0.bias", "mask_estim.sfx.norm_mlp.3.combined.2.0.weight", "mask_estim.sfx.norm_mlp.3.combined.2.0.bias",
...
"mask_estim.sfx.norm_mlp.63.hidden.0.weight", "mask_estim.sfx.norm_mlp.63.hidden.0.bias", "mask_estim.sfx.norm_mlp.63.output.0.weight", "mask_estim.sfx.norm_mlp.63.output.0.bias", "mask_estim.sfx.norm_mlp.63.combined.0.weight", "mask_estim.sfx.norm_mlp.63.combined.0.bias", "mask_estim.sfx.norm_mlp.63.combined.1.0.weight", "mask_estim.sfx.norm_mlp.63.combined.1.0.bias", "mask_estim.sfx.norm_mlp.63.combined.2.0.weight", "mask_estim.sfx.norm_mlp.63.combined.2.0.bias".
Unexpected key(s) in state_dict: "epoch", "global_step", "pytorch-lightning_version", "state_dict", "loops", "callbacks", "optimizer_states", "lr_schedulers".

I've tried setting up a new conda env and re downloaded the repository but I'm getting the same error.
Other models like the bandit plus model did work.

Roformer for more than two stems?

Is roformer only configured for two stems (for example vocal and other)? I tried to run it with 4 stems (vocal / bass / guitar / drums) but it failed with a tensor size mismatch. Is there a config setting I'm missing? Thanks

inference error with pretrained mdx23c model

Hi, i run the inference with your pretrained mdx23c model and get following errors:

(py3.10) python inference.py --model_type mdx23c \                                                                                [17:48:11]
    --config_path configs/config_vocals_mdx23c.yaml \
    --start_check_point pretrained/model_vocals_mdx23c_sdr_10.17.ckpt \
    --input_folder /soundx_speech_data/eval_audio/singing \
    --store_dir separation_results/
Instruments: ['vocals', 'other']
Start from checkpoint: pretrained/model_vocals_mdx23c_sdr_10.17.ckpt
Total tracks found: 7
  0%|                                                                                                                  | 0/7 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data1/martinzheng/tts/backend/Music-Source-Separation-Training/inference.py", line 99, in <module>
    proc_folder(None)
  File "/data1/martinzheng/tts/backend/Music-Source-Separation-Training/inference.py", line 95, in proc_folder
    run_folder(model, args, config, device, verbose=False)
  File "/data1/martinzheng/tts/backend/Music-Source-Separation-Training/inference.py", line 44, in run_folder
    res = demix_track(config, model, mixture, device)
  File "/data1/martinzheng/tts/backend/Music-Source-Separation-Training/utils.py", line 56, in demix_track
    mix = nn.functional.pad(mix, (border, border), mode='reflect')
NotImplementedError: Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now

any ideas?

MDX23C checkpoint training

Hi, I am a beginner in machine learning. I want to try to train the MUSDB18HQ dataset with additional STEMS. Am I correct that I can do this and I don't need to download 20GB dataset?

I tried to run the training but got the answer that there are 0 tracks in the dataset. How many VALID (TEST) tracks would you recommend to upload in relation to the whole dataset, 20% is ok?

test/valid tracks:

[REQ] add a (GH-compliant) license file

Hi there, 1st of all thanks for this awesome work !

Since we've 'doxed' it in our HyMPS project (under AUDIO section \ AI-based page \ Source Separation), can you please add a GH-compliant license file for it ?

As you know, expliciting licensing terms is extremely important to let anyone better/faster understand how to reuse/adapt/modify sources (and not only) in other open projects and vice-versa.

Although it may sounds like a minor aspect, license file omission obviously causes an inconsistent generation of the relative badge too:

(badge-generator URL: https://badgen.net/github/license/ZFTurbo/Music-Source-Separation-Training)

You can easily set a standardized one through the GH's license wizard tool.

Last but not least, let us know how we could improve - in your opinion - our categorizations and links to resources in order to favor collaboration between developers (and therefore evolution) of listed projects.

Hope that helps/inspires !

ModuleNotFoundError

I got the following error. Please let me know how to resolve it.

Traceback (most recent call last):
File "C:\Users\Users\desktop\Music-Source-Separation-Training-main\train.py", line 515, in
train_model(None)
File "C:\Users\Users\desktop\Music-Source-Separation-Training-main\train.py", line 332, in train_model
model, config = get_model_from_config(args.model_type, args.config_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Users\desktop\Music-Source-Separation-Training-main\utils.py", line 24, in get_model_from_config
from models.demucs4ht import get_model
File "C:\Users\Users\desktop\Music-Source-Separation-Training-main\models\demucs4ht.py", line 10, in
from demucs.demucs import Demucs
ModuleNotFoundError: No module named 'demucs'

Epoch counter does not resume when resuming from start checkpoint.

It seems to reset to 0 every time

Train / Valid folder structure for MUSDB18

Hi,
First, thanks for the code.
If I understand correctly, creating appropriate train/val datasets require to provide a dataset_type (for training songs) and valid_path (for validation songs). In other words, the expected folder structure is (e.g., for a dataset type 1 such as MUSDB):

dataset/train/song 1/{ vocals.wav, bass.wav, drums.wav, other.wav}
dataset/train/song 2/{ vocals.wav, bass.wav, drums.wav, other.wav}
...........
dataset/valid/song 1/{ vocals.wav, bass.wav, drums.wav, other.wav}
dataset/valid/song 2/{ vocals.wav, bass.wav, drums.wav, other.wav}
...........
dataset/test/song 1/{ vocals.wav, bass.wav, drums.wav, other.wav}
dataset/test/song 2/{ vocals.wav, bass.wav, drums.wav, other.wav}
...........

However, this is not the structure of MUSDB, which only contains a train and test subfolders: the songs used for validation are part of the train folder, and a fixed list of validation tracks is provided by the authors for reproducibility.

Therefore, in order to reproduce some papers' results using MUSDB18 (e.g., demucs, BSRNN), should we pre-process the dataset by manually creating a folder split? Or is there something I missed?

Thanks.
Paul

PS: one additional minor thing, I think an augmentation.mixup=True/False entry should be added in the config_musdb18_mel_band_roformer, config_musdb18_htdemucs, and config_musdb18_bs_roformer files, otherwise there's a "Missing key" error when instantiating these.

Inquiry about Default Instrument Settings and Dual Loudness Augmentation in bs_roformer

Hello, I am currently analyzing your code in an attempt to reproduce the paper performance of bs_roformer. While examining the code, I have come across a few points that I am curious about and would like to inquire further.

Upon reviewing the settings related to bs_roformer, I noticed that the configuration predominantly uses only vocals and other in the instruments setup. Generally, MSS datasets like MUSDB18 are composed of a 4-stem setup: [vocals, drums, bass, other], and the mixture is a combination of these four stems. However, your default setting is [vocals, other], and using this setting results in a mixture composed only of vocals + other. I am curious whether this configuration is an error or if there was a specific task intended for this particular setup.

Additionally, while examining the code, I noticed that loudness augmentation is applied once during the load_random_mix function when performing mixup, and again in getitem. I would like to clarify whether applying loudness augmentation twice is by design, a misunderstanding on my part, or a coding error.

I would appreciate your response to these two questions.

Thank you.

Possibility of real time vocal separation?

Thank you @ZFTurbo for providing the training code. I have tested out UVR MDX-Net models and found them to be very good. However, i am unable to get the models to work in real time. I wonder if the models provided here are capable of vocal separation in real time if i were to train them, or do i have to make some kind of adjustments to their architecture?