
Comments (12)

WenbingWei avatar WenbingWei commented on July 22, 2024

Hi, I can't reproduce his results either. What PESQ score have you reached so far?

from cmgan.

hbwu-ntu avatar hbwu-ntu commented on July 22, 2024

Very low, only 3.2, far lower than the paper's. By the way, what PESQ can you reproduce?


WenbingWei avatar WenbingWei commented on July 22, 2024

I trained with the parameters published in the code, and the data processing follows the TSTNN code. At epoch 50 the PESQ is 3.24; after that, gen_loss starts to rise and the model stops converging.


hbwu-ntu avatar hbwu-ntu commented on July 22, 2024

Thanks for making the implementation details clear. It seems our results are similar. Did you downsample the data to 16 kHz the same way as the CMGAN GitHub repo?


WenbingWei avatar WenbingWei commented on July 22, 2024

No, I downsampled the original Voice Bank + DEMAND data to 16 kHz myself. I suspect the results in the paper cannot be reproduced because of the hyperparameters and the learning rate.


hbwu-ntu avatar hbwu-ntu commented on July 22, 2024

Apart from the learning rate, which other hyperparameters do you consider crucial for reproducing the results?
I recall that the authors address the learning rate in the paper, but in my experience I cannot reproduce the results with that learning rate.


WenbingWei avatar WenbingWei commented on July 22, 2024

I think the learning rate in this code is better suited to speech separation models such as the classic Conv-TasNet. I also think the loss weights are very important, but I don't know what weight to assign to each loss term to reach the optimum.
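For context, combining several loss terms with tunable weights, as discussed above, can be sketched as follows. The term names and weight values here are hypothetical placeholders for illustration, not the values CMGAN actually uses; finding good weights is exactly the tuning problem being described:

```python
# Sketch of a weighted multi-term loss; the weights are hypothetical.
def combined_loss(loss_mag, loss_ri, loss_time, loss_gan,
                  w_mag=0.9, w_ri=0.1, w_time=0.2, w_gan=0.05):
    """Weighted sum of individual loss terms (magnitude, real/imaginary,
    time-domain, adversarial). Tuning the w_* values changes which term
    dominates training."""
    return (w_mag * loss_mag + w_ri * loss_ri
            + w_time * loss_time + w_gan * loss_gan)

# Example with dummy scalar loss values:
total = combined_loss(0.5, 0.4, 0.3, 0.2)
# 0.9*0.5 + 0.1*0.4 + 0.2*0.3 + 0.05*0.2 = 0.56
```

Because the terms live on different scales, a small change in one weight can shift which loss the optimizer effectively minimizes, which may explain the sensitivity observed here.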


SherifAbdulatif avatar SherifAbdulatif commented on July 22, 2024

Sorry about that, but this was never an issue for us, and we haven't received this complaint about PESQ before. Did you try the checkpoint in src/best_ckpt?


SherifAbdulatif avatar SherifAbdulatif commented on July 22, 2024

Quoting the question being answered:

  Hi! Your paper and code are excellent! I have learned a lot about speech enhancement from the paper, and I find your code very well structured and clear. Thank you so much!

  I cannot reproduce the results in your paper. I would just like to confirm some settings for running the experiments:

  1. About the loss weights: do you use the setting in your paper or the setting on your GitHub?
  2. About the epoch number: do you use 50 as in the paper or 120 as in the GitHub repo?
  3. How do you select the final model for inference?
  4. Why do you set the utterance length to 16 * 16000 during testing?
  5. How do you downsample the audio? Could you share the script?

Answers:

  1. For reproducing, you can use the checkpoint in src/best_ckpt.
  2. The weights are the same as in the paper; you can find more details here.
  3. When the loss saturates; this can happen between 50 and 75 epochs.
  4. In testing the length is variable, but 10 is the maximum time that can run on our GPU; otherwise we need to split the track.
  5. We already saved the data downsampled; you can download it from https://drive.google.com/file/d/1pGV79T3k030f6uc2SbUpuNhfovtmLJxN/view?usp=sharing
     Alternatively, we used librosa, which follows the same downsampler as torch:

     import librosa
     audio_down, sr = librosa.load(audio_path, sr=16000)
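The original Voice Bank + DEMAND recordings are 48 kHz, so going to 16 kHz is a factor-of-3 rate reduction. As a minimal illustration of the rate arithmetic only, naive decimation looks like the sketch below; a real pipeline should use librosa or torchaudio, which apply a proper anti-aliasing low-pass filter before discarding samples:

```python
def decimate(samples, src_sr=48000, dst_sr=16000):
    """Naive decimation: keep every (src_sr // dst_sr)-th sample.
    Illustration only -- without low-pass filtering this aliases;
    librosa.load(path, sr=16000) handles the filtering properly."""
    assert src_sr % dst_sr == 0, "integer rate ratio assumed"
    step = src_sr // dst_sr
    return samples[::step]

one_second = list(range(48000))   # pretend 1 s of 48 kHz audio
down = decimate(one_second)       # 16000 samples: 1 s at 16 kHz
```

Differences in the resampler (anti-aliasing filter shape, phase) can slightly shift PESQ, which is one reason using the pre-downsampled data linked above is the safer option for reproduction.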


hbwu-ntu avatar hbwu-ntu commented on July 22, 2024

Thank you very much for your warm and detailed response. I will follow the instructions provided and make an effort to reproduce the results:

  • download the data as mentioned in response 5
  • use the same weights as in response 2
  • test all the checkpoints between 50 and 75 epochs as in response 3.

By the way, I have some follow-up questions:

  1. During testing, why do you set the length as variable? Fixing the batch size at 1 for testing shouldn't cause any GPU OOM issues.
  2. I downloaded your checkpoint from src/best_ckpt and ran testing. The numbers are slightly worse but close to the results reported in your paper, and I'm still attempting to reproduce them from scratch. Have you run multiple trials and computed the mean and variance of performance? This would help ensure that the positive results are not solely due to a lucky initialization.


SherifAbdulatif avatar SherifAbdulatif commented on July 22, 2024

Variable length avoids any normalization issues when splitting the tracks, and it is much more convenient than padding tracks to a predefined maximum length or splitting tracks that exceed it.
No, actually the results in the paper are from the best checkpoint, not averaged over multiple trials; however, your point is a very interesting insight and should be considered in our future studies.
Thanks!
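To illustrate the normalization issue mentioned above: if each utterance is scaled by a gain computed from its own samples (a common per-utterance normalization in enhancement pipelines, assumed here for illustration), then splitting a track into chunks produces different gains per chunk than normalizing the whole track once, so the pieces are scaled inconsistently:

```python
import math

def norm_gain(samples):
    """Per-utterance gain c = sqrt(N / sum(x^2)); a common
    normalization choice, assumed here for illustration."""
    return math.sqrt(len(samples) / sum(s * s for s in samples))

track = [0.1] * 100 + [0.9] * 100   # quiet half followed by loud half

whole = norm_gain(track)            # one gain for the full track
first = norm_gain(track[:100])      # gain of the quiet chunk alone
second = norm_gain(track[100:])     # gain of the loud chunk alone

# first > whole > second: the quiet chunk gets boosted far more than
# it would under whole-track normalization, and the loud chunk less,
# so the rejoined halves no longer match.
```

Processing each track at its natural length sidesteps this entirely, since every utterance is normalized exactly once.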


SherifAbdulatif avatar SherifAbdulatif commented on July 22, 2024

However, it is worth mentioning that, based on several training trials, the results are fairly consistent.

