I read your impressive paper and thank you for releasing the training script. I am trying to reproduce the results on DSing (train30), but I am running into some problems.
My training overfits quickly. I compared my train_log.txt with yours and found that the training losses are in the same range, but my validation losses and WER/CER are much higher. I suspect that is why the lr scheduler reduces the lr faster than expected, which makes the overfitting worse. Below is my training log for the fine-tuning experiment:
epoch: 1, lr_model: 3.00e-04, lr_wav2vec: 1.00e-05 - train loss: 1.55 - valid loss: 1.23, valid ctc_loss: 2.15, valid seq_loss: 1.00, valid CER: 33.66, valid WER: 49.62
epoch: 2, lr_model: 3.00e-04, lr_wav2vec: 1.00e-05 - train loss: 1.30 - valid loss: 1.33, valid ctc_loss: 2.73, valid seq_loss: 9.79e-01, valid CER: 62.26, valid WER: 93.71
epoch: 3, lr_model: 2.40e-04, lr_wav2vec: 9.00e-06 - train loss: 1.22 - valid loss: 1.46, valid ctc_loss: 3.29, valid seq_loss: 1.00, valid CER: 90.94, valid WER: 1.45e+02
epoch: 4, lr_model: 1.92e-04, lr_wav2vec: 8.10e-06 - train loss: 1.18 - valid loss: 1.47, valid ctc_loss: 3.54, valid seq_loss: 9.49e-01, valid CER: 99.51, valid WER: 1.55e+02
epoch: 5, lr_model: 1.54e-04, lr_wav2vec: 7.29e-06 - train loss: 1.15 - valid loss: 1.39, valid ctc_loss: 3.18, valid seq_loss: 9.40e-01, valid CER: 82.79, valid WER: 1.19e+02
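If I read the log correctly, the lr trajectory above (3.00e-04 → 2.40e-04 → 1.92e-04 → 1.54e-04) is consistent with a NewBob-style scheduler that multiplies the lr by 0.8 whenever the validation loss fails to improve. Here is a minimal sketch of that behavior (my own simplified class, not the actual recipe code; the 0.8 factor and the improvement threshold are assumptions inferred from the log):

```python
class NewBobScheduler:
    """Minimal NewBob-style scheduler sketch: anneal the lr
    multiplicatively whenever validation loss stops improving."""

    def __init__(self, lr, annealing_factor=0.8, improvement_threshold=0.0025):
        self.lr = lr
        self.factor = annealing_factor
        self.threshold = improvement_threshold
        self.prev_loss = None

    def step(self, valid_loss):
        # Relative improvement over the previous epoch's validation loss.
        if self.prev_loss is not None:
            improvement = (self.prev_loss - valid_loss) / self.prev_loss
            if improvement < self.threshold:
                self.lr *= self.factor  # anneal when not improving enough
        self.prev_loss = valid_loss
        return self.lr
```

Feeding it my validation losses (1.23, 1.33, 1.46, ...) reproduces exactly the lr_model sequence in the log, so the scheduler itself seems to be reacting correctly to the bad validation loss rather than causing it.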
At first, I thought there was something wrong with my dev set. I ran inference on my dev and test sets using the checkpoint you provided, and it gives WER/CER similar to what you reported. Now I am confused and want to ask for help. Any insights would be appreciated.
I prepared my dev set with my own script, which should do the same thing as the Kaldi recipe, except that some problematic files are excluded. I ended up with 408 songs, a subset of the standard 482.
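For reference, the exclusion step is roughly this (a hypothetical sketch, not my actual script; the `DSing/dev` path is a placeholder, and a real check might use soundfile instead of the stdlib `wave` module, which only reads PCM WAV):

```python
import pathlib
import wave

def is_readable(path):
    """Return True if the WAV file opens and contains at least one frame."""
    try:
        with wave.open(str(path), "rb") as f:
            return f.getnframes() > 0
    except (wave.Error, OSError):
        return False

# Keep only the files that decode cleanly; the rest are the
# "problematic files" excluded from the 482-song dev set.
dev_songs = [p for p in pathlib.Path("DSing/dev").glob("*.wav")
             if is_readable(p)]
```

Happy to share the full script or the list of excluded files if that helps diagnose the gap.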