
Comments (5)

anton-l commented on May 19, 2024

Hi @raghav-menon !
My experience with low-resource languages was quite similar to yours (increasing validation loss + decreasing WER). I'm speculating that this is due to overfitting to high-frequency words.

I found that these hyperparameters work pretty well for most languages in CommonVoice:
attention_dropout=0.2, activation_dropout=0.1, hidden_dropout=0.05, final_dropout=0.1, feat_proj_dropout=0.05, mask_time_prob=0.05, layerdrop=0.04, learning_rate=3e-4, and an effective batch_size of 128 (obtained e.g. with batch_size=16 and gradient_accumulation_steps=8)
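In code, that configuration looks roughly like the sketch below (a minimal sketch assuming the usual Wav2Vec2ForCTC fine-tuning setup from the XLSR blog post; the output_dir is a placeholder, and in a real run you would also pass vocab_size and pad_token_id to match your language's tokenizer):

```python
from transformers import TrainingArguments, Wav2Vec2ForCTC

# Sketch of the suggested configuration; all dropout/masking values
# below are Wav2Vec2Config attributes forwarded via from_pretrained.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.2,
    activation_dropout=0.1,
    hidden_dropout=0.05,
    final_dropout=0.1,
    feat_proj_dropout=0.05,
    mask_time_prob=0.05,
    layerdrop=0.04,
)

# Effective batch size of 128 = 16 * 8.
training_args = TrainingArguments(
    output_dir="./wav2vec2-xlsr-finetuned",  # placeholder path
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
)
```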

Also these augmentations from Audiomentations gave a bit more stability:
AddGaussianNoise(min_amplitude=0.0001, max_amplitude=0.005, p=0.2) and PitchShift(min_semitones=-1, max_semitones=2, p=0.2)
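Wired up with audiomentations, that looks like the following sketch (the dummy waveform is just for illustration; in practice you would apply the transform inside your dataset's preprocessing function):

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift

# The two augmentations suggested above, composed so each one fires
# independently with probability 0.2 per clip.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.0001, max_amplitude=0.005, p=0.2),
    PitchShift(min_semitones=-1, max_semitones=2, p=0.2),
])

# Apply to a raw 16 kHz waveform (float32 numpy array).
waveform = np.random.uniform(-1, 1, 16000).astype(np.float32)  # dummy 1 s clip
augmented = augment(samples=waveform, sample_rate=16000)
```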

Keeping shorter utterances should not be a problem either. It's more important to catch any incorrectly transcribed clips, since they can greatly destabilize the CTC loss (if you get an inf loss with ctc_zero_infinity=False, that's the likely cause).
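One way to catch such clips is to screen for non-finite per-clip losses. The sketch below assumes a `model` and a batch-size-1 `dataloader` from your existing training setup (both assumptions, not shown here):

```python
import torch

# Hypothetical screening pass: with ctc_zero_infinity=False, a clip whose
# transcription is too long for CTC to align (often a sign of a bad
# transcript) yields an inf loss, so non-finite losses flag bad data.
model.config.ctc_zero_infinity = False
model.eval()

suspicious = []
with torch.no_grad():
    for idx, batch in enumerate(dataloader):  # batch_size=1 assumed
        loss = model(input_values=batch["input_values"],
                     labels=batch["labels"]).loss
        if not torch.isfinite(loss):
            suspicious.append(idx)

print(f"{len(suspicious)} clips produced a non-finite CTC loss")
```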


anton-l commented on May 19, 2024

@raghav-menon the test data followed the same WER dynamic as the validation one.

I also had very limited success working with noisy speech (YouTube & radio). With a frozen feature encoder, the model stopped converging at around 40 WER, even with hundreds of hours of speech.

At the moment, the most promising pretrained model for noisy speech is Wav2Vec-Robust (https://huggingface.co/facebook/wav2vec2-large-robust), which may or may not work for you, since its training data was English-only.
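For reference, loading that checkpoint with the feature encoder frozen (as in the experiment above) could look like this sketch:

```python
from transformers import Wav2Vec2ForCTC

# Sketch: load the robust checkpoint and freeze the convolutional feature
# encoder so only the transformer layers and CTC head get fine-tuned.
# (Older transformers versions call this freeze_feature_extractor().)
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust")
model.freeze_feature_encoder()
```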


patrickvonplaten commented on May 19, 2024

Hey @raghav-menon,

Did you play around with the hyperparameters a bit to see what works / doesn't work well? One important thing to note with facebook/wav2vec2-large-xlsr-53 is that it was pretrained on read-out audio data, meaning the data was quite clean. Is this also the case for your dataset?

Also, I would definitely keep utterances that are <4s. I usually filter out only utterances that are shorter than 1s (or even keep those as well).
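With 🤗 Datasets, such a duration filter could look like the sketch below (the dataset name and column layout are assumptions; adapt them to your data):

```python
from datasets import Audio, load_dataset

# Sketch: keep everything except utterances shorter than 1 s
# (clips under 4 s stay in, per the advice above).
ds = load_dataset("mozilla-foundation/common_voice_11_0", "tr", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def at_least_one_second(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] >= 1.0

ds = ds.filter(at_least_one_second)
```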

It's very normal that the validation loss goes up again whereas the WER continues going down.

In terms of hyperparameters, I would try playing around with learning_rate, batch_size and hidden_dropout -> those seem to be quite correlated with the final WER. Here is a nice graphic about hyperparameters for fine-tuning the model on Turkish: https://wandb.ai/wandb/xlsr/sweeps/p23j88jo?workspace=user-borisd13
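If you want to run a similar sweep yourself, a minimal W&B config over those three hyperparameters might look like this sketch (the value ranges are illustrative assumptions, not the ones from the linked sweep, and `train` stands in for your own fine-tuning entry point that reads wandb.config):

```python
import wandb

# Hypothetical sweep over the three hyperparameters mentioned above.
sweep_config = {
    "method": "random",
    "metric": {"name": "eval/wer", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 5e-4},
        "per_device_train_batch_size": {"values": [8, 16, 32]},
        "hidden_dropout": {"min": 0.0, "max": 0.3},
    },
}

sweep_id = wandb.sweep(sweep_config, project="xlsr")
wandb.agent(sweep_id, function=train, count=20)  # `train` is assumed
```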

Also it might help quite a bit to use data-augmentation (@anton-l - do you maybe have a link to some good techniques here?)


raghav-menon commented on May 19, 2024

Hello @patrickvonplaten,

Thank you for your response. I did play around with the hyperparameters. With even a slight deviation from the values given, the WER remains at 1 throughout, so not much progress on that end. The data consists of real-time radio transmission recordings and hence is not of studio quality.

I will, as you have mentioned, filter out the utterances that are <1s, keep the rest, and try it. I had also tried pretraining wav2vec2 with untranscribed data, but it looks like even Colab Pro's memory is not enough.

I will let you know how it progresses.

Thanks.

Regards,
Raghav


raghav-menon commented on May 19, 2024

Hello @anton-l,

Thanks for your suggestions. How did the trained model fare on the test data when you experienced increasing validation loss and decreasing WER? Just curious. I did not bother to run the model on the test data, as the model's final WER was 76%, far worse than my HMM-GMM system, with which I obtained a WER of 60. The best WER I had obtained for this data was around 50, with a TDNN architecture into which I had also included a little bit of self-supervised learning. Just to let you know, my data is not studio quality, as these are real-time radio transmission recordings. I am wondering what the impact of noise is on the wav2vec2 feature extractor, as it makes for a huge difference in WER.

I will indeed try out your suggestions and let you know.

Thanks.

Regards,
Raghav

