
Comments (5)

anton-l commented on May 19, 2024

Hi @raghav-menon !
My experience with low-resource languages was quite similar to yours (increasing validation loss + decreasing WER). I'm speculating that this is due to overfitting to high-frequency words.

I found that these hyperparameters work pretty well for most languages in CommonVoice:
attention_dropout=0.2, activation_dropout=0.1, hidden_dropout=0.05, final_dropout=0.1, feat_proj_dropout=0.05, mask_time_prob=0.05, layerdrop=0.04, learning_rate=3e-4, and an effective batch_size of 128 (obtained e.g. with batch_size=16 and gradient_accumulation_steps=8)
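In code, that configuration looks roughly like the sketch below (a minimal sketch assuming the usual Wav2Vec2ForCTC fine-tuning setup from the XLSR blog post; the output_dir is a placeholder, and in a real run you would also pass vocab_size and pad_token_id to match your language's tokenizer):

```python
from transformers import TrainingArguments, Wav2Vec2ForCTC

# Sketch of the suggested configuration; all dropout/masking values
# below are Wav2Vec2Config attributes forwarded via from_pretrained.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.2,
    activation_dropout=0.1,
    hidden_dropout=0.05,
    final_dropout=0.1,
    feat_proj_dropout=0.05,
    mask_time_prob=0.05,
    layerdrop=0.04,
)

# Effective batch size of 128 = 16 * 8.
training_args = TrainingArguments(
    output_dir="./wav2vec2-xlsr-finetuned",  # placeholder path
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
)
```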

Also these augmentations from Audiomentations gave a bit more stability:
AddGaussianNoise(min_amplitude=0.0001, max_amplitude=0.005, p=0.2) and PitchShift(min_semitones=-1, max_semitones=2, p=0.2)
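Wired up with audiomentations, that looks like the following sketch (the dummy waveform is just for illustration; in practice you would apply the transform inside your dataset's preprocessing function):

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift

# The two augmentations suggested above, composed so each one fires
# independently with probability 0.2 per clip.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.0001, max_amplitude=0.005, p=0.2),
    PitchShift(min_semitones=-1, max_semitones=2, p=0.2),
])

# Apply to a raw 16 kHz waveform (float32 numpy array).
waveform = np.random.uniform(-1, 1, 16000).astype(np.float32)  # dummy 1 s clip
augmented = augment(samples=waveform, sample_rate=16000)
```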

Keeping shorter utterances should not be a problem either. It's more important to catch any incorrectly transcribed clips, since they can greatly destabilize the CTC loss (if you get an inf loss with ctc_zero_infinity=False, that's the likely cause).
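One way to catch such clips is to screen for non-finite per-clip losses. The sketch below assumes a `model` and a batch-size-1 `dataloader` from your existing training setup (both assumptions, not shown here):

```python
import torch

# Hypothetical screening pass: with ctc_zero_infinity=False, a clip whose
# transcription is too long for CTC to align (often a sign of a bad
# transcript) yields an inf loss, so non-finite losses flag bad data.
model.config.ctc_zero_infinity = False
model.eval()

suspicious = []
with torch.no_grad():
    for idx, batch in enumerate(dataloader):  # batch_size=1 assumed
        loss = model(input_values=batch["input_values"],
                     labels=batch["labels"]).loss
        if not torch.isfinite(loss):
            suspicious.append(idx)

print(f"{len(suspicious)} clips produced a non-finite CTC loss")
```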


anton-l commented on May 19, 2024

@raghav-menon the test data followed the same WER dynamic as the validation one.

I also had very limited success working with noisy speech (YouTube & radio). With a frozen feature encoder, the model stopped converging at around 40 WER, even with hundreds of hours of speech.

At the moment, the most promising pretrained model for noisy speech is Wav2Vec-Robust (https://huggingface.co/facebook/wav2vec2-large-robust), which may or may not work for you, since its training data was English-only.
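For reference, loading that checkpoint with the feature encoder frozen (as in the experiment above) could look like this sketch:

```python
from transformers import Wav2Vec2ForCTC

# Sketch: load the robust checkpoint and freeze the convolutional feature
# encoder so only the transformer layers and CTC head get fine-tuned.
# (Older transformers versions call this freeze_feature_extractor().)
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-robust")
model.freeze_feature_encoder()
```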


patrickvonplaten commented on May 19, 2024

Hey @raghav-menon,

Did you play around with the hyperparameters a bit to see what works / doesn't work well? One important thing to note with facebook/wav2vec2-large-xlsr-53 is that it was pretrained on read-out audio data, meaning the data was quite clean. Is this also the case for your dataset?

Also, I would definitely keep utterances that are <4s. I usually filter out only utterances that are shorter than 1s (or even keep those as well).
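With 🤗 Datasets, such a duration filter could look like the sketch below (the dataset name and column layout are assumptions; adapt them to your data):

```python
from datasets import Audio, load_dataset

# Sketch: keep everything except utterances shorter than 1 s
# (clips under 4 s stay in, per the advice above).
ds = load_dataset("mozilla-foundation/common_voice_11_0", "tr", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def at_least_one_second(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] >= 1.0

ds = ds.filter(at_least_one_second)
```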

It's very normal that the validation loss goes up again whereas the WER continues going down.

In terms of hyperparameters, I would try playing around with learning_rate, batch_size and hidden_dropout -> those seem to be quite correlated with the final WER. Here is a nice graphic about hyperparameters for fine-tuning the model on Turkish: https://wandb.ai/wandb/xlsr/sweeps/p23j88jo?workspace=user-borisd13
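If you want to run a similar sweep yourself, a minimal W&B config over those three hyperparameters might look like this sketch (the value ranges are illustrative assumptions, not the ones from the linked sweep, and `train` stands in for your own fine-tuning entry point that reads wandb.config):

```python
import wandb

# Hypothetical sweep over the three hyperparameters mentioned above.
sweep_config = {
    "method": "random",
    "metric": {"name": "eval/wer", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 5e-4},
        "per_device_train_batch_size": {"values": [8, 16, 32]},
        "hidden_dropout": {"min": 0.0, "max": 0.3},
    },
}

sweep_id = wandb.sweep(sweep_config, project="xlsr")
wandb.agent(sweep_id, function=train, count=20)  # `train` is assumed
```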

Also it might help quite a bit to use data-augmentation (@anton-l - do you maybe have a link to some good techniques here?)


raghav-menon commented on May 19, 2024

Hello @patrickvonplaten,

Thank you for your response. I did play around with the hyperparameters. With even a slight deviation from the values given, the WER remains at 1 throughout, so not much progress on that end. The data consists of real-time radio transmission recordings and hence is not of studio quality.

I will, as you have mentioned, filter out the utterances that are <1s, keep the rest, and try it. I had also tried pretraining wav2vec2 with untranscribed data, but it looks like even Colab Pro's memory is not enough.

I will let you know how it progresses.

Thanks.

Regards,
Raghav


raghav-menon commented on May 19, 2024

Hello @anton-l,

Thanks for your suggestions. How did the trained model fare on the test data when you experienced increasing validation loss and decreasing WER? Just curious. I did not bother to run the model on the test data, as the model's final WER was 76%, far worse than my HMM-GMM system, with which I obtained a WER of 60. The best WER I had obtained for this data was around 50, with a TDNN architecture into which I had also included a little bit of self-supervised learning. Just to let you know, my data is not studio quality, as these are real-time radio transmission recordings. I am wondering what the impact of noise is on the wav2vec2 feature extractor, as it makes for a huge difference in WER.

I will indeed try out your suggestions and let you know.

Thanks.

Regards,
Raghav

