qoul

Qoul is an experimental automatic speech recognition (ASR) system for Urdu built using Kaldi. While end-to-end ASR systems are generally understood to give the best bang for the buck, an HMM-GMM based approach is quite promising for low-resource languages. There have been reports of people achieving a respectable word error rate (WER) with just ~20-50 hours of speech data.

Learn More

To grok the inner workings of an HMM-GMM based ASR system, the following three articles are recommended.

Kaldi: A Frustration

As much as Kaldi is hailed as an outstanding ASR toolkit, it has one major problem: the documentation is indecipherable and perplexing for beginners. It takes hours upon hours of reading only to realize you're getting nowhere. Yes, there are some good tutorials, but they are devoid of any explanation and only list a series of steps to follow. This repository is an effort to fill that gap, and I hope it'll save a few people quite a few headaches.

Speech Data

We used three different speech corpora. One was developed by us at FAST-NUCES, while the other two are publicly available: the CSALT ITU Corpus and the RUMI Corpus. We do not own the FAST-NUCES corpus and have therefore removed it before publishing this repository. You can still use this repository with the public corpora.

Explanation

Kaldi works best with a specific directory structure with specific files at specific places. Here's the structure that we're using:

.
├── audio
│   ├── csalt
│   ├── csalt_transcription.txt
│   ├── rumi
│   └── rumi_transcription.txt
├── conf
│   ├── mfcc.conf
│   └── decode.config
├── data
│   ├── local
│   │   ├── corpus.txt
│   │   └── lang
│   │       └── lexicon.txt
│   ├── test
│   └── train
├── local
│   └── score.sh
├── cmd.sh
├── path.sh
├── prep.sh
├── mfcc_cmvn.sh
├── mono.sh
├── tri1.sh
├── tri2.sh
├── tri3.sh
├── sgmm2.sh
├── mmi_sgmm2.sh
└── qoul.ipynb

audio - Contains our speech data as wav files sampled at 16000 Hz. Also contains the transcription files, which map each wav file to its corresponding text. audio/csalt & audio/rumi are placeholder directories in which you should place the wav files from the public download links in the above section. Refer below for further instructions.

conf - Contains configuration files. By default it contains mfcc.conf & decode.config, which hold the parameters for MFCC feature extraction and for decoding, respectively.

data/local - Contains corpus.txt, a giant corpus of Urdu sentences used to build our language model (using SRILM). data/local/lang/lexicon.txt is a file that maps each and every word in our speech data to a phonetic transcription.

local - Contains score.sh, a script used in decoding. It is copied from the wsj example provided by Kaldi.

cmd.sh & path.sh - Contain some tunable parameters and paths for the following scripts.

prep.sh - Validates the data directory and creates a language model.

mono.sh, tri1.sh, tri2.sh, tri3.sh, sgmm2.sh, mmi_sgmm2.sh - Scripts to train different ASR models.

qoul.ipynb - A notebook that you can run to train a Kaldi model. Refer below for further instructions.

NOTE: A plethora of directories and files will be created when training Kaldi. The above structure is just a preset required by Kaldi.
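
Since lexicon.txt has to cover every word in the speech data, a quick coverage check can save a failed training run. The following is a minimal sketch, assuming the common Kaldi lexicon layout of one entry per line with the word followed by whitespace-separated phones. The function names and the sample entries are made up for illustration and are not taken from this repository.

```python
def load_lexicon(lines):
    """Parse lexicon lines of the assumed form '<word> <phone> <phone> ...'."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phones = parts[0], parts[1:]
        # A word may have several pronunciations; keep all of them.
        lexicon.setdefault(word, []).append(phones)
    return lexicon


def find_oov(sentences, lexicon):
    """Return the sorted list of words that have no lexicon entry."""
    oov = set()
    for sentence in sentences:
        for word in sentence.split():
            if word not in lexicon:
                oov.add(word)
    return sorted(oov)


# Hypothetical sample data, not taken from the real lexicon or transcriptions.
sample_lexicon = ["kitab K I T A B", "ghar G H A R", "ghar G A R"]
sample_corpus = ["kitab ghar", "ghar mein kitab"]

lexicon = load_lexicon(sample_lexicon)
print(find_oov(sample_corpus, lexicon))
```

To try it on real files, pass the lines of data/local/lang/lexicon.txt to load_lexicon and the transcript text to find_oov. How the transcript text is laid out inside the transcription files is not covered here, so any file ID columns would need to be stripped first.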

Instructions (Ubuntu 16.04+)

Installing Dependencies

sudo apt install automake autoconf sox libtool subversion gawk

Installing Kaldi

You need to download and compile Kaldi. (Psst, it takes awfully long.) Follow this link or just execute the commands below.

git clone https://github.com/kaldi-asr/kaldi.git
cd kaldi/tools
extras/install_mkl.sh
extras/check_dependencies.sh
make -j 4

If extras/check_dependencies.sh complains about any missing dependencies, fix them first. Next, run nproc to find out how many processors you have and pass a reasonable number to make via the -j flag, as above.

cd ../src/
./configure
make depend -j 4
make -j 4

As mentioned above, adjust the -j value passed to the make commands to suit your hardware.

Installing SRILM

We're using SRILM to build our language model. By default, we build a 3-gram model. You can change the order through the lm_order variable in cmd.sh.

Download srilm.tgz available in this repository and copy it to kaldi/tools. Then cd into kaldi/tools and run the following script.

./install_srilm.sh
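
For intuition about what the 3-gram language model mentioned above captures, here is a toy sketch that counts n-grams and computes an unsmoothed relative-frequency estimate. It is not how SRILM works internally: SRILM applies discounting and backoff, whereas this uses raw counts. The padding with start symbols and the function names are assumptions made for the example.

```python
from collections import Counter


def count_ngrams(sentences, order=3):
    """Count n-grams of the given order, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        # Pad so that the first real word also has a full-length history.
        tokens = ["<s>"] * (order - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


def mle_probability(counts, context, word):
    """Relative frequency of `word` after `context`, with no smoothing."""
    context = tuple(context)
    context_total = sum(c for ngram, c in counts.items() if ngram[:-1] == context)
    if context_total == 0:
        return 0.0  # unseen context; a real toolkit would back off instead
    return counts[context + (word,)] / context_total


# Toy sentences standing in for lines of a text corpus.
sentences = ["a b c", "a b d"]
trigrams = count_ngrams(sentences, order=3)
print(mle_probability(trigrams, ("a", "b"), "c"))
```

Raising the order gives the model a longer history to condition on, at the cost of sparser counts, which is the trade-off behind tuning lm_order.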

Cloning Qoul

Make sure to cd into kaldi/egs and run the following:

git clone git@github.com:parkerqueen/qoul.git
cd qoul

Downloading Speech Data

The transcription files for both corpora are already included in the repository. You just need to place the wav files as follows:

  • CSALT:
    • Visit the download link to download the corpus. After extraction, you will find the wav files at Recordings-Continuous/wav. Copy all of them into audio/csalt.
  • RUMI:
    • Download this Google Drive folder and copy all the directories inside Corpus/Recordings into audio/rumi.
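
Once the wav files are in place, it can be worth confirming that they really are sampled at 16000 Hz, as this repository expects. The following is a minimal sketch using Python's standard wave module, which reads only PCM wav files and raises wave.Error on anything else. The function name find_wrong_rate is made up for this example.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16000  # Hz, the sample rate this repository's audio is stated to use


def find_wrong_rate(root, expected_rate=EXPECTED_RATE):
    """Return (path, rate) pairs for wav files under root with a different sample rate."""
    mismatches = []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatches.append((str(path), rate))
    return mismatches


if __name__ == "__main__":
    # Directory names follow the layout described in this README.
    for corpus_dir in ("audio/csalt", "audio/rumi"):
        for path, rate in find_wrong_rate(corpus_dir):
            print(f"{path}: {rate} Hz")
```

Any file it reports could be resampled with sox, which is already among the dependencies installed above.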

Running Notebook

Launch and run the notebook qoul.ipynb to train different ASR models for Urdu.

Running the Model

You may make use of this repository published by a friend of mine. One needs ample understanding of Kaldi mechanics to actually transcribe an audio file using a pre-trained model. If you want a quick and dirty way, however, do yourself a favor and use the repository I linked.

Evaluation

Note that the following scores are for a model trained on the three corpora mentioned above, so your scores might differ.

| Model | Minimum WER | Minimum SER |
|-----------|--------|--------|
| Monophone | 33.24% | 64.26% |
| Tri-1 | 16.70% | 52.41% |
| Tri-2 | 16.74% | 52.80% |
| Tri-3 | 13.63% | 52.62% |
| SGMM2 | N/A | N/A |
| MMI_SGMM2 | N/A | N/A |

After training our monophone system, we were slightly disappointed, considering we had about 100 hours of data and there have previously been reports of Kaldi models achieving lower WERs on much smaller corpora. Yet, we took consolation in the fact that we were able to achieve something, despite the depressive episodes Kaldi led us into.

Tri-1 was a huge, mammoth, gargantuan improvement over the monophone system. We saw an almost 50% relative reduction in WER (33.24% to 16.70%), while the SER dropped to ~52%. Tri-2 performed on par with Tri-1 and brought no improvement. Tri-3 took our WER down to 13.63%, the lowest we've encountered so far; the SER, however, barely moved. To clarify, all of the WERs and SERs mentioned here should be read as "Minimum (S/W)ER".
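
For readers new to these metrics, the sketch below shows how WER and SER (sentence error rate) are conventionally defined, and how the "almost 50%" figure above follows from the table. The numbers in the table came from Kaldi's scoring script, not from this code. The function names and the toy sentences are made up for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rates(pairs):
    """Return (WER, SER) in percent for a list of (reference, hypothesis) strings."""
    word_errors = total_words = sentence_errors = 0
    for reference, hypothesis in pairs:
        ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
        errors = edit_distance(ref_tokens, hyp_tokens)
        word_errors += errors
        total_words += len(ref_tokens)
        sentence_errors += errors > 0  # a sentence counts as wrong if any word is wrong
    return 100.0 * word_errors / total_words, 100.0 * sentence_errors / len(pairs)


def relative_reduction(before, after):
    """Relative improvement in percent when an error rate drops from before to after."""
    return 100.0 * (before - after) / before


# Toy example: the second hypothesis has one substitution and one deletion.
pairs = [("a b c d", "a b c d"), ("a b c d", "a x c")]
wer, ser = error_rates(pairs)
print(f"WER {wer:.2f}%  SER {ser:.2f}%")
print(f"Monophone to Tri-1: {relative_reduction(33.24, 16.70):.1f}% relative WER reduction")
```

Because a single wrong word marks the whole sentence as an error, SER can stay flat while WER keeps falling, which is consistent with the Tri-1 to Tri-3 rows above.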

Contributors

Acknowledgements
