vosk's Introduction

For the Kaldi API for Android and Linux, please see Vosk API. This is a server project.

This is Vosk, the lifelong speech recognition system.

Concepts

As of 2019, neural network based speech recognizers are quite limited in the amount of speech data they can use in training, and they require enormous computing power and time to train and to optimize their parameters. Neural networks struggle with human-like one-shot learning, and their decisions are not very robust to unseen conditions and are hard to understand and correct.

That is why we decided to build a system based on the concept of a large signal database. We apply an audio fingerprinting scheme: the audio is segmented into chunks, and the chunks are stored in the database keyed by their LSH (locality-sensitive hashing) hash value. During decoding we simply look up the chunks in the database to find out which phones are possible. That helps us make a proper decision about the decoding results.
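The store-and-lookup idea can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes each chunk is already a fixed-length feature vector, uses random-hyperplane LSH as one possible hashing scheme, and the names `ChunkIndex`, `add` and `lookup` are hypothetical.

```python
import random
from collections import Counter, defaultdict


class ChunkIndex:
    """Toy LSH index (hypothetical): hash of a feature chunk -> phone counts."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        # One random hyperplane per hash bit (random-projection LSH).
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.buckets = defaultdict(Counter)

    def hash(self, chunk):
        # Each bit records on which side of a hyperplane the chunk lies.
        key = 0
        for plane in self.planes:
            dot = sum(p * x for p, x in zip(plane, chunk))
            key = (key << 1) | (1 if dot >= 0 else 0)
        return key

    def add(self, chunk, phone):
        # Indexing: count the phone label in the chunk's bucket.
        self.buckets[self.hash(chunk)][phone] += 1

    def lookup(self, chunk):
        # Decoding: candidate phones, most frequent first.
        # .get avoids creating empty buckets for unseen hashes.
        counts = self.buckets.get(self.hash(chunk))
        return [p for p, _ in counts.most_common()] if counts else []


index = ChunkIndex(dim=4)
index.add([0.9, 0.1, -0.3, 0.5], "AH")
index.add([0.9, 0.1, -0.3, 0.5], "AH")
print(index.lookup([0.9, 0.1, -0.3, 0.5]))  # prints ['AH']
```

Similar chunks tend to fall on the same side of most hyperplanes and so share a bucket, which is what makes a lookup return phones seen for acoustically close training chunks.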

The advantages of this approach are:

  • We can quickly train on 100000 hours of speech data on very simple hardware
  • We can easily correct recognizer behavior just by adding samples
  • We can make sure that a recognition result is correct by checking that it is sufficiently represented in the training dataset
  • We can parallelize training across thousands of nodes
  • We support the lifelong learning paradigm
  • We can use this method together with more common neural network training to improve recognition accuracy
  • The system is robust against noise

The disadvantages are:

  • The index is really huge; it is not expected to fit in the memory of a single server
  • The generalization capabilities of the model are quite questionable; at the same time, the generalization capabilities of neural networks are also questionable.
  • For now the segmentation requires a conventional ASR system, but in the future we might do the segmentation ourselves.

Things that would be nice to have in the future:

  • Multilingual training
  • Our own segmentation
  • A tool to reduce the model so that it fits on mobile devices
  • Specialized hardware to implement this AI paradigm

Usage

To install the requirements run

pip3 install -r requirements.txt

To prepare the training/verification data, create the following two files:

  • wav.scp: a list that maps utterances to wav files in the filesystem
  • phones.txt: a CTM file with phonemes and timings. It can be a CTM file from alignment or a CTM file from decoding

You can create them with the Kaldi ASR toolkit.
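For orientation, here is a minimal sketch of reading the two files. It assumes the usual Kaldi conventions: each wav list line is `utterance-id path` (piped commands, which Kaldi also allows in wav.scp, are not handled), and each CTM line is `utterance-id channel start duration label [confidence]`. The helper names `parse_wav_list` and `parse_ctm` are hypothetical and are not part of this project.

```python
from collections import namedtuple

# One phone segment from a CTM line (times in seconds).
Segment = namedtuple("Segment", "utt start duration phone")


def parse_wav_list(lines):
    """Map utterance id -> wav path from lines like 'utt1 /data/utt1.wav'."""
    wavs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt, path = line.split(maxsplit=1)
        wavs[utt] = path
    return wavs


def parse_ctm(lines):
    """Parse CTM lines: utt channel start duration label [confidence]."""
    segments = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        utt, _channel, start, duration, phone = fields[:5]
        segments.append(Segment(utt, float(start), float(duration), phone))
    return segments


wavs = parse_wav_list(["utt1 /data/utt1.wav"])
segments = parse_ctm(["utt1 1 0.00 0.12 sil", "utt1 1 0.12 0.08 AH 0.97"])
print(wavs["utt1"], segments[1].phone)
```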

Indexing

To add the data to the database run

python3 index.py wavs-train.txt phones-train.txt data.idx

That will add the data to the database data.idx, or create a new database if it does not exist yet.

Verification

To verify decoding results run

python3 verify.py wavs-test.txt phones-test.txt data.idx

The tool will search for segments in the index and report suspicious segments, which you can check by hand and later add to the database to improve recognition accuracy.
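One plausible way to decide that a segment is suspicious is to compare its hypothesized phone with the phones stored for similar chunks. The sketch below is an assumption for illustration only and does not describe what verify.py actually does; the function `is_suspicious` and the `min_share` threshold are hypothetical.

```python
from collections import Counter


def is_suspicious(hyp_phone, bucket_counts, min_share=0.2):
    """Flag a segment when the index gives little or no support for its phone.

    bucket_counts holds the phone counts found in the index for chunks
    similar to this segment (hypothetical lookup result).
    """
    total = sum(bucket_counts.values())
    if total == 0:
        return True  # nothing similar was ever indexed
    # Share of indexed examples that agree with the hypothesized phone.
    return bucket_counts[hyp_phone] / total < min_share


seen = Counter({"AH": 8, "AA": 2})
print(is_suspicious("AH", seen))  # prints False
print(is_suspicious("IY", seen))  # prints True
```

Segments flagged this way are the ones worth checking by hand; once corrected, adding them to the index raises the support for the right phone.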

Related papers and links
