Sound Commander

A large database of free audio samples (10M words), a test bed for voice activity detection algorithms and for recognition of single-syllable words (single-word commands).

Background

<begin rant>

Open source speech recognition is crap, for two reasons.

Reason 1: Because it tries to do everything:

soundwaves -> phonemes -> textual representation -> meaning

Why not solve a simpler problem?

I do not need my computer to "translate" sounds into text, or to "understand" meaning. Why should we insist on human-style interaction with a computer? We already use mice and keyboards, which are not "native" to human-human interaction. Why not do the same for speech interaction with computers: define and learn a format that is easy for a computer?

I just want to tell my computer a command and have it do something. So I only need:

soundwaves -> label

Reason 2: It is hard for programmers to get started, because training sets are limited and/or hard to obtain. Machine learning algorithms need vast amounts of data to be successful, and freely available datasets just aren't there.

<end rant>

Approach

In this repo I want to experiment with some machine learning algorithms. There are four parts:

I: Generating data sets

The approach is to build commands off the simplest utterances: single-syllable verbs.

How to create a large training set? With text-to-speech programs and varying noise sources. Whether methods trained this way translate to the real world will teach us something about knowledge transfer. First we create the clean utterances: a very large labeled training set is generated (generate.sh) by a text-to-speech program (espeak). For simplicity we work with phonemes instead of words. I selected all English single-syllable words as labels, obtained their phonemes (thanks to BEEP), and generated audio files varying:

  • speaker
  • stress
  • pitch
  • speed
  • (and, further down the pipeline, noise level and noise type)

For each word there are 1500-2000 samples. There are 5153 words in total, 558 in the "verbs" subset.

I pass the audio through Opus (the successor to Speex) at a low constant bitrate. Because this codec is optimized for speech, it hopefully filters out non-speech to some degree. More importantly, it saves disk space.
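
As a minimal sketch of this generation step (illustrative only: the real generate.sh additionally feeds phonemes and stress markings to espeak, and the parameter grids here are assumptions, not the script's actual values):

import itertools
import os
import subprocess

word = "aim"  # example word from the "verbs" subset
os.makedirs("db.verbs/" + word, exist_ok=True)
voices = ["english", "english_rp", "english_wmids"]  # espeak voices (illustrative subset)
pitches = [30, 50, 70]                               # espeak -p, range 0-99
speeds = [120, 140, 175]                             # espeak -s, words per minute

for i, (voice, pitch, speed) in enumerate(itertools.product(voices, pitches, speeds)):
    wav = "db.verbs/%s/%d.wav" % (word, i)
    # synthesize one clean utterance ...
    subprocess.run(["espeak", "-v", voice, "-p", str(pitch), "-s", str(speed),
                    "-w", wav, word], check=True)
    # ... and re-encode it with opus at a low constant bitrate (6 kbit/s here)
    subprocess.run(["opusenc", "--hard-cbr", "--bitrate", "6", wav,
                    "db.verbs/%s/%d" % (word, i)], check=True)
    os.remove(wav)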

II: Detecting voice activity

Training set: We generate a very long audio stream by stitching together random word samples (with quiet times cropped) and random lengths of silence (with varying noise properties). A label stream is generated at the same time, marking the silent periods.
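
A minimal sketch of the stitching idea (the actual preparation is done by trainactivitydetect_prepare.py; the 16 kHz sample rate and gap lengths here are assumptions):

import numpy

rate = 16000  # assumed sample rate

def stitch(clips, noise, rng=numpy.random):
    # clips: list of 1-D word arrays with quiet times cropped; noise: long 1-D array
    stream, labels = [], []
    for clip in clips:
        gap = rng.randint(rate // 4, 2 * rate)             # 0.25-2 s noise-only gap
        start = rng.randint(0, len(noise) - gap)
        stream.append(noise[start:start + gap])
        labels.append(numpy.zeros(gap, dtype=bool))        # label 0: no voice
        stream.append(clip)
        labels.append(numpy.ones(len(clip), dtype=bool))   # label 1: voice active
    return numpy.concatenate(stream), numpy.concatenate(labels)

The per-sample label stream can then be reduced to one label per STFT frame.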

Build with:

python trainactivitydetect_prepare.py  verbs.dog db.dog db.dog.hdf5
python trainactivitydetect_prepare.py  verbs db db.verbs.hdf5

Short-time Fourier transforms (STFT) should be a good input representation for the algorithms.
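
For example with scipy; nperseg=128 yields 65 frequency bins, which matches the 65-column arrays in the log below, but that window length is an inference, not read off the scripts:

import numpy
from scipy import signal

audio = numpy.random.randn(16000)   # stand-in for 1 s of the stitched stream at 16 kHz
freqs, times, Z = signal.stft(audio, fs=16000, nperseg=128)
frames = numpy.abs(Z).T             # shape (nframes, 65): one magnitude spectrum per frame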

Methods:

  • Any supervised classification method can be trained. However, the detector should react with as little latency as possible (recurrent neural networks come to mind).
  • The alternative is dedicated voice activity detection algorithms, such as those implemented in Speex, WebRTC, or https://github.com/voixen/voixen-vad; see the sketch below.
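
For the WebRTC option, the py-webrtcvad bindings offer a simple frame-by-frame interface (a sketch; that library requires 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames, and the file name here is hypothetical):

import webrtcvad

vad = webrtcvad.Vad(3)                   # aggressiveness: 0 (permissive) to 3 (strict)
rate, frame_ms = 16000, 30
nbytes = rate * frame_ms // 1000 * 2     # bytes per 30 ms frame of 16-bit mono PCM
pcm = open("stream.raw", "rb").read()    # hypothetical raw PCM dump of the test stream
for pos in range(0, len(pcm) - nbytes + 1, nbytes):
    active = vad.is_speech(pcm[pos:pos + nbytes], rate)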

Testing algorithms with:

$ python trainactivitydetect_cv.py out.voiceactivity?-???.hdf5
...
memory from the last 280 frames
loading... out.voiceactivity0-0.1.hdf5
   (142763, 65) {u'noisesource': 'airport'}
  attaching memory ...
    selected 200 200
  attaching memory done.
loading... out.voiceactivity0-0.3.hdf5 through out.voiceactivity9-1.0.hdf5
   (same pattern for every combination of noise source and noise level 0.1/0.3/0.5/1.0;
    frames per source: airport 142763, babble 165485, brown 152073, car 153851,
    exhibition 153975, ocean 146957, pink 141459, restaurant 138663, street 139677,
    subway 142587)
data: (15959, 2624) (15959,)

with preprocessor log:

0.87 RandomForest1 (training speed: 15.6s)
confusion matrix: (eval speed: 0.94s)
[[2384  293]
 [ 169 2474]]
ROC curve plot...
5% FPR: at threshold 2.0 with efficiency 0.0
1% FPR: at threshold 2.0 with efficiency 0.0

0.94 RandomForest4 (training speed: 45.4s)
confusion matrix: (eval speed: 0.98s)
[[2564  113]
 [ 159 2484]]
ROC curve plot...
5% FPR: at threshold 0.75 with efficiency 93.98%
1% FPR: at threshold 1.0 with efficiency 80.85%

0.94 RandomForest10 (training speed: 106.0s)
confusion matrix: (eval speed: 1.38s)
[[2583   94]
 [ 106 2537]]
ROC curve plot...
5% FPR: at threshold 0.6 with efficiency 95.99%
1% FPR: at threshold 0.8 with efficiency 89.94%

0.94 RandomForest40 (training speed: 326.7s)
confusion matrix: (eval speed: 2.95s)
[[2613   64]
 [  66 2577]]
ROC curve plot...
5% FPR: at threshold 0.45 with efficiency 98.45%
1% FPR: at threshold 0.625 with efficiency 95.95%


0.94 AdaBoost (training speed: 1808.8s)
confusion matrix: (eval speed: 8.24s)
[[2570  107]
 [ 102 2541]]
ROC curve plot...
5% FPR: at threshold 0.496241695927 with efficiency 96.78 %
1% FPR: at threshold 0.51368077231 with efficiency 93.15 %

0.93 GradientBoosting (training speed: 3338.6s)
confusion matrix: (eval speed: 1.06s)
[[2522  155]
 [  76 2567]]
ROC curve plot...
5% FPR: at threshold 0.536277698584 with efficiency 96.78 %
1% FPR: at threshold 0.764611974148 with efficiency 92.47 %

0.78 MLP2 (training speed: 238.6s)
confusion matrix: (eval speed: 1.97s)
[[2498  179]
 [ 196 2447]]
ROC curve plot...
5% FPR: at threshold 0.52487468149 with efficiency 91.87 %
1% FPR: at threshold 0.634618783147 with efficiency 86.76 %

0.91 MLP10 (training speed: 306.5s)
confusion matrix: (eval speed: 3.29s)
[[2671    6]
 [ 545 2098]]
ROC curve plot...
5% FPR: at threshold 0.269476241912 with efficiency 90.39 %
1% FPR: at threshold 0.361535546018 with efficiency 85.74 %

0.91 MLP40 (training speed: 724.1s)
confusion matrix: (eval speed: 7.18s)
[[2538  139]
 [ 198 2445]]
ROC curve plot...
5% FPR: at threshold 0.506125804557 with efficiency 92.40 %
1% FPR: at threshold 0.641257552208 with efficiency 87.44 %

with preprocessor logscale:

0.69 MLP2 (training speed: 256.3s)
confusion matrix: (eval speed: 2.26s)
[[ 225 2452]
 [   0 2643]]
ROC curve plot...
5% FPR: at threshold 1.52995962594 with efficiency 0.0 %
1% FPR: at threshold 1.52995962594 with efficiency 0.0 %

0.92 MLP10 (training speed: 476.6s)
confusion matrix: (eval speed: 3.61s)
[[2624   53]
 [ 315 2328]]
ROC curve plot...
5% FPR: at threshold 0.432683623862 with efficiency 91.49 %
1% FPR: at threshold 0.546176133175 with efficiency 86.11 %

0.94 MLP40 (training speed: 644.6s)
confusion matrix: (eval speed: 6.99s)
[[2653   24]
 [ 370 2273]]
ROC curve plot...
5% FPR: at threshold 0.376385485975 with efficiency 91.56 %
1% FPR: at threshold 0.496414547621 with efficiency 86.08 %

In conclusion, Random Forest (40 estimators) does a good job. For low false-positive rates, the probability score threshold needs to be set above 0.625.

MLPs achieve comparable quality.
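
With scikit-learn (which the classifier names above suggest, though that is an assumption), applying this operating point looks as follows; the data arrays are placeholders standing in for the STFT rows from the hdf5 files:

import numpy
from sklearn.ensemble import RandomForestClassifier

# placeholder data in lieu of the real STFT feature rows and voice labels
X_train = numpy.random.randn(1000, 65)
y_train = numpy.random.randint(0, 2, 1000)
X_test = numpy.random.randn(200, 65)

clf = RandomForestClassifier(n_estimators=40)
clf.fit(X_train, y_train)
# column 1 of predict_proba is the voice score; the > 0.625 cut corresponds
# to the ~1% false-positive-rate operating point quoted above
voice = clf.predict_proba(X_test)[:, 1] > 0.625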

III: Detecting words (TODO)

Training set: Generated in Part I, but converted to STFT spectrograms.

Methods:

  • Any supervised classification machine learning method can be trained. Convolutional neural networks could be useful.

IV: Putting it together (TODO)

The best method of Part II should detect voice; the audio segment can then be isolated by cropping before and after it. The best method of Part III is then applied to this audio piece, identifying the spoken word. Finally, the command associated with that label is executed.

For example, the computer could simply say the identified word back, run a program, shut down, change the speaker volume, etc.
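
As a sketch of the glue code (every name here is a hypothetical placeholder: record() yields fixed-size audio chunks, and detect_voice and classify_word stand for the best models of Parts II and III):

import numpy

def command_loop(record, detect_voice, classify_word, commands):
    buffer = []
    for chunk in record():
        if detect_voice(chunk):
            buffer.append(chunk)        # voice active: keep collecting
        elif buffer:
            # voice just ended: classify the cropped utterance and act on it
            word = classify_word(numpy.concatenate(buffer))
            buffer = []
            if word in commands:
                commands[word]()        # e.g. echo the word, change volume, shut down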

How to use

Generating the dataset

Use generateverbs.sh to generate the sound files for the "verbs" subset:

$ bash generateverbs.sh

This will create files like db.verbs/<word>/<variant>, where for example word="aim" and variant="132" is one pronunciation of that word. These are Opus sound files. Generation takes a while (about 10 minutes per word). You need opusenc, espeak, mbrola and sox. The database will take approximately 10MB per word.

To check an example pronunciation, play a random word like this:

$ w=$(ls db|sort -R|head -n1); p=$(ls db/$w|sort -R|head -n1); play db/$w/$p

Some words are pretty hard to understand, some are quite easy. Don't think of these classes as words. Think of them as groups of utterances that humans and computers can both agree to assign a single meaning to.

To find out how a given variant (e.g. 753) was produced (which speaker, pitch, speed), use:

$ bash versionnames.sh |grep -w 753
753 english_wmids 140 70

Importing the dataset into Python

To create a numpy array containing the spectrograms, run:

$ bash generatenoise.sh   # generate or download the noise files that will be added
$ python traincommanddetect.py

The output file is db.verbs.npz, a compressed numpy archive with the keys "audiodata" and "labels". The data shape is (nsamples=many, nframes=30, nspectralbins=513).
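
Loading it back is a one-liner (shapes as stated above):

import numpy

db = numpy.load("db.verbs.npz")
X, y = db["audiodata"], db["labels"]   # X has shape (nsamples, 30, 513)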

Training the method

Your turn. Go wild.
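
As one possible starting point, here is a minimal, untuned sketch on flattened spectrograms (an illustration, not a recommendation from this repo):

import numpy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

db = numpy.load("db.verbs.npz")
X = db["audiodata"].reshape(len(db["labels"]), -1)   # flatten (30, 513) per sample
X_train, X_test, y_train, y_test = train_test_split(X, db["labels"], test_size=0.2)
clf = RandomForestClassifier(n_estimators=40).fit(X_train, y_train)
print(clf.score(X_test, y_test))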

Issues

sox problem

Dear @JohannesBuchner,
I've gotten results from the command detection and they have been amazing, but I have a little problem with trainactivitydetect_prepare.py: I am facing the following error:
"sox FAIL formats: sox FAIL formats: can't determine type of file 'db.verbs/verb name/verb number'"
I've searched the net and found that sox doesn't seem to handle Opus files correctly.
My question is: how did you cope with that problem?

Can the synthesized command dataset work, or not?

Was this project finished successfully?
Is there any conclusion about using a synthesized dataset to train a model?
I am thinking about doing a similar experiment to this project and hope somebody can give some suggestions.
Thx~

Dataset

Dear @JohannesBuchner,
I really appreciate your great solution to simplify the problem of voice recognition.
I know such a dataset is invaluable, but I'm wondering: would it be possible to share your great dataset for academic use only?
