castorini / honk Goto Github PK

PyTorch implementations of neural network models for keyword spotting

License: MIT License

Python 58.27% Shell 0.14% Jupyter Notebook 41.58%

honk's Introduction

Honk: CNNs for Keyword Spotting

Honk is a PyTorch reimplementation of Google's TensorFlow convolutional neural networks for keyword spotting, which accompanies the recent release of their Speech Commands Dataset. For more details, please consult our writeup:

Raphael Tang, Jimmy Lin. Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting. arXiv:1710.06554, October 2017.
Raphael Tang, Jimmy Lin. Deep Residual Learning for Small-Footprint Keyword Spotting. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5479-5483.

Honk is useful for building on-device speech recognition capabilities for interactive intelligent agents. Our code can be used to identify simple commands (e.g., "stop" and "go") and be adapted to detect custom "command triggers" (e.g., "Hey Siri!").

Check out this video for a demo of Honk in action!

Demo Application

Use the instructions below to run the demo application (shown in the above video) yourself!

Currently, PyTorch has official support for only Linux and OS X. Thus, Windows users will not be able to run this demo easily.

To deploy the demo, run the following commands:

If you do not have PyTorch, please see the website.
Install Python dependencies: pip install -r requirements.txt
Install GLUT (OpenGL Utility Toolkit) through your package manager (e.g. apt-get install freeglut3-dev)
Fetch the data and models: ./fetch_data.sh
Start the PyTorch server: python .
Run the demo: python utils/speech_demo.py

If you need to adjust options, like turning off CUDA, please edit config.json.

Additional notes for Mac OS X:

GLUT is already installed on Mac OS X, so that step isn't needed.
If you have issues installing pyaudio, this may be the issue.

Server

Setup and deployment

python . deploys the web service for identifying if audio contain the command word. By default, config.json is used for configuration, but that can be changed with --config=<file_name>. If the server is behind a firewall, one workflow is to create an SSH tunnel and use port forwarding with the port specified in config (default 16888).

In our honk-models repository, there are several pre-trained models for Caffe2 (ONNX) and PyTorch. The fetch_data.sh script fetches these models and extracts them to the model directory. You may specify which model and backend to use in the config file's model_path and backend, respectively. Specifically, backend can be either caffe2 or pytorch, depending on what format model_path is in. Note that, in order to run our ONNX models, the packages onnx and onnx_caffe2 must be present on your system; these are absent in requirements.txt.

Raspberry Pi (RPi) Infrastructure Setup

Unfortunately, getting the libraries to work on the RPi, especially librosa, isn't as straightforward as running a few commands. We outline our process, which may or may not work for you.

Obtain an RPi, preferably an RPi 3 Model B running Raspbian. Specifically, we used this version of Raspbian Stretch.
Install dependencies: sudo apt-get install -y protobuf-compiler libprotoc-dev python-numpy python-pyaudio python-scipy python-sklearn
Install Protobuf: pip install protobuf
Install ONNX without dependencies: pip install --no-deps onnx
Follow the official instructions for installing Caffe2 on Raspbian. This process takes about two hours. You may need to add the caffe2 module path to the PYTHONPATH environment variable. For us, this was accomplished by export PYTHONPATH=$PYTHONPATH:/home/pi/caffe2/build
Install the ONNX extension for Caffe2: pip install onnx-caffe2
Install further requirements: pip install -r requirements_rpi.txt
Install librosa: pip install --no-deps resampy librosa
Try importing librosa: python -c "import librosa". It should throw an error regarding numba, since we haven't installed it.
We haven't found a way to easily install numba on the RPi, so we need to remove it from resampy. For our setup, we needed to remove numba and @numba.jit from /home/pi/.local/lib/python2.7/site-packages/resampy/interpn.py
All dependencies should now be installed. We should try deploying an ONNX model.
Fetch the models and data: ./fetch_data.sh
In config.json, change backend to caffe2 and model_path to model/google-speech-dataset-full.onnx.
Deploy the server: python . If there are no errors, you have successfully deployed the model, accessible via port 16888 by default.
Run the speech commands demo: python utils/speech_demo.py. You'll need a working microphone and speakers. If you're interacting with your RPi remotely, you can run the speech demo locally and specify the remote endpoint --server-endpoint=http://[RPi IP address]:16888.

Utilities

QA client

Unfortunately, the QA client has no support for the general public yet, since it requires a custom QA service. However, it can still be used to retarget the command keyword.

python client.py runs the QA client. You may retarget a keyword by doing python client.py --mode=retarget. Please note that text-to-speech may not work well on Linux distros; in this case, please supply IBM Watson credentials via --watson-username and --watson--password. You can view all the options by doing python client.py -h.

Training and evaluating the model

CNN models. python -m utils.train --type [train|eval] trains or evaluates the model. It expects all training examples to follow the same format as that of Speech Commands Dataset. The recommended workflow is to download the dataset and add custom keywords, since the dataset already contains many useful audio samples and background noise.

Residual models. We recommend the following hyperparameters for training any of our res{8,15,26}[-narrow] models on the Speech Commands Dataset:

python -m utils.train --wanted_words yes no up down left right on off stop go --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --model res{8,15,26}[-narrow]

For more information about our deep residual models, please see our paper:

Raphael Tang, Jimmy Lin. Deep Residual Learning for Small-Footprint Keyword Spotting. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), April 2018, Calgary, Alberta, Canada.

There are command options available:

option	input format	default	description
`--audio_preprocess_type`	{MFCCs, PCEN}	MFCCs	type of audio preprocess to use
`--batch_size`	[1, n)	100	the mini-batch size to use
`--cache_size`	[0, inf)	32768	number of items in audio cache, consumes around 32 KB * n
`--conv1_pool`	[1, inf) [1, inf)	2 2	the width and height of the pool filter
`--conv1_size`	[1, inf) [1, inf)	10 4	the width and height of the conv filter
`--conv1_stride`	[1, inf) [1, inf)	1 1	the width and length of the stride
`--conv2_pool`	[1, inf) [1, inf)	1 1	the width and height of the pool filter
`--conv2_size`	[1, inf) [1, inf)	10 4	the width and height of the conv filter
`--conv2_stride`	[1, inf) [1, inf)	1 1	the width and length of the stride
`--data_folder`	string	/data/speech_dataset	path to data
`--dev_every`	[1, inf)	10	dev interval in terms of epochs
`--dev_pct`	[0, 100]	10	percentage of total set to use for dev
`--dropout_prob`	[0.0, 1.0)	0.5	the dropout rate to use
`--gpu_no`	[-1, n]	1	the gpu to use
`--group_speakers_by_id`	{true, false}	true	whether to group speakers across train/dev/test
`--input_file`	string		the path to the model to load
`--input_length`	[1, inf)	16000	the length of the audio
`--lr`	(0.0, inf)	{0.1, 0.001}	the learning rate to use
`--type`	{train, eval}	train	the mode to use
`--model`	string	cnn-trad-pool2	one of `cnn-trad-pool2`, `cnn-tstride-{2,4,8}`, `cnn-tpool{2,3}`, `cnn-one-fpool3`, `cnn-one-fstride{4,8}`, `res{8,15,26}[-narrow]`, `cnn-trad-fpool3`, `cnn-one-stride1`
`--momentum`	[0.0, 1.0)	0.9	the momentum to use for SGD
`--n_dct_filters`	[1, inf)	40	the number of DCT bases to use
`--n_epochs`	[0, inf)	500	number of epochs
`--n_feature_maps`	[1, inf)	{19, 45}	the number of feature maps to use for the residual architecture
`--n_feature_maps1`	[1, inf)	64	the number of feature maps for conv net 1
`--n_feature_maps2`	[1, inf)	64	the number of feature maps for conv net 2
`--n_labels`	[1, n)	4	the number of labels to use
`--n_layers`	[1, inf)	{6, 13, 24}	the number of convolution layers for the residual architecture
`--n_mels`	[1, inf)	40	the number of Mel filters to use
`--no_cuda`	switch	false	whether to use CUDA
`--noise_prob`	[0.0, 1.0]	0.8	the probability of mixing with noise
`--output_file`	string	model/google-speech-dataset.pt	the file to save the model to
`--seed`	(inf, inf)	0	the seed to use
`--silence_prob`	[0.0, 1.0]	0.1	the probability of picking silence
`--test_pct`	[0, 100]	10	percentage of total set to use for testing
`--timeshift_ms`	[0, inf)	100	time in milliseconds to shift the audio randomly
`--train_pct`	[0, 100]	80	percentage of total set to use for training
`--unknown_prob`	[0.0, 1.0]	0.1	the probability of picking an unknown word
`--wanted_words`	string1 string2 ... stringn	command random	the desired target words

JavaScript-based Keyword Spotting

Honkling is a JavaScript implementation of Honk. With Honkling, it is possible to implement various web applications with in-browser keyword spotting functionality.

Keyword Spotting Data Generator

In order to improve the flexibility of Honk and Honkling, we provide a program that constructs a dataset from youtube videos. Details can be found in keyword_spotting_data_generator folder

Recording audio

You may do the following to record sequential audio and save to the same format as that of speech command dataset:

python -m utils.record

Input return to record, up arrow to undo, and "q" to finish. After one second of silence, recording automatically halts.

Several options are available:

--output-begin-index: Starting sequence number
--output-prefix: Prefix of the output audio sequence
--post-process: How the audio samples should be post-processed. One or more of "trim" and "discard_true".

Post-processing consists of trimming or discarding "useless" audio. Trimming is self-explanatory: the audio recordings are trimmed to the loudest window of x milliseconds, specified by --cutoff-ms. Discarding "useless" audio (discard_true) uses a pre-trained model to determine which samples are confusing, discarding correctly labeled ones. The pre-trained model and correct label are defined by --config and --correct-label, respectively.

For example, consider python -m utils.record --post-process trim discard_true --correct-label no --config config.json. In this case, the utility records a sequence of speech snippets, trims them to one second, and finally discards those not labeled "no" by the model in config.json.

Listening to sound level

python manage_audio.py listen

This assists in setting sane values for --min-sound-lvl for recording.

Generating contrastive examples

python manage_audio.py generate-contrastive --directory [directory] generates contrastive examples from all .wav files in [directory] using phonetic segmentation.

Trimming audio

Speech command dataset contains one-second-long snippets of audio.

python manage_audio.py trim --directory [directory] trims to the loudest one-second for all .wav files in [directory]. The careful user should manually check all audio samples using an audio editor like Audacity.

honk's People

Contributors

Stargazers

Watchers

Forkers

zumbalamambo nieshaoshuai zhouyonglong grseb9s meowfei entn-at fatblue vikingmew chenglongchen thuwyh shubhampachori12110095 stevenlol arunpatala shichaoji aihill szhaomsft qaavi gaoyiyeah zhly0 jiangyangbo msobroza zhaoforever hbcbh1999 mingyang1996 liaorichard jingyonghou bilaldendani hellcoderz stocyr x389liu zuurw flyahead bobzengscut by2101 onucharles chenchy xinkez durgesh92 ruofeidu superever maggie0830 davidhli xdcesc wzwietering codeaudit ljj7975 beyondboy hellosaumil kumarkarun byfaith mekchone chizhang0814 yangdongchn cod3r0k ltcxjtu qhduan px4ugs davezqq peter974 xushoucai aaab8b orangebaowang smrk007 ajilim colinsongf dzungtx sciasdez sworborno runngezhang limyeonsoo sevencheng798 yh646492956 oziee abrutti xrick afd77 maryamnajafian nzennnaki jiegev5 satishpas2 ma521dy satishrlt xuhao1 jcambre yosshor zloop1982 elenazy starhaox proling1994 stevenchang8 shaheenkdr weimingtom appalachianwine gavin90s stuartiannaylor raspberry4faith daretowin aikalix skubicius guelermus

honk's Issues

How to train to reproduce the performance

Thanks for your work.
If I want use cnn-trad-pool2 model to train, what's the exact training parameters.

Does the following command from your readme is suitable for cnn-trad-pool2 model?

python -m utils.train --wanted_words yes no up down left right on off stop go --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --model res{8,15,26}[-narrow]

Thanks.

Where is the google-speech-dataset.pt ?

There is not a google-speech-dataset.pt which needed by the server.py in model.zip. I tried to get it by "python -m utils.train --data_folder ./training_data --type train", but I got a "model.pt" . Even I rename it and run "python .", an error will arise as below.

RuntimeError: Error(s) in loading state_dict for SpeechModel:
size mismatch for output.weight: copying a param of torch.Size([12, 26624]) from checkpoihe shape is torch.Size([4, 26624]) in current model.
size mismatch for output.bias: copying a param of torch.Size([12]) from checkpoint, whereis torch.Size([4]) in current model.

Caffe model

Hi
Can you please publish caffe prototxt and a trained caffe model?
ONNX does not support some operations and i am intersting in your output prototxt and the way you solve it.

Thanks!

ModuleNotFoundError: No module named 'aeneas.executetask'

!git clone https://github.com/readbeyond/aeneas.git
!git clone https://github.com/castorini/honk.git

!apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg

!mv /content/gdrive/My\ Drive/Python/search.py .

!python ../honk/keyword_spotting_data_generator/keyword_data_generator.py -k "give" -s 4

then i got an Error named No module named 'aeneas.executetask'
(i runed it in colab)

Reimplement ConvRNN model for voice query recognition

We should reimplement the model from our most recent keyword spotting-related paper.

evaluation on wrong model in train()

I think that there is an error in train(). Evaluate() at the end of the training uses the last trained model, not the best one according to the dev set. The results are probably very similar though.

The best model is actually saved but one has to run the code with --type eval --input_file "best_model" to get the actual accuracy on the eval set.

Wrong hyperlink to honk-models

The hyperlink to honk-models under the section Server is https://github.com/honk-models. It should be https://github.com/castorini/honk-models instead.

Is there a trained brasilian portuguese model?

I want to use this with some personal commands, like saying the name "amanda" to start my commands, but I need the bot to understand how we from Brazil say this name. Is there any model trained for that?

Use my own res8 trained model in the Demo Application

Hi,
I have trained a "res8" model and saved it as a file ".pt".
When I modify in the config.json file the "model_path" of my model, I can't modify the type of the model that I input into the demo.
In fact, I discovered that if I train by myself a "cnn-trad-pool2", which is the default in the train.py file, the demo works fine with such model.
So, I would like to know if there is a possibility to say that my input model is a "res8" and not a "cnn-trad-pool2".
I have tried to input in the config.json file "model": "res8", but it seems that it was not even read as an input.

Thank you in advance.

Unable to install `PyOpenGL_accelerate` on MacOS Mojave

Here is the error:
error: command 'clang' failed with exit status 1 ---------------------------------------- ERROR: Command "/usr/local/opt/python/bin/python3.7 -u -c 'import setuptools, tokenize;__file__='"'"'/private/var/folders/ww/cclx4hb97_q6dbmy_srts9mr0000gp/T/pip-install-ndyalced/PyOpenGL-accelerate/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/ww/cclx4hb97_q6dbmy_srts9mr0000gp/T/pip-record-1qhkmp0n/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/ww/cclx4hb97_q6dbmy_srts9mr0000gp/T/pip-install-ndyalced/PyOpenGL-accelerate/

I can't find any info on how to get it installed on MacOS, any ideas?

Honk 2.0

We plan to wrap up the existing codebase into a historical package and then overhaul it.

Upgrade PyTube to PyTube3 for Keyword Data Generator

Hello, opening this issue to let you know that there is a new, actively maintained, Python3 only pytube fork: https://github.com/hbmartin/pytube3

How to calculate number of multiplies?

I found the number of parameters and multiplies at the paper.

Even tough I check most of the code base, I couldn't find out a snippet that measures it.

Would you please tell me how to calculate it?

RuntimeError: Error(s) in loading state_dict for SpeechResModel

Hey guys,
I am trying to do the evaluation using for speech_commands dataset (a reduced one where I chose only some keywords).
The data directory looks like this:

.
├── _unknown_
│   ├── dog_2c6d3924_nohash_2.wav.wav
│   ├── dog_43fc47a7_nohash_0.wav.wav
│   ├── dog_4c6167ca_nohash_1.wav.wav
├── down
│   ├── 022cd682_nohash_0.wav
│   ├── 0c40e715_nohash_0.wav
│   ├── 0c540988_nohash_0.wav
├── go
│   ├── 022cd682_nohash_0.wav
│   ├── 022cd682_nohash_1.wav
│   ├── 0487ba9b_nohash_0.wav
├── left
│   ├── 022cd682_nohash_0.wav
│   ├── 022cd682_nohash_1.wav
│   ├── 03401e93_nohash_0.wav
└── no
    ├── 03401e93_nohash_0.wav
    ├── 03401e93_nohash_1.wav
    ├── 03401e93_nohash_2.wav

The command used for evaluation is:

python -m utils.train \
    --gpu_no 3 \
    --data_folder /Google_Speech_Commands/chosen \
    --type eval \
    --wanted_words _unknown_ down go left no \
    --model res8 \
    --input_file /root/honk/model/res8.pt

Here is the error it thows:

Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/honk/utils/train.py", line 188, in <module>
    main()
  File "/root/honk/utils/train.py", line 185, in main
    evaluate(config)
  File "/root/honk/utils/train.py", line 66, in evaluate
    model.load(config["input_file"])
  File "/root/honk/utils/model.py", line 80, in load
    self.load_state_dict(torch.load(filename, map_location=lambda storage, loc: storage))
  File "/root/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 777, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SpeechResModel:
	Unexpected key(s) in state_dict: "scale1.scale", "scale3.scale", "scale5.scale".

Can anyone help on this please?

Explore Sphinx as an alternative backend for command triggering

We've been using Google's speech command dataset but we should explore Sphinx as well.

Contrastive estimation for training

From what I understand, one of the issues is confusion between the command word and other words in the vocab, e.g., "siri" is part of "Anserini".

As a solution, why don't we apply the apply of contrastive estimation from our other work - once we record instances of the command word, we search through our existing corpus of recordings and find instances that are similar (i.e., confusing). We then ask the user to say those words n times, giving us contrastive examples.

Cannot build model from audio files with a length of 3 seconds

I'm trying to create my own model. Google's Command Speech Set serves as the basis. Additionally I have six keywords (alexa / jarvis / computer are three of them), which are longer than 1 second. Therefore I brought all WAVs to a length of 3 seconds (many have silence at the end). Then I call:

python -m utils.train --wanted_words alexa jarvis computer down left right learn dog sheila marvin --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --input_length 48000 --model res8 --no_cuda true --pos_key_size 1000 --data_folder ./speech_commands_v0.02/ --output_file ./speech_commands_v0.02/model.pt
(input_length is set to 48000 because of the audio lengths)

However, this leads to the following error:

File "workspace/voice/honk/utils/model.py", line 258, in collate_fn
audio_tensor = torch.from_numpy(self.audio_processor.compute_mfccs(audio_data).reshape(1, 101, 40))
ValueError: cannot reshape array of size 12040 into shape (1,101,40)

I don't know what to do with the message or how to fix it.
When adding param "--audio_preprocess_type PCEN" I am able to create the model. From this I can also create the file with the weights and use it in Honkling. But the recognition doesn't work at all. It constantly recognizes "computer" and nothing else, even if this keyword is not spoken at all or something is spoken at all.

What can I do to make it work?

Decrease test dataloader batch size

For me, decreasing the batch size of test dataloader was very helpful for GPU speed and memory.
So in train.py, changing

test_loader = data.DataLoader( test_set, batch_size=len(test_set), shuffle=False, collate_fn=test_set.collate_fn)

test_loader = data.DataLoader( test_set, batch_size=min(len(test_set), config["batch_size"] // 2), shuffle=False, collate_fn=test_set.collate_fn)

There doesn't seem to be an advantage to loading the entire test set on GPU at the same time. I tried to pull request this change but I don't think I'm allowed to. Hope this is helpful!

Thanks,
Bryan

Merge in Alexa frontend

Shawn, we should merge in the Alex frontend code into this repo also?

Are there any useable demo in this project ?

"python ." is not runnable on any model in the model.zip now. What is the default expected size in server.py , and how can I change it? It seems that many parameters are hard-coded everywhere.

Any chance Honk 2 might be in the wings?

Pytorch now has TORCHAUDIO so LibRosa and certain architecture install problems are not needed.

Also newer models such as CRNN & DS-CNN look interesting, but it would be great if some of these incremental additions could be integrated into an example of Honk2.

Stuart

ERROR: Internal error <FBConfig with necessary capabilities not found>

Hello,

I am using Ubuntu 16.04
Python 3.5.2
PyTorch 0.4.1

Installed requirements and downloaded data as well.
Modified config.json so cuda is not 'looked for' in my cpu-only machine.

server step works fine: python .
Encounter error while running speech_demo.py

python utils/speech_demo.py
ALSA lib pcm_dsnoop.c:606:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1029:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_dmix.c:1029:(snd_pcm_dmix_open) unable to open slave
freeglut (foo): ERROR: Internal error in function fgOpenWindow

I don't think the error is due to any audio devices binding; I may be wrong... I am able to record and playback using arecord & aplay in this server, meaning no issues with the audio devices. There are no other applications using the audio devices while I run speech_demo.py

Please let me know how to fix it.

thanks,
Buvana

PyTorch version used for demo/training

On a fresh installation the PyTorch website gives the newest version to download. This causes an error when trying to launch the demo. I am on Ubuntu 16.04, Python 3.5.2, without CUDA. The config.json shows no_cuda as false. When I run python . in the demo instructions I get the following error:

 python .
./utils/model.py:143: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  x = Variable(torch.zeros(1, 1, height, width), volatile=True)
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "./__main__.py", line 21, in <module>
    main()
  File "./__main__.py", line 18, in main
    server.start(config)
  File "./server.py", line 144, in start
    lbl_service = load_service(config)
  File "./server.py", line 128, in load_service
    lbl_service = TorchLabelService(model_path, labels=commands, no_cuda=config["model_options"]["no_cuda"])
  File "./service.py", line 78, in __init__
    self.reload()
  File "./service.py", line 85, in reload
    self.model.cuda()
  File "/home/johnsigmon/programming/ml-sandbox/reactions/reaction-009-AVR/honk/.env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/johnsigmon/programming/ml-sandbox/reactions/reaction-009-AVR/honk/.env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/home/johnsigmon/programming/ml-sandbox/reactions/reaction-009-AVR/honk/.env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 191, in _apply
    param.data = fn(param.data)
  File "/home/johnsigmon/programming/ml-sandbox/reactions/reaction-009-AVR/honk/.env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 258, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: Cannot initialize CUDA without ATen_cuda library. PyTorch splits its backend into two shared libraries: a CPU library and a CUDA library; this error has occurred because you are trying to use some CUDA functionality, but the CUDA library has not been loaded by the dynamic linker for some reason.  The CUDA library MUST be loaded, EVEN IF you don't directly use any symbols from the CUDA library! One common culprit is a lack of -Wl,--no-as-needed in your link arguments; many dynamic linkers will delete dynamic library dependencies if you don't depend on any of their symbols.  You can check if this has occurred by using ldd on your binary to see if there is a dependency on *_cuda.so library.

If you have the PyTorch version used for the demo that seems the easiest way for me to get past this. Let me know if you have other suggestions though. Thanks!

how to evaluate on "Tensorflow Speech Commands Dataset"?

In your paper "Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting", it is mention pytorch implementation achieved 87.5% ± 0.340 accuracy (table.3). How can I reproduce this evaluation results?

i am using the commands on the python script train.py:

python train.py --model cnn-trad-pool2 --input_file "//model/google-speech-dataset.pt" --data_folder "//data/speech_commands_v0.01" --gpu_no 0 --n_labels 12 --wanted_words yes no up down left right on off stop go

Is it correct?

Merge RPi power consumption scripts

Merge the test harness and all code necessary to evaluate RPi power consumption of various CNNs.

an error has occured while processing video HTTP Error 429: Too Many Requests

def retrieve_keyword_audio(vid, keyword):
audio_index = 0
v_url = URL_TEMPLATE.format(vid)
youtube = YouTube(v_url)

y_len = youtube.player_config_args['player_response']['videoDetails']['lengthSeconds']
print("Length:", y_len)
print("Views:", youtube.views)

if int(y_len) > 2700:
    # only consider video < 45 mins
    return audio_index

i got an error with length of video so i modifed code of keyword_data_generator.py
but i got error named "HTTP Error 429: Too Many Requests"
how can i solve this problem(help me)

Question about the dimension of mfcc

Hi,

Can you help explain why the time dimension is 101 instead of 100?

OSError: [Errno -9996] Invalid input device

After PyTorch server started, "speech_demo.py" gives following error:

# python utils/speech_demo.py

ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM sysdefault
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM sysdefault
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.front
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround21
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround21
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround40
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround41
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround50
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround51
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround71
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM dmix
Traceback (most recent call last):
  File "utils/speech_demo.py", line 266, in <module>
    main()
  File "utils/speech_demo.py", line 262, in main
    app = DemoApplication(LabelClient(flags.server_endpoint))
  File "utils/speech_demo.py", line 164, in __init__
    frames_per_buffer=self.chunk_size, stream_callback=self._on_audio)
  File "/root/anaconda3/lib/python3.7/site-packages/pyaudio.py", line 750, in open
    stream = Stream(self, *args, **kwargs)
  File "/root/anaconda3/lib/python3.7/site-packages/pyaudio.py", line 441, in __init__
    self._stream = pa.open(**arguments)
OSError: [Errno -9996] Invalid input device (no default output device)

Any idea why this error... ?

I'm using Docker on Ubuntu 16.04 LTS

How to get an uncertain length of audio into the model

i was trying to feed some audio data from my own into the pretrained model, however, it seems like the tensor shape can only be (101,40), how should i resize my own data to such shape like that with uncertain length?

Can you please specify library versions in the requirement files

The requirement files do not specify the versions of each library which is making it hard for me (and I assume for others) to run the code.

Compare vs. existing projects

I did a quick search and found several other projects that have similar goals. We should find out if we're reinventing the wheel:

onnx models won't pass the checker (Raspberry Pi)

Hi,
I have followed the instructions for running on Raspberry Pi infrastructure (using RPI 3 model B with latest raspbian) and have encountered the following error when loading any .onnx models:

File "/home/pi/honk/main.py", line 18, in main
server.start(config)
File "./server.py", line 144, in start
lbl_service = load_service(config)
File "./server.py", line 126, in load_service
lbl_service = Caffe2LabelService(model_path, commands)
File "./service.py", line 62, in init
self.model = onnx_caffe2.backend.prepare(self._graph)
File "/home/pi/.local/lib/python2.7/site-packages/onnx_caffe2/backend.py", line 513, in prepare
super(Caffe2Backend, cls).prepare(model, device, **kwargs)
File "/home/pi/.local/lib/python2.7/site-packages/onnx/backend/base.py", line 53, in prepare
onnx.checker.check_model(model)
File "/home/pi/.local/lib/python2.7/site-packages/onnx/checker.py", line 32, in checker
proto.SerializeToString(), ir_version)
onnx.onnx_cpp2py_export.checker.ValidationError: Unrecognized attribute: dilations
==>
Context: Bad node spec: input: "13" output: "15" op_type: "MaxPool" attribute { name: "kernel_sha pe" ints: 2 ints: 2 } attribute { name: "pads" ints: 0 ints: 0 } attribute { name: "dilations" ints: 1 ints: 1 } attribute { name: "strides" ints: 2 ints: 2 }

It seems that my pooling layers do not support dilation?
(which coincides with operator descriptions [https://github.com/onnx/onnx/blob/master/docs/Operators.md#MaxPool] )

May I ask what versions/branches did you use for successful loading of the model?
Or if I can fix this in any other way?

Note: In step 9, I've had a few more files that needed numba and jit commenting under the librosa package folder

Port TF audio ops to PyTorch

Secondary objective is to port audio processing ops to PyTorch. Specifically, the following need to be implemented without third-party library support:

FFT/STFT (inverse FFT isn't required)
log-Mel filterbank
fast DCT

torchaudio provides a wrapper of librosa's Mel spectrogram function (first two points). However, this implementation is claimed as relatively slow and non-GPU accelerated. It would be nice to implement the entire audio processing pipeline in a fast PyTorch-friendly manner.

'from utils.manage_audio import preprocess_audio' in service.py is missing in utils.manage_audio.py

Use Keyword Spotting Data Generator to generate non-English

Dear,It was a great project and I learned a lot.I want to make non-english datasets, such as Chinese datasets, how should I operate?

Demo of TensorFlow port

We should have a simple demo of the TensorFlow port, similar to the "prebuilt TensorFlow Android demo application" here

Should be a simple Python command to start fire up demo, showing the list of commands, which light up as you say the words. A spectrograph vis would be nice also. Demo should wget the pre-trained model from honk-models.

Update README accordingly.

How to calculate False alarm rate and false rejection rate

Hi,
thank you for sharing the code
I am a bit curious about how you calculate the false alarm and false rejection rate like in your paper.
I hope you can tell me how to do it

Input Features Size

thank for your great jobs .
I see the model config in https://github.com/castorini/honk/blob/master/utils/model.py#L367-L400 ，the model input height is 101 ,It's means the input wav is over 1s when the frame is 10ms .why not choose the standard left 23 and right 8 ?

New speech command dataset

Should we give a hint about the new speech_command dataset in the utils/speech_commands_example/train.py and the Readme?

http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz

Instruction for Running the demo on windows OS

Instructions are in this pull requests #41

Technical Explanation of Desktop Application

Hi,
I would like to know what is the technical process happening "behind the scenes" when launching the Demo Application with command python.py and then python utils/speech_demo.py.

How is the audio (streaming now, not a .wav recored of 1 second length) treated as input?

Is it segmented in pieces of (overlapping) 1 seconds, as it seems in the log, after launching python .?

How is the posterior handling managed? Is there a reasoning similar to what proposed by Parada?

The reason why I am interested in the functioning of the Demo Application is that I tried to submit an own-recorded audio to the train.py by using --mode eval and I got a single prediction. So, the wav file was basically converted into a single image, which led eventually to a single prediction.

Thanks in advance.

ONNX models export

Hi,

I am trying to reproduce the raspberry-pi demo results, however I cannot produce my own onnx models.

I tried to train a model of my own, and then convert it to onnx using torch.onnx.export. (as the end of first code cell here). However, this does not work because onnx does not have the unsqueeze operator which is used in both the SpeechModel and SpeechResModel forward functions. I get this warning :

ONNX export failed on unsqueeze because torch.onnx.symbolic.unsqueeze does not exist

It appears the unsqueeze operator is not yet implemented in onnx, though they seem to be working on it (cf onnx/onnx#497). Did you use a specific branch of onnx or torch.onnx ?

Could you please provide the code you used to generate your onnx models, or indicate how to export a pytorch model to onnx ?

Thank you

Write README for RPi

@daemon Write the README for getting Caffe2 and all necessary infrastructure working on RPi.

Logo Proposal for Honk

Hello, as you requested I create this design for Honk, see if you like it.

confusion about train_set and dev_set

How to concatenate train_set and dev_set?

无法获取trainning data ，下载路径没有那个文件

Hi,
I have tried your Demo Application with ./fetch_data.sh
Would you give me some help for downloading data? If convenient,please send me data to my email ( [email protected])

train on custom dataset

what are the steps that i have to follow to train on my custom dataset.
i have 2500 files for each keyword.

how to train new google-speech-dataset.pt

hi, daemon,Thank you for your project, i want to train a new model like "google-speech-dataset.pt",
i am using the commands on the python script train.py:
python -m utils.train --model cnn-trad-pool2 --output_file "model/google-speech_new.pt" --data_folder "data/speech_dataset" --n_labels 12 --wanted_words yes no up down left right on off stop go --no_cuda true

or
python -m utils.train --wanted_words yes no up down left right on off stop go --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --no_cuda true --model cnn-trad-pool2