
trigger-word-detection's Introduction

Trigger-Word-Detection


Introduction

  • This implementation was inspired by the Deep Learning Specialization on Coursera by Andrew Ng.
  • A trigger word is a word used to wake up a virtual voice assistant, for example “Ok Google”.
  • Trigger word detection works by listening to a stream of audio and preprocessing it before sending it to the model, which predicts whether the trigger word is present or not.
  • We have trained our model for the trigger word 'activate'.

Sections


Time to cook some data.

  • We have 3 types of audio recordings.

    • Positives
    • Negatives
    • Backgrounds
  • Positives are the trigger words on which we want our system to wake up.

  • Negatives are the words on which our system should not respond.

  • Backgrounds are recordings of the noise present in different environments.

  • To generate training data, we randomly pick audio from positives and negatives, and overlay them on the background noises.
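
The overlay step can be sketched with NumPy. This is a minimal illustration with hypothetical names (`overlay_clip`, the gain value, and the toy signals are assumptions, not this repository's actual code); a real pipeline would also keep clips from overlapping and emit the matching labels:

```python
import numpy as np

rng = np.random.default_rng(0)

def overlay_clip(background, clip, gain=0.5):
    """Mix `clip` into `background` at a random offset; return the
    mixed audio and the (start, end) sample span of the insertion."""
    start = int(rng.integers(0, len(background) - len(clip)))
    end = start + len(clip)
    mixed = background.copy()
    mixed[start:end] += gain * clip  # overlay on top of the noise
    return mixed, (start, end)

# Toy example: 10 s of "background" noise and a 1 s stand-in clip at 16 kHz.
sr = 16_000
background = 0.01 * rng.standard_normal(10 * sr)
clip = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
mixed, (start, end) = overlay_clip(background, clip)
```

Because the overlay is additive rather than a replacement, the background noise is still audible underneath the clip, which is what makes the synthesized examples realistic.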


Let's go deeper.

  • Audio data captures the variation in air pressure amplitude over time. You can think of an audio recording as a long list of numbers measuring the little air pressure changes detected by the microphone.

  • It is quite difficult to figure out from this "raw" representation of audio whether the word "activate" was said. To help our sequence model more easily learn to detect trigger words, we compute a spectrogram of the audio.

  • A spectrogram is a visual representation of the frequencies in a signal over time. In a spectrogram plot, one axis represents time, the other represents frequency, and the colors represent the magnitude (amplitude) of the observed frequency at a particular time.

  • The spectrogram of one of our training examples is shown below:
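
As a concrete (if minimal) sketch, a spectrogram can be computed with `scipy.signal.spectrogram`; the sample rate, window length, and overlap below are illustrative values, not the parameters used in this repository:

```python
import numpy as np
from scipy.signal import spectrogram

sr = 16_000                           # assumed sample rate
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 1000 * t)  # one second of a 1 kHz test tone

# freqs: frequency axis (Hz); times: time axis (s);
# Sxx: energy in each (frequency, time) bin -- the spectrogram itself.
freqs, times, Sxx = spectrogram(audio, fs=sr, nperseg=200, noverlap=120)

dominant = freqs[Sxx.mean(axis=1).argmax()]  # should land near 1 kHz
```

Each column of `Sxx` is what the model actually sees at one time step: a vector of energies across frequency bands instead of raw samples.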


Let's have a look at our model.

  • The architecture of the model consists of 1-D convolutional layers, GRU layers, and dense layers.

  • The bottom-most layer is a 1-D conv layer, similar to the convolutional layers used in image classification networks. It converts an input of 5511 time steps into an output of 1375 time steps.
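
The 5511 → 1375 reduction is just the standard "valid" convolution length formula. The kernel size and stride below are the values used in the Coursera assignment this project follows (an assumption about this particular repository):

```python
def conv1d_out_len(n_in, kernel_size, stride):
    # 'valid' 1-D convolution: floor((n_in - kernel_size) / stride) + 1
    return (n_in - kernel_size) // stride + 1

print(conv1d_out_len(5511, kernel_size=15, stride=4))  # -> 1375
```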

  • The conv layer is followed by batch normalization, an activation, and a dropout layer.

  • GRUs (gated recurrent units) are an improved version of the standard recurrent neural network; they aim to solve the vanishing-gradient problem that standard RNNs suffer from.

  • Let's view the summary of our model.

model.summary()

  • As you can see, our model has about 500K trainable parameters, so training from scratch takes a long time. To save time, a model was trained for about 3 hours on a GPU using the architecture above and a large training set of about 4,000 examples. This model is saved in the file 'model.h5'.

  • We trained the model further with the Adam optimizer and binary cross-entropy loss on a small training set of 26 examples, which we created ourselves. This model is saved in the file 'model_trained.h5'.
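
Binary cross-entropy fits here because each of the model's output steps is a single probability that the trigger word just occurred. For reference, the formula that Keras applies under `loss='binary_crossentropy'` looks like this in NumPy:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all output steps."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep log() finite
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))
```

For example, a maximally uncertain prediction of 0.5 on a positive step costs ln 2 ≈ 0.693, while a confident correct prediction costs almost nothing.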

  • If we plot the spectrogram of the audio together with the prediction values of our model, we can see peaks depicting the trigger word, like the one below.
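
Turning the per-step probabilities into discrete detections is then a matter of thresholding the model's output; a sketch (the threshold, gap, and function name are illustrative choices, not this repository's code):

```python
import numpy as np

def detect_triggers(probs, threshold=0.5, min_gap=20):
    """Return the output-step indices where a detection peak begins,
    merging threshold crossings closer than `min_gap` steps."""
    hits = []
    for i in np.flatnonzero(probs > threshold):
        if not hits or i - hits[-1] > min_gap:
            hits.append(int(i))
    return hits

# A fake model output: one confident peak around step 40.
probs = np.zeros(1375)
probs[40:50] = 0.9
print(detect_triggers(probs))  # -> [40]
```

The `min_gap` merge keeps one sustained peak from being reported as several separate detections.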



trigger-word-detection's People

Contributors

paavangpt · shivammalviya712


trigger-word-detection's Issues

load mp3 or wav files

how can i test my wav file that there is a trigger word in my file or not ? i don't want to use real time

AttributeError: 'tuple' object has no attribute 'ndim'

Model is loaded
Realtime is done
start
yesssss
(array([ 0.00747681, 0.00985718, 0.01608276, ..., -0.00848389,
-0.01101685, -0.0112915 ]), 44100)
Traceback (most recent call last):
File "main.py", line 23, in
realtime.refresh_audio()
File "/home/dimanshu/Trigger-Word-Detection/code/realtime.py", line 41, in refresh_audio
self.new_x = self.spectrogram(self.new_audio).T
File "/home/dimanshu/Trigger-Word-Detection/code/realtime.py", line 63, in spectrogram
nchannels = sound.ndim
AttributeError: 'tuple' object has no attribute 'ndim'
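
The printed value `(array([...]), 44100)` suggests the audio loader returned a `(samples, sample_rate)` tuple, while `spectrogram()` expects the bare sample array. A likely fix (sketched here; the actual loader call in `realtime.py` may differ) is to unpack the tuple before use:

```python
import numpy as np

# Stand-in for what the loader returned: (samples, sample_rate).
loaded = (np.array([0.00747681, 0.00985718, -0.0112915]), 44100)

sound, sample_rate = loaded  # unpack instead of passing the tuple onward
nchannels = sound.ndim       # now valid: 1 for mono audio
```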

for multiple trigger word

can i use this model for multiple word detection ?
i want to train model for 5-6 trigger words . so i have to add all the wav files into positive folder right ? thats it ? will this work for multiple trigger words ?

Pyaudio and Portaudio errors

  • Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  • Could not import the pyaudio c module '_portaudio'.
  • ImportError: DLL load failed: The specified module could not be found.
