Small-footprint keyword spotting

Group members // Daniele Ninni, Nicola Zomer

In this work, we experiment with several neural network architectures as possible approaches to the keyword spotting (KWS) task. We run our tests on the Google Speech Commands dataset, one of the most popular datasets in the KWS context. We define a CNN model that outperforms our baseline model and use it to study the impact of different preprocessing, regularization and feature extraction techniques. We find, for instance, that log Mel-filterbank energy features lead to the best performance, and that adding background noise to the training set with a reduction coefficient of 0.5 helps the model to learn. We then explore other machine learning models, namely ResNets, RNNs, attention-based RNNs and Conformers, in order to achieve an optimal trade-off between accuracy and footprint. We find that these architectures offer a 30-40% improvement in accuracy over the baseline, while reducing the number of parameters by up to 10x.
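
As an illustration of the background-noise augmentation mentioned above, here is a minimal sketch, not the code from our notebooks. It assumes that the "reduction coefficient" is a scalar that multiplies the noise before it is added to the clean waveform, and that waveforms are float arrays in [-1, 1]. The function name `add_background_noise` and its parameters are hypothetical.

```python
import numpy as np

def add_background_noise(signal, noise, coefficient=0.5, rng=None):
    """Mix a random crop of `noise` into `signal`, scaled by `coefficient`.

    Assumption: the reduction coefficient simply multiplies the noise
    amplitude before the two waveforms are summed.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float32)
    noise = np.asarray(noise, dtype=np.float32)

    if len(noise) < len(signal):
        # Tile short noise clips so a full-length crop is always available.
        repeats = int(np.ceil(len(signal) / len(noise)))
        noise = np.tile(noise, repeats)

    # Pick a random crop of the noise with the same length as the signal.
    start = rng.integers(0, len(noise) - len(signal) + 1)
    crop = noise[start:start + len(signal)]

    # Attenuate the noise, add it, and keep the result in [-1, 1].
    return np.clip(signal + coefficient * crop, -1.0, 1.0)
```

With one-second clips sampled at 16 kHz, a call such as `add_background_noise(clip, noise_clip, coefficient=0.5)` would return an array of the same length as `clip`. Applying it only to training examples leaves the validation and test sets clean.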

Notebooks

  • Data analysis and preprocessing inspection
    This notebook loads and prepares the dataset, splitting it into training, validation and test sets. It also plots some summary information about the dataset and introduces the functions used to preprocess the data (for example, adding noise).

  • Keyword spotting: general training notebook
    This notebook defines the general training and testing pipeline, giving some information about the validation metrics used. The training is performed using our baseline model cnn-one-fpool3, taken from [Arik17].

  • Bayesian optimization and feature comparison with CNN
    This notebook is used to train our custom CNN models. We run a Bayesian optimization on the first of these models, then use it to assess the importance of dropout and batch normalization, to compare input features, and to study the effect of data augmentation on the training set.

  • Keyword spotting: ResNet architecture and triplet loss implementation
    In this notebook we play with ResNet models for the keyword spotting task. We start by implementing a simple ResNet architecture inspired by [Tang18]. Then, motivated by [Vygon21], we modify this model and train it to produce a meaningful embedded representation of the input signals. Finally, we use k-NN to perform the classification task on these intermediate representations.

  • Keyword spotting: a neural attention model for speech command recognition
    This notebook implements an attention model for speech command recognition. It is a modification of a demo notebook prepared by the authors of the paper "A neural attention model for speech command recognition".

  • Keyword spotting: Conformer
    In this notebook, thanks to the library audio_classification_models, we implement a baseline Conformer architecture inspired by [Gulati20]. This model combines Convolutional Neural Networks and Transformers to get the best of both worlds, modeling both local and global features of an audio sequence in a parameter-efficient way. Specifically, we use only one Conformer block in order to reduce the number of model parameters. Moreover, we perform hyperparameter tuning by means of Bayesian optimization in order to find, among the models with fewer than 2M parameters, the one that leads to the best accuracy.

  • Keyword spotting: GAN-based classification
    In this notebook we try to implement a GAN-based classifier inspired by the paper "GAN-based Data Generation for Speech Emotion Recognition". Unfortunately, to date we have not been able to figure out how to properly train the generator and discriminator in this specific case. As a result, we cannot currently test this approach.
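
To make the embedding-based approach of the ResNet notebook more concrete, here is a minimal NumPy sketch of a triplet loss and of k-NN classification on embeddings. It is an illustration under stated assumptions, not the implementation used in the notebook: the distance is plain (non-squared) Euclidean, the margin and `k` values are arbitrary, and the names `triplet_loss` and `knn_predict` are hypothetical. In practice the embeddings would come from the trained network rather than being given directly.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(0, d(a, p) - d(a, n) + margin) over a batch of embeddings.

    Each argument has shape (batch, dim). Assumption: Euclidean distance.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

def knn_predict(train_emb, train_labels, query_emb, k=3):
    """Label each query by majority vote among its k nearest training embeddings.

    `train_labels` must be an array of non-negative integer class indices.
    """
    # Pairwise Euclidean distances, shape (n_queries, n_train).
    diffs = query_emb[:, None, :] - train_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Indices of the k closest training points for every query.
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_labels[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])
```

The loss is zero once every negative is farther from the anchor than the positive by at least `margin`, which is what pushes same-keyword embeddings together and different-keyword embeddings apart. After training, `knn_predict` only needs the stored training embeddings and their labels, so no additional classification layer is required.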

Utilities

Demo App

In this repository you can find a demo application that can be run as a Python script with python demo_ks.py. It allows you to select the model you want to use and, when started, it detects the words of the Speech Commands Dataset through the microphone (or any chosen input device).

You can also find a notebook that plays some commands from the dataset, so that the application can be tested with non-real-time signals.


Human Data Analytics
University of Padua, A.Y. 2022/23
