
light-sernet's Introduction

Light-SERNet

This is the TensorFlow 2.x implementation of our paper "Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition", accepted at ICASSP 2022.

In this paper, we propose an efficient and lightweight fully convolutional neural network (FCNN) for speech emotion recognition in systems with limited hardware resources. In the proposed FCNN model, various feature maps are extracted via three parallel paths with different filter sizes. This helps the deep convolution blocks extract high-level features while ensuring sufficient separability. The extracted features are used to classify the emotion of the input speech segment. While our model is smaller than the state-of-the-art models, it achieves higher performance on the IEMOCAP and EMO-DB datasets.
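For intuition, here is a minimal Keras sketch of the parallel-path idea; the kernel shapes, filter count, and the 40-coefficient MFCC input are illustrative assumptions, not the paper's exact configuration.

import tensorflow as tf
from tensorflow.keras import layers

def parallel_paths(inputs, filters=32):
    # Three parallel convolutional paths with different (assumed) kernel shapes,
    # intended to capture joint time-frequency, temporal, and spectral patterns.
    path_a = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(inputs)
    path_b = layers.Conv2D(filters, (9, 1), padding="same", activation="relu")(inputs)
    path_c = layers.Conv2D(filters, (1, 11), padding="same", activation="relu")(inputs)
    # The concatenated feature maps feed the deeper convolution blocks of the FCNN body.
    return layers.concatenate([path_a, path_b, path_c])

# Example input: MFCCs of shape (frames, 40, 1) for one speech segment.
mfcc_input = tf.keras.Input(shape=(None, 40, 1))
features = parallel_paths(mfcc_input)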

Demo

Demo on EMO-DB dataset: Open In Colab

Run

1. Clone Repository

$ git clone https://github.com/AryaAftab/LIGHT-SERNET.git
$ cd LIGHT-SERNET/

2. Requirements

  • Tensorflow >= 2.3.0
  • Numpy >= 1.19.2
  • Tqdm >= 4.50.2
  • Matplotlib >= 3.3.1
  • Scikit-learn >= 0.23.2
$ pip install -r requirements.txt

3. Data

  • Download the EMO-DB and IEMOCAP (requires permission to access) datasets
  • Extract them into the data folder

Note: To use the IEMOCAP dataset, please follow issue #3.

4. Set Hyperparameters and Training Config

You only need to change the constants in hyperparameters.py to set the hyperparameters and the training configuration.
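As a hypothetical illustration of the kind of constants involved (only FRAME_STEP = 256 is confirmed by the repository; the other names and values below are assumptions for the sake of the example):

# hyperparameters.py (illustrative excerpt, not the actual file)
SAMPLE_RATE = 16000     # audio sampling rate in Hz
FRAME_LENGTH = 1024     # 64 ms analysis window at 16 kHz
FRAME_STEP = 256        # 16 ms hop between frames
N_MFCC = 40             # number of MFCC coefficients per frame
BATCH_SIZE = 32         # training batch size
EPOCHS = 300            # number of training epochs
LEARNING_RATE = 1e-4    # optimizer learning rate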

5. Start Training

Use the following command to train the model with the desired dataset, cost function, and input length (in seconds).

  • Note 1: The input is automatically cut or padded to the desired size and stored in the data folder.
  • Note 2: The best model is saved in the model folder.
  • Note 3: The results for the confusion matrix are saved in the result folder.
$ python train.py -dn {dataset_name} \
                  -id {input durations} \
                  -at {audio_type} \
                  -ln {cost function name} \
                  -v {verbose for training bar} \
                  -it {type of input(mfcc, spectrogram, mel_spectrogram)} \
                  -c {type of cache(disk, ram, None)} \
                  -m {fuse mfcc feature extractor in exported tflite model}

Example:

EMO-DB Dataset:

python train.py -dn "EMO-DB" \
                -id 3 \
                -at "all" \
                -ln "focal" \
                -v 1 \
                -it "mfcc" \
                -c "disk" \
                -m false

IEMOCAP Dataset:

python train.py -dn "IEMOCAP" \
                -id 7 \
                -at "impro" \
                -ln "cross_entropy" \
                -v 1 \
                -it "mfcc" \
                -c "disk" \
                -m false

Note: To reproduce all experiments, just run run.sh:

sh run.sh

Fusing MFCC Extractor (New Feature)

To run the model independently, without the need for the TensorFlow library, the MFCC feature extractor was added as a single layer at the beginning of the model. The trained model was then exported as a single file in the TensorFlow Lite format. The input of this model is raw audio in the form of a vector of shape (1, sample_rate * input_duration). To train with the fused feature extractor:

python train.py -dn "EMO-DB" \
                -id 3 \
                -m True
  • Note 1: The best model is saved in the model folder.
  • Note 2: To run the TFLite model, you can use the tflite_runtime library. For this project, tflite_runtime must be built with TF op support (Flex delegate); you can learn how to build TensorFlow Lite from source with this flag here. A minimal inference sketch is given below.
  • Note 3: A separate repository for running the TFLite model as a real-time application will be completed soon.
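A minimal inference sketch with tflite_runtime, assuming the exported file is named model.tflite inside the model folder and that the model was trained with a 3-second input at 16 kHz (both assumptions based on the EMO-DB example above):

import numpy as np
from tflite_runtime.interpreter import Interpreter  # must be built with Flex delegate (TF op) support

SAMPLE_RATE = 16000
DURATION = 3  # seconds; must match the -id value used for training

interpreter = Interpreter(model_path="model/model.tflite")  # hypothetical path/filename
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Raw waveform of shape (1, sample_rate * input_duration), as described above.
waveform = np.zeros((1, SAMPLE_RATE * DURATION), dtype=np.float32)  # replace with real audio samples
interpreter.set_tensor(input_details[0]["index"], waveform)
interpreter.invoke()
probabilities = interpreter.get_tensor(output_details[0]["index"])
print("Predicted emotion index:", int(np.argmax(probabilities)))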

Citation

If you find our code useful for your research, please consider citing:

@inproceedings{aftab2022light,
  title={Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition},
  author={Aftab, Arya and Morsali, Alireza and Ghaemmaghami, Shahrokh and Champagne, Benoit},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6912--6916},
  year={2022},
  organization={IEEE}
}


light-sernet's Issues

InvalidArgumentError: Cannot batch tensors with different shapes in component 0.

Hello! Good job! But I have an error. I want to test the model with my own audio files, so I created a folder my_test_3.0s_Segmented in data, with the audio organized by emotion. Everything goes well, but I always get an error at list(test_dataset.as_numpy_iterator()):
InvalidArgumentError: Cannot batch tensors with different shapes in component 0. First element had shape [103,40,1] and element 1 had shape [92,40,1]. [Op:IteratorGetNext]
This prevents me from testing. When I run the same code on the test data generated while training the model, it works and I get a result. How can I fix this?
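One common cause of this error is that the test clips were not cut or padded to the training input duration, so the MFCC tensors have different numbers of frames and cannot be stacked into a batch. A generic sketch (not the repository's code) of padding each batch to the longest clip with tf.data:

import tensorflow as tf

def make_batches(feature_dataset, batch_size=32):
    # feature_dataset is assumed to yield (mfcc, label) pairs with mfcc of shape (frames, 40, 1);
    # padded_batch pads the variable frame axis so all elements in a batch share one shape.
    return feature_dataset.padded_batch(batch_size, padded_shapes=([None, 40, 1], []))

Alternatively, trimming or zero-padding the raw audio to exactly sample_rate * input_duration before feature extraction (as the training pipeline does, per Note 1 above) avoids the mismatch altogether.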

MFCC hop size problem.

"Good job on the paper. However, there seems to be a discrepancy regarding the frame overlaps and hop size between your text and the provided code. In your paper, it's stated that a Hamming window is used to split the audio signal into 64-ms frames with 16-ms overlaps, which are considered as quasi-stationary segments. From this, it would logically follow that the hop size is 48 ms.

However, in the hyperparameters.py file, it's stated "FRAME_STEP = 256". Given a sampling rate (fs) of 16 kHz, this implies a hop size of 16 ms, not 48 ms. Could you please clarify if there's a typographical error in the paper, or if there's a specific reason for this inconsistency?"
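For reference, the arithmetic behind the question (assuming the 16 kHz sampling rate stated in the issue):

fs = 16000
frame_length = fs * 64 // 1000      # 64 ms window -> 1024 samples

# Reading of the paper: 16 ms overlap -> hop = 64 - 16 = 48 ms
hop_from_paper = fs * 48 // 1000    # 768 samples

# Value in hyperparameters.py:
FRAME_STEP = 256                    # 256 / 16000 s = 16 ms hop, i.e. a 48 ms overlap
print(frame_length, hop_from_paper, FRAME_STEP)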

cannot run the IEMOCAP dataset on windows

Hello, could you show the data folder structure so I can understand how you organised the dataset?
I keep getting errors when segmenting the data.
I extracted IEMOCAP_full_release into the data folder and renamed it to IEMOCAP, but I keep getting file-not-found errors.

function cleaning_directory_filename()

I think the function cleaning_directory_filename() breaks the speaker independence described in the paper, i.e., 10-fold cross-validation, causing speaker overlap between the training and test sets. After removing this function, I see an 8% drop in WA. Could you clarify this?
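For context, a generic sketch (not the repository's code) of what a speaker-independent 10-fold split looks like, using scikit-learn's GroupKFold so that no speaker appears in both the training and test folds:

import numpy as np
from sklearn.model_selection import GroupKFold

# Toy placeholders: one row per utterance, grouped by a speaker ID.
features = np.random.rand(20, 40)           # placeholder feature vectors
labels = np.random.randint(0, 4, size=20)   # placeholder emotion labels
speakers = np.repeat(np.arange(10), 2)      # two utterances per speaker

gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(features, labels, groups=speakers):
    # With grouping by speaker, the train and test speaker sets are disjoint.
    assert set(speakers[train_idx]).isdisjoint(set(speakers[test_idx]))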
