Temporal Sub-sampling of Audio Feature Sequences for Audio Captioning

Code for the paper "Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning" by Khoa Nguyen, Konstantinos Drossos, and Tuomas Virtanen. To set up the project, follow the instructions in the baseline method repository: https://github.com/audio-captioning/dcase-2020-baseline.

To run an experiment with the sub-sampling method for audio captioning, use

python main.py -c main_settings -j 0 -d <path_to_settings> -v 

where <path_to_settings> is the path to the directory that contains the .yaml files, e.g. settings/baseline.
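
As a minimal sketch of the invocation above, the following Python helper checks that a settings directory contains .yaml files at its top level and then builds the argument list. The helper name build_command is hypothetical and not part of the repository; only the flags from the command above are used.

```python
from pathlib import Path
from typing import List


def build_command(settings_dir: str) -> List[str]:
    """Build the argument list for main.py (hypothetical helper, not in the repo)."""
    directory = Path(settings_dir)
    # The -d option expects a directory that contains .yaml files.
    yaml_files = sorted(directory.glob("*.yaml"))
    if not yaml_files:
        raise FileNotFoundError(f"no .yaml files found in {directory}")
    # Same flags as in the command shown above.
    return ["python", "main.py", "-c", "main_settings", "-j", "0",
            "-d", str(directory), "-v"]
```

The returned list can be passed to subprocess.run(..., check=True) from the standard library, for example with build_command("settings/baseline").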

Settings for the sub-sampling method

The file settings/subsampling4/no_attn_lr_1e-4_loss_thr_1e-3/model.yaml holds the settings for the sub-sampling DNN:

use_pre_trained_model: No
encoder:
    input_dim_encoder: 64
    hidden_dim_encoder: 256
    output_dim_encoder: 256
    dropout_p_encoder: .25
    sub_sampling_factor_encoder: 4
decoder:
    output_dim_h_decoder: 256
    nb_classes:  # Empty, to be filled automatically.
    dropout_p_decoder: .25
    max_out_t_steps: 22
    mode: 0 # mode 0 for no attention, mode 1 for attention
    num_attn_layers: 0 # number of layers if using attention
    first_attn_layer_output_dim: 0

The use_pre_trained_model flag indicates whether a pre-trained model will be used. If this flag is set to Yes, the name of the file with the weights of the pre-trained model must be specified in the settings/dirs_and_files.yaml file.

The encoder block has the settings for the encoder of the sub-sampling DNN:

  • the input dimensionality to the first layer of the encoder - input_dim_encoder
  • the hidden output dimensionality of the first and second layers of the encoder - hidden_dim_encoder
  • the output dimensionality of the third layer of the encoder - output_dim_encoder
  • the dropout probability for the encoder - dropout_p_encoder
  • the sub-sampling factor for the encoder - sub_sampling_factor_encoder
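
To illustrate what a sub-sampling factor does to a feature sequence, here is a minimal sketch in plain Python that keeps every factor-th time step. The function name sub_sample, the choice to start from the first time step, and the toy data are assumptions for illustration; where exactly the encoder applies sub-sampling is defined by the paper and the repository code, not by this sketch.

```python
from typing import List, Sequence


def sub_sample(frames: Sequence[List[float]], factor: int) -> List[List[float]]:
    """Keep every `factor`-th time step, starting from the first one."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [list(frame) for frame in frames[::factor]]


# Ten time steps of 2-dimensional toy features.
sequence = [[float(t), float(t) * 0.5] for t in range(10)]
shorter = sub_sample(sequence, factor=4)
print(len(sequence), "->", len(shorter))  # prints "10 -> 3"
```

With a factor of 4, time steps 0, 4, and 8 are kept, so later layers process a sequence roughly a quarter as long.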

Similarly, the decoder block holds the settings for the decoder of the sub-sampling DNN:

  • the output dimensionality of the RNN of the decoder - output_dim_h_decoder
  • the number of classes for the classifier (it is filled automatically by the baseline system) - nb_classes
  • the dropout probability for the decoder - dropout_p_decoder
  • the maximum output time-steps for the decoder - max_out_t_steps
  • mode 0 for no attention in the decoder, mode 1 for using attention - mode
  • number of linear layers if using attention - num_attn_layers
  • the output dimensionality of the first layer in the attention mechanism - first_attn_layer_output_dim
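
The decoder options above can be pictured as a plain dictionary, with the mode key deciding whether attention is used. The following is a small sketch under that assumption; the function uses_attention and the check on the mode value are hypothetical and not taken from the repository.

```python
from typing import Any, Dict

# Values copied from the decoder block of model.yaml shown above.
decoder_settings: Dict[str, Any] = {
    "output_dim_h_decoder": 256,
    "nb_classes": None,  # left empty in the file, filled automatically
    "dropout_p_decoder": 0.25,
    "max_out_t_steps": 22,
    "mode": 0,
    "num_attn_layers": 0,
    "first_attn_layer_output_dim": 0,
}


def uses_attention(settings: Dict[str, Any]) -> bool:
    """Return True for mode 1 (attention), False for mode 0 (no attention)."""
    mode = settings["mode"]
    if mode not in (0, 1):
        raise ValueError(f"mode must be 0 or 1, got {mode!r}")
    return mode == 1
```

For the settings shown, uses_attention(decoder_settings) returns False, which matches the no_attn part of the settings directory name.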
