
multimodal-net's Introduction

Project1: Sound -> Image

This project started from the goal of building an end-to-end model that generates images from sound. Last contributions: https://openreview.net/forum?id=SJxJtiRqt7

Implementations

  1. Model Structure
  - [x] implement spatial dropout
  - [-] implement U-net++ structure ---(PURGED)
   -> [x] implement partial skip connection between sound and image
  - [x] implement resnet block
  - [x] implement inverted residual block
  - [x] implement self-attention
  - [x] implement squeeze-excitation block
  - [x] implement handling spectral normalization
  - [x] implement normalizations
  - [x] implement batch-instance normalization
  - [x] implement spectral normalization
  - [x] implement "weights_init" to init VARIABLES
  - [x] implement WGAN-GP loss
  - [x] implement Hinge loss
  - [x] implement multi-scale perceptual loss
  - [x] implement multi-scale feature matching
  - [x] implement PatchGAN discriminator
  - [x] implement wide-domain-feature trainer with extra image data
  1. Preprocessing
  - [x] SOUND:
    - [x] STFT handling
    - [x] Normalizer
  - [x] IMAGE: implement image Normalizer
  1. DATASET
  - [x] video splitter
      - 1. extract frames, sound from video.
      - 2. parallelize
  - [x] custom sound and frame data loader for building datasets
  - [x] handle multiple audio files in STFT
  1. Training
  - [x] implement trainer
  - [x] implement save & load handler
  1. Testing
  - [ ] LATER
  1. Resource handling
  - [x] CPUs or GPUs
  1. cmds & others
  - [x] Define CLI commands with fire
  - [x] Define & Build docker image
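
The hinge loss listed in the checklist above can be sketched in a few lines. This is a minimal, framework-free illustration of the standard hinge GAN objective, not the code used in this repository: the function names `d_hinge_loss` and `g_hinge_loss` are hypothetical, and plain Python lists stand in for the tensors of discriminator scores a real trainer would use.

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss over raw (unbounded) discriminator scores.

    Real samples are penalized when their score falls below +1,
    fake samples when their score rises above -1.
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def g_hinge_loss(fake_scores):
    """Generator hinge loss: push discriminator scores on fakes upward."""
    return -sum(fake_scores) / len(fake_scores)


if __name__ == "__main__":
    # Hypothetical scores, chosen only for illustration.
    real = [2.0, 0.5]
    fake = [-2.0, 0.0]
    print("D loss:", d_hinge_loss(real, fake))
    print("G loss:", g_hinge_loss(fake))
```

Scores that already sit beyond the margin (here `2.0` for real and `-2.0` for fake) contribute nothing to the discriminator loss, which is the main difference from a plain Wasserstein-style objective.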

HOW TO RUN

  1. Download video
python -m scripts.v000_video_download downloads --video_codes=<youtube video codes> --save_dir=<path_to_save>
  1. Build dataset from video
python -m scripts.v001_dataset_builder video_to_datasets --video_path_dir=<video_path_dir> --file_filter="*" --offsets=[10] --save_dir="./dataset" --device="cpu" --start_index=0
  1. Optional: build dataset from image and sounds
python -m scripts.v001_dataset_builder_v2 frame_n_audio_to_datasets --frame_path_dir=<frame_dir> --audio_path_dir=<audio_dir> --file_filter="*" --save_dir="./dataset" --device="cpu" --start_index=0
  1. Optional: build audio labels using PANNs
python -m scripts.v002_generate_audio_labels generate --data_dir=<data_dir> --batch_size=256 --device="cuda" --sr=16000
  1. Train the model

Main structure

ex)
python -m scripts.v003_sound2image train --data_dir '/workspace/codes/datasets/v002_combine_dataset/train' --test_data_dir '/workspace/codes/datasets/v002_combine_dataset/test' --extraimg_data_dir '/workspace/codes/rawdata/005_open_images_v6_resized' --extraimg_type 'jpg' --d_config '{"mel_normalizer_savefile": "/workspace/codes/exps/config/mel_normalizer.json"}' --m_config '{"batch_size": 256, "epochs": 500, "recon_lambda": 10.0, "fm_lambda": 0.0, "pl_lambda": 0.1, "g_ce_lambda": 0.1, "d_ce_lambda": 0.1, "extraimg_ratio": 0.1, "g_sn": False, "d_sn": True, "loss_type": "hinge", "g_norm": "BN", "flip": True, "dropout": 0.1, "load_strict": False}' --exp_dir '/workspace/codes/exps/v001/'
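
Note that the `--m_config` value in the command above is written with Python literals (`True`, `False`), not JSON (`true`, `false`). The sketch below shows the difference using only the standard library. It assumes the value is parsed as a Python literal, which is how Python Fire treats arguments as far as we know; the helper names `parse_config` and `is_valid_json` are hypothetical and are not part of this repository.

```python
import ast
import json

# A shortened version of the --m_config value from the command above.
M_CONFIG = '{"batch_size": 256, "epochs": 500, "g_sn": False, "d_sn": True, "loss_type": "hinge"}'


def parse_config(text):
    """Parse a config string written with Python literals into a dict."""
    value = ast.literal_eval(text)
    if not isinstance(value, dict):
        raise ValueError("config must be a dict literal")
    return value


def is_valid_json(text):
    """Return True if the string parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    config = parse_config(M_CONFIG)
    print(config["loss_type"], config["d_sn"])
    print("valid JSON:", is_valid_json(M_CONFIG))
```

If you generate these config strings from another tool, emit Python-style booleans rather than JSON ones, or convert them before passing them on the command line.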

Second structure

ex)
python -m scripts.v003_sound2image_v2 train --data_dir '/workspace/codes/datasets/v002_combine_dataset/train' --test_data_dir '/workspace/codes/datasets/v002_combine_dataset/test'  --d_config '{"mel_normalizer_savefile": "/workspace/codes/exps/config/mel_normalizer.json"}' --m_config '{"batch_size": 256, "epochs": 500, "recon_lambda": 10.0, "recon_feat_lambda": 1.0, "g_sn": False, "d_sn": True, "loss_type": "hinge", "g_norm": "BIN", "flip": True, "dropout": 0.1, "load_strict": False}' --exp_dir '/workspace/codes/exps/v001/'
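
Both commands pass a `mel_normalizer_savefile` in `--d_config`. The README does not describe the contents of that file, so the following is only a sketch of one plausible design: a mean/standard-deviation normalizer whose statistics are stored as JSON. The class name `MeanStdNormalizer` and the `{"mean": ..., "std": ...}` file layout are assumptions made for illustration, and flat lists of floats stand in for mel-spectrogram values.

```python
import json
import statistics


class MeanStdNormalizer:
    """Hypothetical normalizer that stores its statistics in a JSON file."""

    def __init__(self, mean=0.0, std=1.0):
        self.mean = mean
        self.std = std

    def fit(self, values):
        """Estimate mean and (population) standard deviation from data."""
        self.mean = statistics.fmean(values)
        # Fall back to 1.0 so constant input does not cause division by zero.
        self.std = statistics.pstdev(values) or 1.0
        return self

    def normalize(self, values):
        return [(v - self.mean) / self.std for v in values]

    def denormalize(self, values):
        return [v * self.std + self.mean for v in values]

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"mean": self.mean, "std": self.std}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            stats = json.load(f)
        return cls(stats["mean"], stats["std"])
```

Saving the statistics once and reloading them at training and test time keeps both splits on the same scale, which is presumably why the save file is passed explicitly to the training command.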
