
stonemo / avgn


Official implementation for AVGN

License: Apache License 2.0

Python 100.00%

audio-visual-correspondence audio-visual-learning visual-sound-localization weakly-supervised-learning

avgn's Introduction

[CVPR-2023] Audio-Visual Grouping Network for Sound Localization from Mixtures

AVGN is a new approach for disentangling category-wise semantic features for each source from the mixture and image to localize multiple sounding sources simultaneously.

AVGN Illustration

Environment

To set up the environment, run

pip install -r requirements.txt

Datasets

MUSIC

Data can be downloaded from Sound of Pixels

VGG-Instruments

Data can be downloaded from Mix and Localize: Localizing Sound Sources in Mixtures

VGG-Sound Source

Data can be downloaded from Localizing Visual Sounds the Hard Way

Train

For training the AVGN model, please run

python train.py --multiprocessing_distributed \
    --train_data_path /path/to/vgginstruments/train/ \
    --test_data_path /path/to/vgginstruments/ \
    --test_gt_path /path/to/vgginstruments/anno/ \
    --experiment_name vgginstruments_multi_avgn \
    --model 'avgn' \
    --trainset 'vgginstruments_multi' --num_class 37 \
    --testset 'vgginstruments_multi' \
    --epochs 100 \
    --batch_size 128 \
    --init_lr 0.0001 \
    --attn_assign soft \
    --dim 512 \
    --depth_aud 3 \
    --depth_vis 3
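
When the same training run has to be repeated with different paths or hyperparameters, it can help to assemble the command programmatically. The sketch below is a hypothetical helper (the function name `build_train_command` is an assumption, not part of the repository); it only reuses the flags and values shown in the command above and does not assume anything about how train.py parses them.

```python
import shlex

def build_train_command(options, flags=("multiprocessing_distributed",)):
    """Assemble a `python train.py ...` argument list (hypothetical helper)."""
    args = ["python", "train.py"]
    # Value-less switches such as --multiprocessing_distributed come first.
    args += [f"--{name}" for name in flags]
    # Remaining options are emitted as "--name value" pairs, in insertion order.
    for name, value in options.items():
        args += [f"--{name}", str(value)]
    return args

# Values copied from the training command above; paths are placeholders.
options = {
    "train_data_path": "/path/to/vgginstruments/train/",
    "test_data_path": "/path/to/vgginstruments/",
    "test_gt_path": "/path/to/vgginstruments/anno/",
    "experiment_name": "vgginstruments_multi_avgn",
    "model": "avgn",
    "trainset": "vgginstruments_multi",
    "num_class": 37,
    "testset": "vgginstruments_multi",
    "epochs": 100,
    "batch_size": 128,
    "init_lr": "0.0001",
    "attn_assign": "soft",
    "dim": 512,
    "depth_aud": 3,
    "depth_vis": 3,
}

command = build_train_command(options)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root to launch training.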

Test

For testing and visualization, simply run

python test.py --test_data_path /path/to/vgginstruments/ \
    --test_gt_path /path/to/vgginstruments/anno/ \
    --model_dir checkpoints \
    --experiment_name vgginstruments_multi_avgn \
    --model 'avgn' \
    --testset 'vgginstruments_multi' \
    --alpha 0.3 \
    --attn_assign soft \
    --dim 512 \
    --depth_aud 3 \
    --depth_vis 3

Citation

If you find this repository useful, please cite our paper:

@inproceedings{mo2023audiovisual,
  title={Audio-Visual Grouping Network for Sound Localization from Mixtures},
  author={Mo, Shentong and Tian, Yapeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}


avgn's Issues

Data set processing

Could you please provide the dataset preprocessing script? Thank you very much.

Dataset Download

Thank you for your great work! I followed your introduction but failed to download the datasets. Could you please provide the dataset download links used in your paper, along with more detailed instructions?

How to download VGG-Instruments dataset

I cannot find this dataset at the link "https://github.com/hxixixh/mix-and-localize".

Pretrain models

Could you release the pre-trained models of this project?

Incorrect Evaluation Method for Multi-Source Evaluation

The paper mentions using 448 x 224 images for evaluation. However, your code splits each image into two 224 x 224 images before evaluating. This differs from the methodology described in both your paper and the "Mix and Localize" paper. Figure 2 in your paper suggests that n localization maps are generated for n mixture sources from a single image, whereas the code evaluates n images that have already been split. These two approaches can be expected to produce significantly different results: in the CAP calculation, the latter excludes the half of the area where the other classes are present. Could you please specify precisely which evaluation method was used?

Code References:
In datasets.py, in the getitem function, multi-source samples are not kept as 3 x 448 x 224 images; instead, a stack operation creates a structure of 2 x 3 x 224 x 224.

In model.py, in the forward function, when the image channel length is 5 (indicating the presence of stacked images), only the first image is considered.

In train.py, the validate_multi function does not utilize mixture images of size 448x224; rather, it divides them into two 224x224 images, processes them through the model, evaluates each, and averages the results.
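
To illustrate why the two protocols can disagree, here is a minimal pure-Python sketch. It is not the repository's evaluation code. It uses plain IoU as a simpler stand-in for the CAP metric, and it assumes the mixture is two 224-wide frames placed side by side. The box positions and all function names are hypothetical.

```python
# Hypothetical illustration; names, sizes, and layout are assumptions.
H, W = 224, 224  # size of each single-source frame

def box_pixels(top, left, height, width):
    """Return the set of (row, col) pixels covered by a rectangle."""
    return {(r, c)
            for r in range(top, top + height)
            for c in range(left, left + width)}

def iou(pred, gt):
    """Intersection over union of two pixel sets."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0

def crop_left_frame(pixels):
    """Keep only the pixels that fall inside the left 224-wide frame."""
    return {(r, c) for (r, c) in pixels if c < W}

# Ground truth for source A lies entirely in the left frame of the mixture.
gt_a = box_pixels(50, 50, 100, 100)
# Predicted map for source A: the correct region plus a spurious
# response of equal size in the right frame, where source B is located.
pred_a = gt_a | box_pixels(50, W + 50, 100, 100)

iou_full = iou(pred_a, gt_a)                                     # whole mixture
iou_split = iou(crop_left_frame(pred_a), crop_left_frame(gt_a))  # left frame only
print(iou_full, iou_split)  # prints: 0.5 1.0
```

In this toy case the spurious response in the other source's frame halves the score on the full mixture but is invisible once the frames are evaluated separately, which is the discrepancy the issue asks about.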
