
soundq's Introduction

SoundQ — Enhanced sound event localization and detection in real 360-degree audio-visual soundscapes.

Features

  • An audio-visual synthetic data generator with spatial audio and 360-degree video.

  • A suite of scripts to perform data augmentation on 360-degree audio and video.

    • Integrating audio channel swapping (ACS) as per Wang et al.

    • Integrating video pixel swapping (VPS) as per Wang et al.

  • An enhanced audio-visual SELDNet model with performance comparable to the audio-only SELDNet23.

    • The model integrates Detic, but any other detection model can also be integrated within the training pipeline.
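The two augmentations listed above must transform audio, labels, and video consistently. Below is a minimal sketch, not the repository's implementation, assuming FOA audio in ACN channel order (W, Y, Z, X), Cartesian (x, y, z) DOA labels, and equirectangular frames with azimuth 0 at the center column and azimuth increasing to the left. It restricts the transforms to azimuth mirroring, elevation mirroring, and rotations by multiples of 90°, which is the family for which channel swapping in FOA reduces to exact swaps and sign flips. The names `acs_foa` and `vps_equirect` are hypothetical.

```python
import numpy as np

# cos and sin of k * 90 degrees as exact integers, indexed by k % 4.
_COS = (1, 0, -1, 0)
_SIN = (0, 1, 0, -1)


def _rotate_xy(x, y, k):
    """Rotate (x, y) counterclockwise about the z-axis by k * 90 degrees."""
    c, s = _COS[k % 4], _SIN[k % 4]
    return c * x - s * y, s * x + c * y


def acs_foa(audio, doa_xyz, k=0, mirror_az=False, mirror_el=False):
    """Hypothetical ACS sketch for FOA audio of shape (4, T) in ACN order (W, Y, Z, X).

    Maps azimuth phi -> (-phi if mirror_az else phi) + k * 90 deg and elevation
    theta -> -theta if mirror_el. Because the dipole channels X, Y, Z are
    proportional to the source direction, Cartesian DOA labels (..., 3) in
    (x, y, z) order receive exactly the same transform.
    """
    w, y, z, x = audio
    lx, ly, lz = doa_xyz[..., 0], doa_xyz[..., 1], doa_xyz[..., 2]
    if mirror_az:  # phi -> -phi flips the sign of the y component
        y, ly = -y, -ly
    if mirror_el:  # theta -> -theta flips the sign of the z component
        z, lz = -z, -lz
    x, y = _rotate_xy(x, y, k)
    lx, ly = _rotate_xy(lx, ly, k)
    return np.stack([w, y, z, x]), np.stack([lx, ly, lz], axis=-1)


def vps_equirect(frame, k=0, mirror_az=False, mirror_el=False):
    """Hypothetical VPS sketch for an equirectangular frame of shape (H, W[, C]).

    Assumes azimuth 0 at the center column, azimuth increasing to the left,
    elevation increasing upward, and a width divisible by 4, so the same
    transform as acs_foa becomes flips plus a circular column shift.
    """
    width = frame.shape[1]
    if width % 4:
        raise ValueError("frame width must be divisible by 4")
    out = frame
    if mirror_az:
        out = out[:, ::-1]  # phi -> -phi: horizontal flip
    if mirror_el:
        out = out[::-1]  # theta -> -theta: vertical flip
    # phi -> phi + k * 90 deg moves content k * width / 4 columns to the left.
    return np.roll(out, -k * (width // 4), axis=1)
```

Drawing k from {0, 1, 2, 3} and the two flags at random, then applying the same draw to the audio, the labels, and every frame of a clip, keeps the two modalities spatially aligned.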

Installation

See installation instructions.

Results on development dataset

We benchmark our model using the DCASE Challenge 2023 Task 3 SELD evaluation metrics.

The following table includes only the best-performing system (as documented in the DCASE results). The evaluation metric scores for the test split of the development dataset are given below.

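| Model                   | Dataset            | ER20° | F20°   | LECD  | LRCD   |
|-------------------------|--------------------|-------|--------|-------|--------|
| AO SELDNet23 (baseline) | Ambisonic*         | 0.57  | 29.9 % | 21.6° | 47.7 % |
| AV SELDNet23 (baseline) | Ambisonic + Video  | 1.07  | 14.3 % | 48.0° | 35.5 % |
| AV SELDNet23 (ours)     | Ambisonic* + Video | 0.65  | 24.9 % | 18.7° | 37.5 % |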

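Legend: AO = audio-only, AV = audio-visual, FOA = first-order Ambisonics, * = FOA + Multi-ACCDOA. ER20° and F20° are the location-aware error rate and F-score (a detection counts only if its localization error is within 20°); LECD and LRCD are the class-dependent localization error and localization recall. Lower is better for ER and LE; higher is better for F and LR.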
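The DCASE 2023 Task 3 baseline evaluation code also reports an aggregate SELD score, the mean of ER20°, 1 - F20°, LECD/180° and 1 - LRCD (lower is better). The sketch below is illustrative, not the official evaluation code: it assumes F and LR are given as fractions rather than percentages, and the function names are hypothetical. It also computes the angular distance between two Cartesian DOA vectors, which is the quantity compared against the 20° threshold.

```python
import numpy as np


def angular_distance_deg(u, v):
    """Great-circle angle in degrees between two Cartesian DOA vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def seld_score(er, f, le_deg, lr):
    """Aggregate SELD score (lower is better); f and lr are fractions in [0, 1]."""
    return float(np.mean([er, 1.0 - f, le_deg / 180.0, 1.0 - lr]))


# Rows of the results table, with percentages converted to fractions.
rows = {
    "AO SELDNet23 (baseline)": (0.57, 0.299, 21.6, 0.477),
    "AV SELDNet23 (baseline)": (1.07, 0.143, 48.0, 0.355),
    "AV SELDNet23 (ours)": (0.65, 0.249, 18.7, 0.375),
}
for name, metrics in rows.items():
    print(f"{name}: SELD score = {seld_score(*metrics):.3f}")

# A prediction 15 degrees off in azimuth is within the 20-degree threshold.
ref = [1.0, 0.0, 0.0]
pred = [np.cos(np.radians(15.0)), np.sin(np.radians(15.0)), 0.0]
print(angular_distance_deg(ref, pred) < 20.0)
```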

Citation

If you find our work useful, please cite our paper:

@article{roman2024enhanced,
  title={Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes},
  author={Roman, Adrian S and Balamurugan, Baladithya and Pothuganti, Rithik},
  journal={arXiv preprint arXiv:2401.17129},
  year={2024}
}

soundq's People

Contributors

adriansroman, aromanusc, rithikp06

soundq's Issues

Change the number of channels?

I have successfully generated audio and video data with the audio-visual synthetic data generator, but the generated audio has 32 channels, while the audio used in the DCASE challenge has 4 channels. How can I convert the channels?
Looking forward to your reply very much.

How to generate audio-visual synthetic data?

Thank you for providing the code. I downloaded the YouTube videos and tried to run audiovisual_synth.py, but it fails to run. I have some questions and hope you can help me answer them:

  1. The paper mentions that METU-SPARG RIR data is used, but the code uses em32 (AIR data) files. Which one should actually be used?

  2. There is no file named IR_em32.wav in the METU-SPARG dataset. Why does the code load this file? Does it need to be generated separately?

  3. Could you provide a README for generating audio-visual synthetic data?

Looking forward to your reply very much!!!! :)
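Both issues touch on how the generator produces multichannel audio. Assuming it spatializes each event by convolving it with a measured multichannel RIR (as the em32 IR file suggests), the output has as many channels as the RIR, which would explain the 32-channel result, since the Eigenmike em32 has 32 capsules. The sketch below illustrates only that convolution step under this assumption; `spatialize` and `add_spatial_event` are hypothetical helpers, not functions from audiovisual_synth.py, and converting the result to a 4-channel DCASE format is a separate encoding step not shown here.

```python
import numpy as np


def spatialize(event, rir):
    """Convolve a mono event (T,) with a multichannel RIR (Q, L) -> (Q, T + L - 1)."""
    n = event.shape[0] + rir.shape[1] - 1
    nfft = 1 << (n - 1).bit_length()  # next power of two >= n for the FFT
    spec = np.fft.rfft(event, nfft)[None, :] * np.fft.rfft(rir, nfft, axis=1)
    return np.fft.irfft(spec, nfft, axis=1)[:, :n]


def add_spatial_event(mixture, event, rir, onset):
    """Hypothetical helper: add a spatialized event into mixture (Q, N) at sample onset.

    The mixture is modified in place; the tail that overruns the mixture is dropped.
    """
    if onset >= mixture.shape[1]:
        return mixture
    wet = spatialize(event, rir)
    end = min(mixture.shape[1], onset + wet.shape[1])
    mixture[:, onset:end] += wet[:, : end - onset]
    return mixture
```

With a 32-channel RIR, `spatialize` returns 32 channels; every channel count downstream follows from the RIR used.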
