soundq's Introduction

SoundQ — Enhanced sound event localization and detection in real 360-degree audio-visual soundscapes.

Features

- An audio-visual synthetic data generator with spatial audio and 360-degree video.
- A suite of scripts to perform data augmentation on 360-degree audio and video.
  - Audio channel swapping (ACS), as per Wang et al.
  - Video pixel swapping (VPS), as per Wang et al.
- An enhanced audio-visual SELDNet model with performance comparable to the audio-only SELDNet23.
  - The model integrates Detic, but any other detection model can be integrated into the training pipeline.
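
The idea behind audio channel swapping can be sketched as follows. This is a minimal illustration, not the repository's augmentation script and not the exact transformation set of Wang et al. It assumes first-order ambisonics channels ordered (W, Y, Z, X) with angles in radians, and the names `encode_foa`, `swap_foa_channels`, and `transform_label` are hypothetical. Rotating or mirroring the sound field only swaps or negates channels, so the direction labels can be updated with the same transformation.

```python
import math


def encode_foa(azimuth, elevation):
    """Return (W, Y, Z, X) gains of a unit plane wave; angles in radians."""
    return (
        1.0,
        math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
        math.cos(azimuth) * math.cos(elevation),
    )


def swap_foa_channels(w, y, z, x, quarter_turn=False,
                      flip_azimuth=False, flip_elevation=False):
    """Apply channel swaps/negations to FOA channels (assumed W, Y, Z, X)."""
    if quarter_turn:      # azimuth -> azimuth + 90 degrees
        x, y = -y, x
    if flip_azimuth:      # azimuth -> -azimuth
        y = -y
    if flip_elevation:    # elevation -> -elevation
        z = -z
    return w, y, z, x


def transform_label(azimuth, elevation, quarter_turn=False,
                    flip_azimuth=False, flip_elevation=False):
    """Apply the matching transformation, in the same order, to a DOA label."""
    if quarter_turn:
        azimuth += math.pi / 2
    if flip_azimuth:
        azimuth = -azimuth
    if flip_elevation:
        elevation = -elevation
    return azimuth, elevation
```

Because the function only swaps and negates its arguments, the same code should also work when each channel is a NumPy array of samples rather than a single gain.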

Installation

Results on development dataset

We benchmark our model following the DCASE Challenge 2023 Task 3 SELD evaluation metrics.

The following table includes only the best-performing system (as documented in the DCASE results). The evaluation metric scores for the test split of the development dataset are given below.

Model	Dataset	ER_20°	F_20°	LE_CD	LR_CD
AO SELDNet23 (baseline)	Ambisonic*	0.57	29.9 %	21.6°	47.7 %
AV SELDNet23 (baseline)	Ambisonic + Video	1.07	14.3 %	48.0°	35.5 %
AV SELDNet23 (ours)	Ambisonic* + Video	0.65	24.9 %	18.7°	37.5 %

Legend: AO=audio-only, AV=audio-visual, FOA=first order ambisonics format, *=FOA + Multi-ACCDOA
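
The 20° subscript on ER and F refers to a spatial threshold: a detection counts as correct only when the class matches and the estimated direction lies within 20° of the reference. The check can be sketched as below. This is an illustration of the angular distance only, not the official DCASE evaluation code; the function names are hypothetical, and directions are assumed to be (azimuth, elevation) pairs in degrees.

```python
import math


def angular_error_deg(az_ref, el_ref, az_est, el_est):
    """Great-circle angle in degrees between two directions given in degrees."""
    a1, e1, a2, e2 = map(math.radians, (az_ref, el_ref, az_est, el_est))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def within_threshold(ref, est, threshold_deg=20.0):
    """True if the estimate is within the spatial threshold of the reference."""
    return angular_error_deg(*ref, *est) <= threshold_deg
```

For example, `within_threshold((10.0, 0.0), (25.0, 0.0))` compares two directions that are 15° apart on the horizontal plane, so it passes the 20° threshold.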

Citation

If you find our work useful, please cite our paper:

@article{roman2024enhanced,
  title={Enhanced Sound Event Localization and Detection in Real 360-degree audio-visual soundscapes},
  author={Roman, Adrian S and Balamurugan, Baladithya and Pothuganti, Rithik},
  journal={arXiv preprint arXiv:2401.17129},
  year={2024}
}

soundq's Issues

Change the number of channels?

I have successfully generated audio and video data using the audio-visual synthetic data generator. The generated audio has 32 channels, but the audio used in the DCASE competition has 4 channels. How can I convert the 32-channel audio to 4 channels?
Looking forward to your reply very much.

How to generate audio-visual synthetic data?

Thank you for providing the code. I have downloaded the YouTube video and tried to run audiovisual_synth.py, but it does not run. I have some questions and hope you can help me answer them:

- The paper mentions that METU-SPARG RIR data is used, but the files used in the code are em32 (AIR data). Which one should actually be used?
- There is no file named IR_em32.wav in the METU-SPARG dataset. Why is this file loaded in the code? Does it need to be generated separately?
- Can you provide a README for generating audio-visual synthetic data?

Looking forward to your reply very much!!!! :)

Repository: aromanusc/soundq