Hi! First of all, I think this tool is amazing as it does help improve quality whe

Quality not up to par with the demo maybe. about zero_shot_audio_source_separation HOT 1 CLOSED

retrocirce commented on May 29, 2024

Quality not up to par with the demo maybe.

from zero_shot_audio_source_separation.

Comments (1)

RetroCirce commented on May 29, 2024

Hi,
Thank you for your question. I will reply to your questions below.

zeroshot_asp_full.ckpt and zeroshot_asp_held_out.ckpt:

The full ckpt is the checkpoint we trained on the full AudioSet. We also trained another model with several AudioSet classes held out, for a separate experiment in our paper. So the full ckpt is usually the better one, because it saw all the classes during training.

(other, vocals, bass, drums) same output:

I think you should look at the "test_key" variable in "config.py". If you use the "inference" mode, you need to prepare a mixture audio (the one you want to separate) and a set of query audios (examples of the source you want to extract). There are two inference variables in config.py for these that you should take a look at. After you fill them in, change "test_key" to contain only one name, like ["violin"]; it just labels the source you want to separate. Note that the inference mode can only separate one source at a time. You could change the code a little to make it separate multiple sources at once (I will mark this as a feature request).

If you are using the "test" mode (i.e. musdb mode), you don't need to provide a query, but you need to set testavg_path and testset_path. That mode separates the mixture into drums, bass, vocals, and other.

So if you are using the "inference" mode but set test_key to four keys, you will get the same output four times, because you only have one query. The name in test_key does not determine the source to separate; it is just a name.
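Since the inference mode handles one source per run, separating several sources means one pass per source, each with its own query set. The following is a minimal sketch of that loop, not the repository's code: `separate_one` is a hypothetical stand-in for whatever single-source inference call you wrap, and audio is represented as plain lists of samples for illustration.

```python
from typing import Callable, Dict, List, Sequence

Audio = List[float]  # illustrative stand-in for a waveform


def separate_each_source(
    mixture: Audio,
    queries: Dict[str, Sequence[Audio]],
    separate_one: Callable[[Audio, Sequence[Audio]], Audio],
) -> Dict[str, Audio]:
    """Run one single-source pass per named source and collect the results.

    `separate_one` is hypothetical: it takes the mixture and the query clips
    for ONE source and returns the separated audio for that source.
    """
    results: Dict[str, Audio] = {}
    for name, query_clips in queries.items():
        if not query_clips:
            raise ValueError(f"no query clips given for source {name!r}")
        # Each source gets its own query set; the name is only a label.
        results[name] = separate_one(mixture, query_clips)
    return results
```

The key point is that each name maps to different query clips. Passing four names that all share one query would reproduce the "same output" symptom described above.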

only 10 seconds?

Actually no, the separation model supports any length for the query and the mixture. Usually we cut them into small pieces, separate the pieces one by one, and concatenate the results. The 10-second limitation only applies to the sound event detection system during training, because we think this length is long enough for audio classification. You can change it to another length as needed. It does not affect much unless you make it much longer or shorter (like 1 sec or 100 sec).
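The cut, separate, and concatenate procedure described above can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: `separate_chunk` is a hypothetical per-chunk model call, audio is a plain list of samples, and zero-padding the final short chunk is an assumed choice that the thread does not specify.

```python
from typing import Callable, List

Audio = List[float]  # illustrative stand-in for a waveform


def separate_in_chunks(
    mixture: Audio,
    chunk_len: int,
    separate_chunk: Callable[[Audio], Audio],
) -> Audio:
    """Split the mixture into fixed-length chunks, process each, and concatenate."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    output: Audio = []
    for start in range(0, len(mixture), chunk_len):
        chunk = mixture[start:start + chunk_len]
        valid = len(chunk)
        if valid < chunk_len:
            # Assumption: pad the last chunk with zeros to the fixed length.
            chunk = chunk + [0.0] * (chunk_len - valid)
        separated = separate_chunk(chunk)
        # Drop the padded tail so the output matches the input length.
        output.extend(separated[:valid])
    return output
```

With a sample rate of `sr`, a 10-second chunk corresponds to `chunk_len = 10 * sr` samples.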

last question what would you recommend setting up in configs for the best quality possible

We think one of the best possible queries is one taken from the mixture audio itself. For example, say you have a mixture with a violin lead, and you notice about 1-2 sec of solo violin in it. You can extract that part as the query. This usually works best, since it has the closest timbre and acoustic feel to the mixture (they come from the same recording). If you don't have such a solo part, another choice is to collect as many other violin samples as possible (like 50 pieces). This is what we do when testing on MUSDB, where we collect 100 samples from its training set to construct the vocal, bass, drum, and other latent queries.
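Both query strategies above can be sketched in a few lines. This is a hedged illustration, not the repository's code: `extract_query` simply slices a solo passage out of the mixture by time, and `average_embeddings` assumes that a latent query built from many samples is the element-wise mean of their embeddings, which the thread does not state explicitly. Both function names are hypothetical, and the embeddings are assumed to come from whatever query encoder you use.

```python
from typing import List, Sequence


def extract_query(
    samples: Sequence[float], sample_rate: int, start_sec: float, end_sec: float
) -> List[float]:
    """Cut the [start_sec, end_sec) span out of a waveform to use as a query."""
    if not 0 <= start_sec < end_sec:
        raise ValueError("need 0 <= start_sec < end_sec")
    start = int(round(start_sec * sample_rate))
    end = int(round(end_sec * sample_rate))
    return list(samples[start:end])


def average_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of query embeddings (assumed way to pool many samples)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]
```

For the 1-2 sec solo violin example, `extract_query(mixture, sr, t0, t0 + 2.0)` would give the query clip, where `t0` is the time at which the solo passage starts.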

I hope the above information answers your questions.

Thanks!!
