
Comments (6)

kkoutini commented on May 28, 2024

Hi Antoine,
You're right, the randomized slicing during training works by taking a substring of the time position embedding, in order to learn time position embeddings for longer clips. For example, the models passt-s-f128-20sec-p16-s10-ap.474-swa.pt and passt-s-f128-30sec-p16-s10-ap.473-swa.pt can accept audio clips of 20 or 30 seconds as input, while being trained only on 10-second clips of AudioSet.
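
A minimal sketch of that training-time slicing (not the actual PaSST source; the shapes, sizes, and 768-dim embedding here are assumptions):

import torch

time_new_pos_embed = torch.randn(1, 768, 1, 200)  # embedding long enough for 20 s
x = torch.randn(8, 768, 1, 100)                   # batch of 10-second training clips

# pick a random 10-second window of the longer embedding, so that every
# embedding position eventually receives gradient during training
toffset = torch.randint(1 + time_new_pos_embed.shape[-1] - x.shape[-1], (1,)).item()
x = x + time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]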


Antoine101 commented on May 28, 2024

Thank you for your swift reply!

Hmm... not sure I get it!

What I understood is that you have different models, each able to take clips up to a different maximum length (10 s, 20 s, 30 s). Their input sizes vary accordingly (128x998, 128x2000, 128x3000, ...).

If I build a model with the configuration associated with passt-s-f128-20sec-p16-s10-ap.474-swa.pt, do we agree that I will only be able to fine-tune or infer on clips that are as long as or shorter than 20 s (but not longer)?

In the first else, you do:
time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]
which makes sense to me: if I pass a clip that is 10 s long, for example (while working with the model that can process clips of up to 20 s), so half the size of what the model was trained on, I want to associate the time position embeddings for the first 10 seconds' worth of patches.

In the second else, you do:
x = x[:, :, :, :time_new_pos_embed.shape[-1]]
which handles the case where the input clip is longer than what the model was trained on. So it makes sense here to trim x to time_new_pos_embed.shape[-1], as x is longer.
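
Putting the two branches together, a minimal sketch of the inference-time behaviour (shapes assumed, not the actual PaSST code) would be:

import torch

time_new_pos_embed = torch.randn(1, 768, 1, 200)  # supports up to 20 s
x = torch.randn(1, 768, 1, 100)                   # e.g. a 10-second clip

if x.shape[-1] <= time_new_pos_embed.shape[-1]:
    # shorter input: keep the embeddings from t=0 up to the clip length
    time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]
else:
    # longer input: trim the input to the supported maximum
    x = x[:, :, :, :time_new_pos_embed.shape[-1]]

x = x + time_new_pos_embed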

What I struggle to understand is the use of randomization at training time.

toffset = torch.randint(1 + time_new_pos_embed.shape[-1] - x.shape[-1], (1,)).item()
time_new_pos_embed = time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]

Why are you using a random offset here? Shouldn't it work like time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]? I would expect to always pass embeddings starting from 0. It seems that here you could associate embeddings meant for later patches with earlier patches. Or doesn't it work like this?

Let's say we have x as :
x1 x2 x3 x4 x5
And our model is able to take in up to 10 patches:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
Our time_new_pos_embed is initialized as:
e1 e2 e3 e4 e5 e6 e7 e8 e9 e10
Here is how I would associate them:
x1+e1 x2+e2 x3+e3 x4+e4 x5+e5
But the code seems to suggest that during training, this can also happen, on a random basis:
x1+e4 x2+e5 x3+e6 x4+e7 x5+e8
or
x1+e2 x2+e3 x3+e4 x4+e5 x5+e6
etc...

Obviously the above illustration doesn't reflect the actual tensor dimensions, but I tried to lay out my thinking as best I could.
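
For reference, a toy script (hypothetical sizes, using the random offset from the quoted code) reproduces exactly those pairings:

import torch

num_embed, num_patch = 10, 5  # 10 embeddings e1..e10, 5 patches x1..x5
for _ in range(3):
    toffset = torch.randint(1 + num_embed - num_patch, (1,)).item()
    print(" ".join(f"x{i + 1}+e{toffset + i + 1}" for i in range(num_patch)))
# one possible draw: x1+e4 x2+e5 x3+e6 x4+e7 x5+e8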

Thanks a lot again in advance for your help.

Antoine


kkoutini commented on May 28, 2024


Antoine101 commented on May 28, 2024

Ok I think I understand your logic and why you chose to do this!

I thought that when you said your models accept inference on clips up to 20 s or 30 s, it meant they were respectively trained on strict 20 s or 30 s clips. But you are saying that those models were trained on variable-length clips below these limits.

Regardless, it is a bit counterintuitive for me to randomize the time position encodings, as I would tend to think that if you randomly associate the same index encoding with a different patch time each time during training, the model is not going to be able to learn any positioning relationships. Or is it?
For me, e1 should always be associated with x1, e2 with x2, and so on and so forth.
It may not be a problem for stationary sounds, whose mel-spectrograms will look similar from 0 to 10 s and from 10 to 20 s, for example, but what about acoustic signatures like a plane taking off, where you'll see a distinctive evolution of harmonics across the 20 s (meaning the last 10 seconds are complementary to the first 10 seconds for classifying this sound)?
I am trying to think of cases where this approach may prove problematic. Sorry if it's a bit fuzzy...

Have you tried with AND without randomization?
The results you mentioned in your paper are really good, so I guess it must work as is.

Cheers

Antoine


kkoutini commented on May 28, 2024

Regardless, it is a bit counterintuitive for me to randomize the time position encodings, as I would tend to think that if you randomly associate the same index encoding with a different patch time each time during training, the model is not going to be able to learn any positioning relationships. Or is it?

The encodings always cover 10 consecutive seconds, corresponding to 10 seconds of audio. Of course, you are right that it won't be as good as training on 20-second input. But given the limitation of having only 10-second training clips, this way you can train each 10-second crop of the encodings to represent relative position. Keep in mind that AudioSet's 10-second clips are often clipped from longer audio.

Have you tried with AND without randomization?

I did not try without randomization, because the remaining encodings (e11-e20) would not be learned during training.
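
A quick toy check (hypothetical sizes, not the PaSST code) makes this concrete: random offsets eventually touch every embedding index, while a fixed offset of 0 would only ever train e1..e10:

from collections import Counter
import torch

num_embed, num_patch = 20, 10
hits = Counter()
for _ in range(1000):
    toffset = torch.randint(1 + num_embed - num_patch, (1,)).item()
    hits.update(range(toffset, toffset + num_patch))
print(sorted(hits))  # all 20 indices appear, so e11-e20 get trained too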


Antoine101 commented on May 28, 2024

Yeah, sorry, I thought about it multiple times and you're right.
So the 10-second encoding is our fixed context, so to speak. So we would not perform that well for sound events that are longer than this, or that need more than 10 s to be "recognized". But that would likely never be the case, as only a few seconds are sufficient in most cases (e.g. a few dog barks).
And training all encodings randomly makes sense to obtain a model capable of inference on longer audio.

Thank you for your reply, as always!


