Comments (6)
Hi Antoine,
You're right, the randomized slicing during training works by taking a random slice of the time position embedding in order to learn time position embeddings for longer clips. For example, the models passt-s-f128-20sec-p16-s10-ap.474-swa.pt and
passt-s-f128-30sec-p16-s10-ap.473-swa.pt can accept audio clips of up to 20 or 30 seconds as input, while being trained only on 10-second clips of AudioSet.
from passt.
Thank you for your swift reply!
Hmm... not sure I get it!
What I understood is that you have different models that can each take clips up to a different maximum length (10s, 20s, 30s). Their input size varies accordingly (128x998, 128x2000, 128x3000, ...).
If I build a model with the configuration associated with passt-s-f128-20sec-p16-s10-ap.474-swa.pt, do we agree that I will only be able to fine-tune or infer on clips that are AT MOST 20sec long (but not more)?
In the first else, you do:
time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]
which makes sense to me: if I pass, for example, a 10sec clip to the model that can process clips of up to 20sec (so half of the model's maximum input length), I want to associate the time position embeddings for the first 10sec worth of patches.
In the second else, you do:
x = x[:, :, :, :time_new_pos_embed.shape[-1]]
which handles the case where the input clip is longer than what the model was trained on. So it makes sense here to trim x down to time_new_pos_embed.shape[-1],
since x is longer.
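To make sure I read the two branches correctly, here is a minimal 1-D sketch of their logic (plain Python lists standing in for the time axis of the 4-D torch tensors; the function name is mine, not PaSST's):

```python
# 1-D stand-in for the time axis; the real code slices the last dim
# of 4-D torch tensors. apply_time_pos_embed is an illustrative name.
def apply_time_pos_embed(x, pos_embed):
    if len(x) <= len(pos_embed):
        # First branch: input shorter than the trained maximum ->
        # keep only the embeddings for the patches we actually have
        # (time_new_pos_embed[:, :, :, :x.shape[-1]]).
        pos_embed = pos_embed[:len(x)]
    else:
        # Second branch: input longer than the trained maximum ->
        # trim the input down to the embedding length
        # (x = x[:, :, :, :time_new_pos_embed.shape[-1]]).
        x = x[:len(pos_embed)]
    return [xi + ei for xi, ei in zip(x, pos_embed)]

emb = list(range(10))  # embeddings for 10 time patches

# 5-patch clip on a 10-patch model: output keeps all 5 patches.
assert len(apply_time_pos_embed([0.0] * 5, emb)) == 5
# 12-patch clip: output is trimmed to the 10 embedded patches.
assert len(apply_time_pos_embed([0.0] * 12, emb)) == 10
```

Either way, patches and embeddings are paired from index 0 upward at inference time.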
What I struggle to understand is the use of randomization at training time.
toffset = torch.randint(1 + time_new_pos_embed.shape[-1] - x.shape[-1], (1,)).item()
time_new_pos_embed = time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]
Why are you using a random offset here? Shouldn't it work like time_new_pos_embed = time_new_pos_embed[:, :, :, :x.shape[-1]]
? I would expect to always pass embeddings starting from 0. It seems that here you could associate embeddings meant for later patches with earlier patches. Or doesn't it work like this?
Let's say we have x as :
x1 x2 x3 x4 x5
And our model is able to take in up to 10 patches:
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
Our time_new_pos_embed is initialized as:
e1 e2 e3 e4 e5 e6 e7 e8 e9 e10
Here I would associate like this:
x1+e1 x2+e2 x3+e3 x4+e4 x5+e5
But the code seems to suggest that during training, this can also happen, on a random basis:
x1+e4 x2+e5 x3+e6 x4+e7 x5+e8
or
x1+e2 x2+e3 x3+e4 x4+e5 x5+e6
etc...
Obviously the above illustration doesn't reflect the actual tensor dimensions, but I tried to lay out my thinking as best as I could.
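The pairings I'm describing can be reproduced in a few lines (a sketch only; the x/e labels are just the names from my illustration, and random.randrange mirrors the quoted torch.randint call):

```python
import random

patches = ["x1", "x2", "x3", "x4", "x5"]       # 5 input time patches
embeds = [f"e{i}" for i in range(1, 11)]       # model holds 10 embeddings

# Mirrors: toffset = torch.randint(1 + 10 - 5, (1,)).item()
# so toffset is drawn uniformly from {0, 1, 2, 3, 4, 5}.
toffset = random.randrange(1 + len(embeds) - len(patches))

# Mirrors: time_new_pos_embed[:, :, :, toffset:toffset + x.shape[-1]]
crop = embeds[toffset:toffset + len(patches)]

pairs = [f"{p}+{e}" for p, e in zip(patches, crop)]
# e.g. toffset=3 would give ['x1+e4', 'x2+e5', 'x3+e6', 'x4+e7', 'x5+e8']
```

So, as far as I can tell, any contiguous 5-embedding window can be paired with the 5 patches during training.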
Thanks a lot again in advance for your help.
Antoine
Ok I think I understand your logic and why you chose to do this!
I thought that when you said your models accept inference on clips up to 20s or 30s, it meant they were trained on full 20s or 30s clips, respectively. But you are saying that those models were trained on variable-length clips below these limits.
Regardless, it is a bit counterintuitive for me to randomize the time position encodings, as I would tend to think that if you randomly associate the same index's encoding with a different patch time each time during training, the model is not going to be able to learn any positional relationships. Or is it?
For me, e1 should always be associated to x1, e2 to x2, and so on and so forth.
It may not be a problem for stationary sounds, for which the mel-spectrograms will be similar from 0 to 10s and from 10 to 20s, for example. But what about acoustic signatures like a plane taking off, where you'll see a distinctive evolution of harmonics over the 20s (meaning the last 10 seconds are complementary to the first 10 seconds for classifying this sound)?
I am trying to think about cases where this approach may prove problematic. Sorry if it's a bit fuzzy...
Have you tried with AND without randomization?
The results you mentioned in your paper are really good so I guess it must work as is.
Cheers
Antoine
Regardless, it is a bit counterintuitive for me to randomize the time position encodings, as I would tend to think that if you randomly associate the same index's encoding with a different patch time each time during training, the model is not going to be able to learn any positional relationships. Or is it?
The encodings always cover 10 consecutive seconds, corresponding to 10 seconds of audio. Of course, you are right that it won't be as good as training on 20-second input. But given the limitation of having only 10-second training clips, you can this way train each 10-second crop of encodings to represent relative position. Keep in mind that AudioSet's 10-second clips are often clipped from longer audio.
Have you tried with AND without randomization?
I did not try without randomization, because the remaining encodings (e11-e20) would not be learned during training.
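This coverage argument is easy to check with a quick simulation (a sketch, not PaSST code; embedding indices 0-19 stand for e1-e20, and crops are 10 patches wide):

```python
import random
from collections import Counter

n_embed, n_patch, n_steps = 20, 10, 10000
random.seed(0)

seen_random, seen_fixed = Counter(), Counter()
for _ in range(n_steps):
    # With randomization: offset drawn from {0, ..., 10}, as in
    # torch.randint(1 + n_embed - n_patch, (1,)).
    off = random.randrange(1 + n_embed - n_patch)
    seen_random.update(range(off, off + n_patch))
    # Without randomization: the crop always starts at 0.
    seen_fixed.update(range(0, n_patch))

assert len(seen_random) == 20  # every embedding index gets gradient signal
assert len(seen_fixed) == 10   # e11-e20 (indices 10-19) are never trained
```

With the fixed crop, half the embedding table never receives a gradient, so a 20-second input at inference time would add untrained embeddings to its second half.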
Yeah sorry I thought about it multiple times and you're right.
So the 10-second encoding window is our fixed context, so to say. So we would not perform that well on sound events that are longer than this, or that need more than 10s to be "recognized". But that would likely never be the case, as a few seconds are sufficient most of the time (e.g. a few dog barks).
And training all encodings randomly makes sense to have a model capable of inferring on longer audio.
Thank you for your reply, as always!