
Comments (3)

mdda commented on June 27, 2024

How familiar are you with the Intrinsic Dimension paper? It's been a while, but I seem to recall the basic idea: one can replace an existing network parameterisation W with one that looks like W_new = W_0 + V * W_expansion, where W_0 is just some random initial state, V is a new variable with a low dimension (the intrinsic_dimension, once we've figured out the smallest sensible size), and W_expansion is a matrix that 'expands' from the size of V up to the size of W_0. W_expansion can be randomly initialised, since all we care about is that V gets to have influence across a hyperplane in the original parameter space, and that W_0 lies within that plane.
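In code, the reparameterisation looks roughly like this (a minimal sketch of the idea, not the notebook's code; the shapes and the name intrinsic_dim are just for illustration):

```python
import torch

W_0 = torch.randn(256, 128)             # frozen random initial state

intrinsic_dim = 100                     # the low dimension we're probing
W_expansion = torch.randn(intrinsic_dim, 256 * 128)   # random, also frozen

V = torch.nn.Parameter(torch.zeros(intrinsic_dim))    # the only trainable variable

# Effective weight used in the forward pass; starts out exactly at W_0.
W_new = W_0 + (V @ W_expansion).view(256, 128)
```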

We then optimise the new network, but only alter V. If we can get the network to train 'well', then we know that V is big enough, so we can try smaller sizes of V. At some point the network won't train well, and we know we've gone "too far" in restricting the size of V. The smallest size of V just before that point is what we call the intrinsic_dimension.

So : the IntrinsicDimensionWrapper takes a module (in the notebook I tested on a single Linear layer first, and then a whole MNIST CNN), goes through all the parameter blocks, and replaces each one with its initial value plus a dependency on a single shared V. It then cleans out all the old parameters, so that when PyTorch thinks about optimisation, it only sees V.
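A minimal sketch of what that wrapping amounts to (my reconstruction of the mechanism, not the notebook verbatim): record each parameter's initial value as a buffer, delete the Parameter itself, and rebuild the weight from the shared V on every forward pass.

```python
import torch
import torch.nn as nn

class IntrinsicDimensionWrapper(nn.Module):
    # Reparameterise every parameter of `module` through one low-dimensional V.
    def __init__(self, module, intrinsic_dim):
        super().__init__()
        self.module = module
        self.V = nn.Parameter(torch.zeros(intrinsic_dim))
        self.names = []
        for name, p in list(module.named_parameters()):
            i = len(self.names)
            self.register_buffer(f"w0_{i}", p.detach().clone())  # the W_0 block
            self.register_buffer(
                f"P_{i}",
                torch.randn(intrinsic_dim, p.numel()) / intrinsic_dim ** 0.5,
            )                                                     # the W_expansion block
            self.names.append(name)
            parent, leaf = self._locate(name)
            delattr(parent, leaf)                      # remove the old Parameter
            setattr(parent, leaf, p.detach().clone())  # placeholder plain tensor

    def _locate(self, name):
        parent = self.module
        *path, leaf = name.split(".")
        for part in path:
            parent = getattr(parent, part)
        return parent, leaf

    def forward(self, *args, **kwargs):
        for i, name in enumerate(self.names):
            w0, P = getattr(self, f"w0_{i}"), getattr(self, f"P_{i}")
            parent, leaf = self._locate(name)
            # Assign a plain tensor, so gradients flow back only into V.
            setattr(parent, leaf, w0 + (self.V @ P).view(w0.shape))
        return self.module(*args, **kwargs)
```

After wrapping, list(wrapper.named_parameters()) contains only V: the architecture is unchanged, but the original weight and bias tensors are now derived values rather than Parameters, so they no longer show up.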

Does this make sense? I made the notebook for a presentation I gave in Singapore, a short while after the paper came out : https://blog.mdda.net/ai/2018/05/15/presentation-at-pytorch

Hope this helps
Martin


rahulvigneswaran commented on June 27, 2024

@mdda Thank you so much for the explanation. This clears up most of my doubts. In the paper, they mention three ways of generating the random matrix (W_expansion):

  • Dense
  • Sparse
  • Fastfood

From your code, I can see that you went with the naive dense method of random-matrix generation. You used torch.randn to generate the random matrix of size matrix_size, but why did you then divide it elementwise by the square root of intrinsic_dimension?

Also, after the wrapper is applied, the model seems to have only one entry in named_parameters(), namely V, and all the conv weight and bias tensors disappear from named_parameters(). I am confused about what you are doing there. Are you changing the architecture by any chance?


mdda commented on June 27, 2024

I guess I should first point out that this was hacked together just a few hours before I gave the talk...

But my self-justification for this is that if I've got a vector and I multiply it by a matrix, there's a kind of 'impedance mismatch' in terms of scaling: each output element adds together size_of_V terms, each roughly O( V_i ) * N(0,1), and a sum like that grows as the square root of the number of terms. So if the elements of V are "about the right size", then I need to downscale the matrix by the square root of something relevant... (the same goes for the 1/sqrt(d_k) factor applied to attention heads in Transformers).

I'm not claiming this is exactly right, but the factor would be irrelevant after training anyway : I was just trying to slice off an approximate scale factor to enable easier optimisation.
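To put a rough number on that hand-wave: each element of V @ W_expansion sums intrinsic_dimension independent unit-variance terms, and such a sum has standard deviation around sqrt(intrinsic_dimension), so dividing the matrix by that square root keeps the projection on the same scale as V. A quick sanity check (illustrative sizes, not the notebook's):

```python
import torch

torch.manual_seed(0)
d = 1000                          # intrinsic dimension
V = torch.randn(d)                # pretend V has unit-scale entries

raw = torch.randn(d, 4096)        # unscaled expansion matrix
scaled = raw / d ** 0.5           # downscale by sqrt(intrinsic_dimension)

print((V @ raw).std())            # ~sqrt(1000), about 31.6 : blown up
print((V @ scaled).std())         # ~1 : matches the scale of V
```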

