
Comments (6)

lerrytang commented on May 17, 2024

This code is correct. The *2 is there because each center position is an (x, y) pair, so the next dense layer should have k*2 inputs.
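
To make the k*2 concrete, here is a minimal sketch in plain Python. The names `top_k`, `centers`, and `flat_input` and the coordinate values are illustrative assumptions, not identifiers taken from the repository; the only point is that k (x, y) pairs flatten into 2k numbers.

```python
# Hypothetical example: k selected patch centers, each an (x, y) pair.
top_k = 3
centers = [(0.9219, 0.9219), (0.4219, 0.4844), (0.8594, 0.9219)]

# Flatten [(x1, y1), ..., (xk, yk)] into [x1, y1, ..., xk, yk].
flat_input = [coord for center in centers for coord in center]

# The dense layer that consumes this vector needs top_k * 2 inputs.
print(len(flat_input), top_k * 2)
```

With `top_k = 1` the same flattening gives a vector of length 2.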

from brain-tokyo-workshop.

maraoz commented on May 17, 2024

Thanks @lerrytang! I was just about to comment saying I tried removing it and the code's not working anymore :)

I guess I need to dive deeper into the attention module code to really understand what it's doing. Thanks again!


lerrytang commented on May 17, 2024

Again, thanks for your interest. You're always welcome to ask more questions :)


maraoz commented on May 17, 2024

You are awesome @lerrytang! If you offer so generously... I'm confused by the interaction between SelfAttention and MLPSolution... As far as I can tell, the MLPSolution is only receiving the (x,y) coordinates of the best top_k patches... and making a decision on which action to take only with that info? From reading your article I thought that the agent could 'look' at those patches to make the decision, but I'm not seeing anywhere in the code where the MLPSolution is able to access the pixel info from the patches! Is this magic or what!?!?!? :P

For example, setting top_k to 1 and printing the dimension of MLPSolution's input (print('MLPSolution input', inputs, inputs.shape)), I get:

...
MLPSolution input tensor([0.9219, 0.9219]) torch.Size([2])
MLPSolution input tensor([0.4219, 0.4844]) torch.Size([2])
MLPSolution input tensor([0.9219, 0.9219]) torch.Size([2])
MLPSolution input tensor([0.4219, 0.4844]) torch.Size([2])
MLPSolution input tensor([0.8594, 0.9219]) torch.Size([2])
...

So... where is the agent actually looking at the pixel input from the patches??!?!?


lerrytang commented on May 17, 2024

That's the interesting part :)
The agent has 2 parts: the self-attention visual module and the LSTM controller. MLPSolution is the latter (despite its name).

The former gets all the RGB images and does the patch voting to pick the K patches; this is the only place in the code where image info is looked at. After that we discard the unimportant patches, extract features from the K selected ones, and feed those features to the controller. As for feature extraction, one can extract any feature from these K patches, but in our experiments we simply used the locations and disregarded the content info, which is why you don't see any pixel-processing code in MLPSolution.
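
The flow described above might be sketched roughly as follows, in plain Python rather than the PyTorch used in the repository. Everything here is an assumption for illustration: the function name `top_k_patch_centers`, the toy 4x4 attention matrix, the 2x2 grid of patch centers, and the choice to count a patch's votes as the column sum of the attention matrix. It is not the repository's implementation.

```python
def top_k_patch_centers(attention, centers, k):
    """Pick the k most-voted patches and return only their (x, y) centers, flattened."""
    n = len(attention)
    # Assumed voting rule: patch j's votes are the sum of column j.
    votes = [sum(attention[row][col] for row in range(n)) for col in range(n)]
    # Indices of the k patches with the most votes, highest first.
    selected = sorted(range(n), key=lambda i: votes[i], reverse=True)[:k]
    # Keep only the locations; the pixel content is discarded at this point.
    features = []
    for i in selected:
        features.extend(centers[i])
    return features


# Toy data: 4 patches on a 2x2 grid, with normalized center coordinates.
centers = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
attention = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.5, 0.3, 0.1],
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.7, 0.1, 0.1],
]

controller_input = top_k_patch_centers(attention, centers, k=2)
print(controller_input)  # [0.75, 0.25, 0.25, 0.75]
```

The image only enters through the attention matrix. What reaches the controller is a vector of k*2 coordinates, which matches the shape of the printed MLPSolution inputs.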

Hope this helps.


maraoz commented on May 17, 2024

OMG that's amazing!! I guess I had missed that part in the article:
[image]

This is even more impressive then!! OK... I need to think about what all this means. Thanks again @lerrytang!

