Hi, thanks for your code! I have some questions about the model. When we construct

Some questions about the prototype matrix

markin-wang commented on August 11, 2024

Comments (2)

Markin-Wang commented on August 11, 2024

Hi, thanks for your interest.

The cross-modal prototype matrix is meant to learn and record the cross-modal patterns of each class, rather than the image/sentence features of specific samples. The cross-modal prototypes are therefore initialized from several clustered, concatenated features (high-level statistics that extract the common pattern) instead of from the representations of specific samples. Note that these cross-modal patterns are designed at the class level rather than the instance level. Hence, they can be applied to both images/sentences and tokens (fine-grained) within the same class. For example, some cross-modal patterns may guide the model in how to describe the content (style, detailed or brief).
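To make the initialization idea concrete, here is a minimal sketch in plain Python. It is not the xpronet code: the function names (`init_prototypes`, `kmeans`), the tuple layout of `samples`, and the use of k-means as the clustering step are assumptions made for illustration. It only shows the general pattern described above: concatenate the image and text features of each sample, group them by class, cluster within each class, and keep the cluster centers (not any individual sample) as the initial prototypes.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; assumes len(points) >= k. Returns k cluster centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Keep the previous center if a cluster received no points.
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers


def init_prototypes(samples, k):
    """samples: iterable of (class_label, image_feature, text_feature).

    Hypothetical helper: returns {class_label: [k prototype vectors]},
    where each prototype is a cluster center of the concatenated
    image+text features belonging to that class.
    """
    by_class = {}
    for label, img_feat, txt_feat in samples:
        by_class.setdefault(label, []).append(list(img_feat) + list(txt_feat))
    return {label: kmeans(feats, k) for label, feats in by_class.items()}
```

Because each prototype is a cluster center, it summarizes a group of samples in the class rather than copying the representation of a single one.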

In addition, some learned cross-modal patterns can also be fine-grained, because samples within a class are grouped, so each group may focus more on certain parts of the sentence or certain patches (imagine you group ten different types of cars from the car category: what will the model focus on?).

Moreover, the initialization ensures that the cross-modal prototype matrix carries good semantic information at the beginning. Through the design of cross-modal prototype querying and responding, together with the contrastive learning, the model learns which patterns should be learned and recorded, and optimizes the cross-modal prototype matrix during training.
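As a rough illustration of what querying and responding could look like, the sketch below scores one query feature (for example a patch or token representation) against the prototypes of its class, keeps the top-k matches, and returns their softmax-weighted sum as the response. The dot-product similarity, the softmax weighting, the `top_k` parameter, and the function name `query_prototypes` are all assumptions for illustration; the paper's exact similarity measure and projections may differ, and the sketch assumes the query and the prototypes already live in the same dimension.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def query_prototypes(query, prototypes, top_k):
    """Hypothetical querying/responding step for one query vector.

    Returns (indices of selected prototypes, their weights, response vector).
    """
    scores = [dot(query, p) for p in prototypes]
    # Querying: keep the top_k most similar prototypes.
    top = sorted(range(len(prototypes)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected scores (shifted by the max for stability).
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Responding: weighted sum of the selected prototypes.
    dim = len(query)
    response = [sum(w * prototypes[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return top, weights, response
```

In a trainable model the prototypes would be parameters, so gradients flowing through the response (and through any contrastive loss on the prototypes) would update them during training, which is the "further optimization" mentioned above.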

Hope this helps you figure out the problem.


Eldo-rado commented on August 11, 2024

Thank you for your reply. I think I mostly understand. Can it be summarized in the following three points?

1. While the cluster is formed from instances, what it represents has risen to the level of "categories".
2. In the clustering process, instances/samples with high fine-grained similarity tend to cluster together. Thus, at initialization time, fine-grained information is already included in the prototype.
3. This will be further optimized in subsequent training.

In addition, is there some ambiguity in the notation r_j^s across the following figures in the paper? The subscript of r in Fig.1 denotes a certain patch, whereas the subscript of r in Fig.2 denotes a certain sample. Maybe the r in Fig.2 should be bold? I am not sure whether I understand this correctly.

Fig.1

Fig.2
