
vsepp's People

Contributors

averyma, fartashf


vsepp's Issues

Errors when running vocab.py

Traceback (most recent call last):
File "/Users/jiapei.fjp/Documents/python_project/vsepp/vocab.py", line 121, in <module>
main(opt.data_path, opt.data_name)
File "/Users/jiapei.fjp/Documents/python_project/vsepp/vocab.py", line 109, in main
vocab = build_vocab(data_path, data_name, jsons=annotations, threshold=4)
File "/Users/jiapei.fjp/Documents/python_project/vsepp/vocab.py", line 79, in build_vocab
captions = from_coco_json(full_path)
File "/Users/jiapei.fjp/Documents/python_project/vsepp/vocab.py", line 46, in from_coco_json
coco = COCO(path)
File "/Users/jiapei.fjp/venv/download_urls/lib/python2.7/site-packages/pycocotools/coco.py", line 84, in __init__
dataset = json.load(open(annotation_file, 'r'))
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 290, in load
**kw)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
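For reference, this error usually means the annotation file is empty, truncated, or not valid JSON at all (e.g. a failed download or an HTML error page saved under the .json name). A quick sanity check, with the path below only as an example, is to load the file directly:

import json

# example path; point this at whatever annotation file build_vocab is reading
path = 'data/coco/annotations/captions_train2014.json'
with open(path) as f:
    data = json.load(f)  # raises the same error if the file is not valid JSON
print(len(data.get('annotations', [])))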

evaluation problems

def t2i(images, captions, npts=None, measure='cosine', return_ranks=False):
    """
    Text->Images (Image Search)
    Images: (5N, K) matrix of images
    Captions: (5N, K) matrix of captions
    """
    if npts is None:
        npts = int(images.shape[0] / 5)
    # every image is replicated once per caption (5 captions per image in COCO),
    # so taking every 5th row recovers the N unique images
    ims = numpy.array([images[i] for i in range(0, len(images), 5)])

Why divide by 5?

Finish validation without running model.train()

The code at lines 176-177 of train.py:

if model.Eiters % opt.val_step == 0:  
    validate(opt, val_loader, model)

runs validation and calls model.val_start(), which puts the batch_norm and dropout layers of the img_encoder and txt_encoder into eval mode.

However, model.train_start() is not called again until the next epoch, which means the dropout and batch_norm layers are only active during the first val_step steps of each epoch...

Is this a bug that needs to be fixed?
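If it is, a minimal sketch of a fix (assuming the method names quoted in this issue, model.val_start() and model.train_start(), are the ones in model.py) would be to switch back to training mode right after validation:

if model.Eiters % opt.val_step == 0:
    validate(opt, val_loader, model)
    # switch the encoders back to train mode so dropout and batch norm
    # stay active for the remaining steps of the epoch
    model.train_start()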

loss gap between train and test

In the final epoch, the training loss is much lower than the test loss. Is this an overfitting problem? If so, overfitting already occurs in the second epoch. Details below:

2021-12-22 08:38:09,866 Epoch: [29][3223/3234] Eit 97010 lr 2e-05 Le 17.5955 (16.2388) Time 0.054 (0.000) Data 0.037 (0.000)
2021-12-22 08:38:10,415 Epoch: [29][3233/3234] Eit 97020 lr 2e-05 Le 9.2004 (16.2394) Time 0.054 (0.000) Data 0.038 (0.000)
2021-12-22 08:38:10,455 Test: [0/40] Le 65.0238 (65.0238) Time 0.040 (0.000)
2021-12-22 08:38:10,861 Test: [10/40] Le 64.3841 (64.6037) Time 0.040 (0.000)
2021-12-22 08:38:11,264 Test: [20/40] Le 64.7672 (64.5425) Time 0.041 (0.000)
2021-12-22 08:38:11,670 Test: [30/40] Le 64.4232 (64.6411) Time 0.041 (0.000)
2021-12-22 08:38:12,840 Image to text: 42.1, 75.0, 85.0, 2.0, 8.8
2021-12-22 08:38:13,407 Text to image: 33.6, 67.4, 80.3, 3.0, 18.0

Loss stuck, not decreasing

Hi, I'm noticing a very strange loss behavior during the training phase.
Initially, the loss decreases as it should. At a certain point it reaches a plateau from which, most of the time, it cannot escape.
In particular, if I use pre-extracted features without fine-tuning the image encoder, the plateau is overcome quite quickly, as shown in the following plot:
[plot: loss with pre-extracted features, no fine-tuning]

However, if I try to fine-tune, the loss gets stuck forever:
[plot: loss with fine-tuning of the image encoder]

I noticed that the loss gets stuck at a very specific value, namely 2 * (batch_size * loss_margin).
It seems the loss is collapsing to a point where the difference between positive and hardest-negative pair similarities is always zero:

s(i, c') - s(i, c) = 0

and

s(i', c) - s(i, c) = 0

where c' is the hardest negative caption for image i and i' is the hardest negative image for caption c, so every hinge term contributes exactly the margin.

I'm using margin = 0.2. For the pre-extracted features I used a batch size of 128, while for fine-tuning the batch size is 32. The configuration is otherwise the same as yours.
In general I noticed this behavior when the network is too complex.
Maybe the reason is that good hard negatives cannot be found with batch sizes smaller than 128; however, I have hardware constraints.
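To make the arithmetic concrete, here is a minimal sketch (not the repo's exact ContrastiveLoss, and assuming the max-violation variant of the loss): when every similarity collapses to the same value, each hinge term equals the margin, so the summed batch loss is exactly 2 * batch_size * margin.

import torch

alpha, batch_size = 0.2, 128
scores = torch.full((batch_size, batch_size), 0.5)       # collapsed: all similarities equal
diagonal = scores.diag().view(batch_size, 1)
cost_s = (alpha + scores - diagonal).clamp(min=0)         # caption-retrieval hinge
cost_im = (alpha + scores - diagonal.t()).clamp(min=0)    # image-retrieval hinge
# with max_violation only the hardest negative per row/column is kept,
# and every entry equals alpha here, so the total is 2 * batch_size * alpha
loss = cost_s.max(1)[0].sum() + cost_im.max(0)[0].sum()
print(loss.item())   # 51.2 == 2 * 128 * 0.2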

Did you notice similar behavior in your experiments? If so, how did you solve it?
Thank you very much

Runs file too large

Hey @fartashf

I would like to ask whether it would be possible to split runs.tar and vocab.tar into smaller archives, one per type of model.
I am trying to download them and remove the models that I do not need when building some docker images, and I think splitting them would make usage much easier and lighter.

Thank you very much

Cannot run train.py

I have some trouble running train.py, and I guess the problem is with tensorboard_logger.

I use PyCharm to debug, and when it gets to "import tensorboard_logger as tb_logger", the code stops and returns "Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)".

When I directly use the "python train.py --data_path "$DATA_PATH" --data_name coco_precomp --logger_name runs/coco_vse++ --max_violation", it only returns "Segmentation fault (core dumped)".

I first guessed the problem was with the installation of tensorboard_logger, so I reinstalled it and ran "import tensorboard_logger as tb_logger" in a Python shell instead of through the script; no error occurred there.

Can anyone help me find the problem? Thanks!

@fartashf

Why take the first element of a batch after padding RNN output?

From my understanding, the code works as follows:

After padding the RNN output "out" to "padded" with batch_first=True, the first dimension of "padded" should be batch_size, so the operation "padded[0]" appears to take the first element of the batch. This operation is unusual and hard to understand. Am I wrong? Could someone explain the purpose of this code?

Thanks in advance.
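For context, a plausible explanation (assuming the code uses torch.nn.utils.rnn.pad_packed_sequence, as the wording above suggests) is that pad_packed_sequence returns a (tensor, lengths) tuple, so "padded[0]" selects the padded output tensor itself, not the first element of the batch:

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

x = torch.randn(4, 7, 16)                       # (batch, seq_len, feature)
lengths = [7, 5, 3, 2]
packed = pack_padded_sequence(x, lengths, batch_first=True)
padded = pad_packed_sequence(packed, batch_first=True)
# padded is a tuple: (padded_tensor, sequence_lengths)
print(type(padded), padded[0].shape)             # (4, 7, 16), the full batch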

about using the dataset

Hi, I have a question about using the dataset. For COCO, I use it for classification: I set the image dimension to 80 and then use the labels to supervise it, but the performance is not good. I would appreciate it if you could give some advice.

problem in finetune

I use the original Flickr30k dataset rather than precomputed features, with the pretrained ResNet-50 model, and I set finetune to True. However, I get poor recall results: R@1, R@5 and R@10 are nearly zero, and I can't find the reason for the bad results. How do you set the parameters when you use ResNet with fine-tuning?
Thank you.

model.py dimension misalignment

There's a dimension misalignment error in the l2norm function when trying to expand the "norm" variable.

Log:

    features = l2norm(features)
  File "/media/shenkev/data/Ubuntu/vsepp/model.py", line 17, in l2norm
    X = torch.div(X, norm.expand_as(X))
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 725, in expand_as
    return Expand.apply(self, (tensor.size(),))
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/tensor.py", line 111, in forward
    result = i.expand(*new_size)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (128) at non-singleton dimension 1. at /pytorch/torch/lib/THC/generic/THCTensor.c:323

I think you can fix this by using unsqueeze instead of expand_as, shown below

def l2norm(X):
    """L2-normalize columns of X
    """
    norm = torch.pow(X, 2).sum(dim=1).sqrt()
    X = torch.div(X, norm.unsqueeze(1))
    return X
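A small follow-up sketch (just an alternative, assuming a PyTorch version where sum supports keepdim, which any recent release does): keeping the reduced dimension avoids the explicit unsqueeze and also guards against division by zero.

import torch

def l2norm(X, eps=1e-8):
    """L2-normalize X along dim 1 (an illustrative variant, not the repo's code)."""
    # keepdim=True keeps the summed dimension as size 1 so broadcasting works
    norm = torch.pow(X, 2).sum(dim=1, keepdim=True).sqrt()
    return torch.div(X, norm + eps)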

Reproducing results

Hi,

First of all, thanks for sharing this great work!

I'm having difficulties reproducing the results from the paper as a baseline. In this issue I will talk about experiment #3.15: VSE++ (ResNet), Flickr30k.

From what I gather from the paper, the config is the following:

  • 30 epochs
  • Load images from disk, no precomputed features?
  • lower the lr after 15 epochs.
  • lr goes from 0.0002 -> 0.00002

My question is: is the image encoder trained end-to-end here or not? In other words, is ResNet152 only used as a fixed feature extractor, or is it also optimized?

According to your documentation, VSE++ (and therefore, I assume, 3.14) can be reproduced by only using the --max_violation flag, but I get (much) lower results. Do I need the --finetune flag as well?

Thanks,
Maurits

Question about the loss function

I studied your paper and code. As I understand it, one caption-image pair is the positive sample, and the other (mini-batch size - 1) caption-image pairs are negative samples. However, if some captions sampled into a mini-batch happen to belong to the same image, those pairs are still treated as negatives by your code when, in fact, they should be positives. Does this affect the hard-negative mining for the contrastive loss?
Looking forward to your reply!

Question about your model?

I have a question about your model. I have images of pills together with a list of texts giving the names of the pills in each image, and I want to map each pill object to its name. Because the number of pill classes is very large, I only know some prior information about the classes.
I have trained a Faster R-CNN to detect the pills, and now I must map each detection to a name. Is your model relevant to this problem? Thank you so much.

Batch formation potentially causing false negatives

Since there are 5 captions for each image in MS-COCO, I believe that in data.py, when a batch is formed, it is possible that two or more of the images in that batch are identical (they are just paired with different captions). Since the ContrastiveLoss implementation assumes only the diagonal of the scores matrix represents scores for aligned images and captions, doesn't this mean it is possible for images and captions that are aligned in the dataset to be treated as unaligned when computing/backpropagating the loss? Here is an example to illustrate the idea:

Consider a batch size of 128. Perhaps the 5th and 19th images selected in the batch are identical (the 5th and 19th captions selected are different, but describe the same image). In the scores matrix in the forward method of ContrastiveLoss, the (5, 5) and (19, 19) entries will be correctly treated as scores for aligned embeddings. However, the (5, 19) and (19, 5) entries will be incorrectly treated as scores for unaligned embeddings.

Did I misunderstand anything with the code? If not, I believe this would affect the cost_s portion of ContrastiveLoss but not the cost_im portion.
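For illustration only, here is a minimal sketch of one way this could be handled; the image-id bookkeeping and function name are hypothetical, not part of the repo's ContrastiveLoss. The idea is to mask every pair whose caption and image share the same underlying COCO image before taking the hardest negative:

import torch

def masked_max_violation_loss(scores, image_ids, margin=0.2):
    """scores: (B, B) image-caption similarity matrix with aligned pairs on the
    diagonal; image_ids: length-B ids of the underlying image for each pair."""
    ids = torch.as_tensor(image_ids)
    diagonal = scores.diag().view(-1, 1)
    # True wherever row and column come from the same underlying image,
    # i.e. pairs that must not be treated as negatives (includes the diagonal)
    same_image = ids.unsqueeze(0) == ids.unsqueeze(1)
    cost_s = (margin + scores - diagonal).clamp(min=0).masked_fill(same_image, 0)
    cost_im = (margin + scores - diagonal.t()).clamp(min=0).masked_fill(same_image, 0)
    # keep only the hardest remaining negative per row / per column
    return cost_s.max(1)[0].sum() + cost_im.max(0)[0].sum()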

single caption query

This code works quite well. Thanks for sharing it.
I'm wondering, do you have any code snippets showing how one might use a trained VSE++ model to create a caption query from text (i.e. a string), submit it to the model to get a single caption embedding, and then search for matching images that have also been mapped into the joint space by the same model?
It's easy to do the comparison once numpy arrays for the caption and image embeddings in the joint space exist, but it's not clear how to use your model with a brand-new caption query, or with a set of CNN image features that are not part of a complete COCO/Flickr/etc. train or test set with corresponding caption/image pairs.
Thanks for any tips. I'd prefer not to rewrite everything if you already have some additional tools for this.
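In case it helps while waiting for an answer, here is a rough sketch of how a new caption query might be embedded and matched; the attribute and helper names (model.txt_enc, vocab('<start>'), an img_embs array) are assumptions inferred from the other issues here, not a verified API:

import nltk
import numpy as np
import torch

def embed_caption(query, model, vocab):
    """Map a raw caption string into the joint space (hypothetical helper)."""
    tokens = nltk.tokenize.word_tokenize(query.lower())
    ids = [vocab('<start>')] + [vocab(t) for t in tokens] + [vocab('<end>')]
    caption = torch.tensor(ids).unsqueeze(0)           # (1, length)
    with torch.no_grad():
        cap_emb = model.txt_enc(caption, [len(ids)])    # (1, embed_size)
    return cap_emb.squeeze(0).cpu().numpy()

def rank_images(cap_emb, img_embs):
    """img_embs: (num_images, embed_size); VSE++ embeddings are already
    L2-normalized, so a dot product is the cosine similarity."""
    scores = img_embs @ cap_emb
    return np.argsort(-scores)                           # best match first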

KeyError: 'unexpected key "cnn.classifier.1.weight" in state_dict'

Hello Fartash,

Thanks for your great code. However, I encounter a little problem when I want to load a pretrained model. The keys of the pretrained model (model_best.pth.tar) do not match with the architecture.

The keys from the pretrained model:

state = torch.load('coco_vse++_vggfull_restval_finetune/model_best.pth.tar')
print(state['model'][0].keys()) # image encoder with vgg16
> odict_keys(['cnn.features.module.0.weight', 'cnn.features.module.0.bias',
'cnn.features.module.2.weight', ...
'cnn.features.module.34.weight', 'cnn.features.module.34.bias',
'cnn.classifier.1.weight', 'cnn.classifier.1.bias',
'cnn.classifier.4.weight', 'cnn.classifier.4.bias',
'fc.weight', 'fc.bias']
)

The keys from the image encoder architecture printed from here:

print(self.state_dict().keys())
>  odict_keys(['cnn.features.module.0.weight', 'cnn.features.module.0.bias',
'cnn.features.module.2.weight', ...
'cnn.features.module.34.weight', 'cnn.features.module.34.bias',
'cnn.classifier.0.weight', 'cnn.classifier.0.bias',
'cnn.classifier.3.weight', 'cnn.classifier.3.bias',
'fc.weight', 'fc.bias']
)

This mismatch suggests that the checkpoint was trained with a layer (which is not an nn.Linear) in the first position of the classifier nn.Sequential, but in the torchvision implementation of VGG19 that you are using this is not the case, and even here you are only removing the last fully connected layer.

Any help would be appreciated, thanks :)
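As a hedged workaround sketch (the 1→0 and 4→3 index mapping is only inferred from the two key listings above, and model.img_enc is an assumed attribute name), one could rename the checkpoint keys before loading:

import torch
from collections import OrderedDict

state = torch.load('coco_vse++_vggfull_restval_finetune/model_best.pth.tar',
                   map_location='cpu')
remap = {'cnn.classifier.1.': 'cnn.classifier.0.',
         'cnn.classifier.4.': 'cnn.classifier.3.'}
fixed = OrderedDict()
for key, value in state['model'][0].items():
    for old, new in remap.items():
        if key.startswith(old):
            key = new + key[len(old):]
            break
    fixed[key] = value
# then load the remapped weights into the image encoder, e.g.:
# model.img_enc.load_state_dict(fixed)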

questions on dataset construction

Hi. Thanks for your code.
1- May I ask why you include the start and end tokens when constructing the caption? Since you only want to encode the caption, there is no need for them. As far as I know, the start and end tokens are only needed when predicting text (such as in image captioning, neural machine translation, etc.). But in your case, you just want to encode. Or does it have to do with how the evaluation metric is calculated?

2- I also have a question about the data loader. In this part:

        if self.images.shape[0] != self.length:
            self.im_div = 5
        else:
            self.im_div = 1
        # the development set for coco is large and so validation would be slow
        if data_split == 'dev':
            self.length = 5000

I understand that for the training and test splits, you replicate each image 5 times (the number of captions per image). However, for the 'dev' split (validation after training), you specify a length of 5000 only. For Flickr30k that is still correct (since we have 1000 validation images * 5), but for COCO the actual validation set with the replication is 25K, and you are loading only a portion of it. According to how the data loader works, it generates indices according to the length of the dataset specified in __len__. Therefore, for the COCO dev set it generates 5000 indices, and with images[i//5] this retrieves only 1000 of the original COCO validation images. So my question is: is that the right thing to do? What if the other samples are better? This could lead to a low validation score when it should be high.

Metrics for 1k test images on MS COCO

I am sorry to create a new issue but I think this way it would be better since the doubt might be shared by a few other people.

Just to confirm: do we evaluate on the 5 folds and report the best result on one of these samples (the best-performing 1k sample), or do we report the average value over all 5 sets? If the result is for the best-performing 1k sample, is that not a kind of "pick and choose" scenario?

Honestly, this is a wonderful code base and I think this is the most likely place to find a solution to my doubts :)

How to build vocab?

Hello fartashf, your code is very helpful for me, but I am confused reading vocab.py. When you construct the vocab for f8k_precomp, you use the train and validation captions, but when you construct the vocab for f8k, you use the train, validation and test captions. Could you explain this?

Question about recall @ k

Hi There! Thanks for the code :)

I'm curious about how evaluation works in this library. Based on a comment here I think that the input matrices in t2i and i2t are the same, and the image features are copied 5x. In t2i, it looks like the images are downsampled into a matrix consisting of only a single copy of each image, which makes sense for image retrieval. However, in caption retrieval, it looks like each image is used to search all of the captions, of which, there are 5 correct ones. This part of the code appears to find, among the 5 correct captions for MSCOCO, the best/lowest ranked one among the 5K possibilities.

However -- I am a bit confused because I don't think that this corresponds to the definition of R@k (e.g., for k=1,5,10)... If there are 5 correct answers, I think the maximal recall-at-1 is 1/5? Perhaps I am misunderstanding the code, e.g., how the cross-validation splits are set up, and/or what is standard for cross-modal retrieval work. What do you think?
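To make the convention concrete, here is a small sketch of the metric as this issue describes it (Recall@k of the best-ranked correct caption; the function is illustrative, not the repo's i2t):

import numpy as np

def recall_at_k(best_ranks, k):
    """best_ranks[i] is the 0-based rank of the best-ranked ground-truth
    caption for image i; R@k is the fraction of images with that rank < k."""
    ranks = np.asarray(best_ranks)
    return 100.0 * np.mean(ranks < k)

# per image, the minimum rank over its 5 correct captions is used, so a single
# correct caption inside the top k already counts as a hit (R@1 can reach 100)
print(recall_at_k([0, 3, 12, 1, 7], 5))   # 60.0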

Patch for Python3 compatibility

diff --git a/data.py b/data.py
index 913ea16..520c38d 100644
--- a/data.py
+++ b/data.py
@@ -221,13 +221,12 @@ class PrecompDataset(data.Dataset):
     def __getitem__(self, index):
         # handle the image redundancy
         img_id = index/self.im_div
-        image = torch.Tensor(self.images[img_id])
+        image = torch.Tensor(self.images[int(img_id)])
         caption = self.captions[index]
         vocab = self.vocab
 
         # Convert caption (string) to word ids.
-        tokens = nltk.tokenize.word_tokenize(
-            str(caption).lower().decode('utf-8'))
+        tokens = nltk.tokenize.word_tokenize(str(caption).lower())
         caption = []
         caption.append(vocab('<start>'))
         caption.extend([vocab(token) for token in tokens])
diff --git a/evaluation.py b/evaluation.py
index 7e5da4e..9171f85 100644
--- a/evaluation.py
+++ b/evaluation.py
@@ -57,7 +57,7 @@ class LogCollector(object):
         """Concatenate the meters in one log line
         """
         s = ''
-        for i, (k, v) in enumerate(self.meters.iteritems()):
+        for i, (k, v) in enumerate(self.meters.items()):
             if i > 0:
                 s += '  '
             s += k + ' ' + str(v)
@@ -66,7 +66,7 @@ class LogCollector(object):
     def tb_log(self, tb_logger, prefix='', step=None):
         """Log using tensorboard
         """
-        for k, v in self.meters.iteritems():
+        for k, v in self.meters.items():
             tb_logger.log_value(prefix + k, v.val, step=step)
 
 

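A small stylistic note on the first hunk of the patch above (an alternative, not something the patch requires): in Python 3 the index could instead be computed with floor division so it is an int from the start.

# hypothetical alternative to wrapping the index with int()
im_div, index = 5, 17
img_id = index // im_div   # 3, already an int under Python 3
print(img_id)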
train on synthetic dataset

Image to text: R@1 is bad and fluctuates around 1.5.
Text to image: R@1 is bad and fluctuates around 0.5.

my vocabulary size is small (~100 words).

do you think training my own 'resnet' image/text encoder on synthetic images would help?

Much appreciated!

wonderful codes

Wonderful code!!!
Lines 214-218 in model.py may be modified to:
out = torch.unsqueeze(torch.squeeze(_), 1)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4608x2048 and 4096x1024)

Hello, I encountered the following problem during testing:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4608x2048 and 4096x1024)
I checked the parameters of the model and found that 1024 is emb_size, 4096 is img_size, and 4608x2048 comes from flattening image features of shape [128, 36, 2048].
How can I solve this problem?
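For what it's worth, the numbers suggest that 2048-dimensional region features are being fed into a projection built for 4096-dimensional features; the sketch below only illustrates that mismatch (matching the model's image dimension to the feature width, e.g. via an --img_dim style option, is an assumption about the training script, so check its arguments):

import torch
import torch.nn as nn

features = torch.randn(128 * 36, 2048)    # flattened [128, 36, 2048] region features
fc_wrong = nn.Linear(4096, 1024)           # calling this would raise the reported RuntimeError
fc_right = nn.Linear(2048, 1024)           # in_features must match the feature width
print(fc_right(features).shape)            # torch.Size([4608, 1024])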

meanr and rsum seem inversely correlated

Hi @fartashf,

Great code base, very easy to work with! I had two quick questions regarding evaluation metrics:

  • I noticed that meanr increases as rsum increases and was wondering if you had an explanation for this? (See plots below.) I should mention that these results are with a few modifications to your code; specifically, I used GloVe embeddings that are not trained / backpropagated into.

  • Also, I was wondering what the reason was for choosing rsum for model selection instead of meanr?

[two screenshots: plots of the evaluation metrics over training]

Thanks!

Train on other language dataset

Hi, thanks for releasing the great work!
Sorry to bother you, but can this model be applied to caption datasets in other languages?

Thanks for your reply!

precomputed dataset

Hi, first of all, thank you for open-sourcing your code; it is very helpful.
I have a question about your precomputed dataset.

For training the VSE model on the COCO dataset, the code uses some data augmentation:

t_list = [transforms.RandomResizedCrop(opt.crop_size),
          transforms.RandomHorizontalFlip()]

Did you apply this data augmentation when creating the precomputed dataset?

Thank you

FileNotFoundError when try to reproduce results of pretrained model

Hey,
I hit this issue when following your README to evaluate the pretrained model. I've set the environment variable, and echo $RUN_PATH in the terminal shows the right path.

>>> evaluation.evalrank("$RUN_PATH/coco_vse++/model_best.pth.tar", data_path="$DATA_PATH", split="test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/localdata/geng/lib/vsepp/evaluation.py", line 137, in evalrank
    checkpoint = torch.load(model_path, map_location=torch.device(device))
  File "/mnt/localdata/geng/anaconda3/envs/vsepp/lib/python3.7/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/mnt/localdata/geng/anaconda3/envs/vsepp/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/mnt/localdata/geng/anaconda3/envs/vsepp/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: '$RUN_PATH/coco_vse++/model_best.pth.tar'
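For what it's worth, the likely cause is that "$RUN_PATH" is a shell variable, and Python does not expand it inside string literals; a sketch of a fix is to expand it explicitly (or paste the absolute path):

import os
import evaluation

# expand the shell-style variables before handing the paths to torch.load
model_path = os.path.expandvars('$RUN_PATH/coco_vse++/model_best.pth.tar')
data_path = os.path.expandvars('$DATA_PATH')
evaluation.evalrank(model_path, data_path=data_path, split='test')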

Missing some files to run certain models

Seems like I'm missing some files to run certain models. I can run coco_vse++ but when I try to use coco_vse0_resnet_restval_finetune, I get:

IOError: [Errno 2] No such file or directory: './data/coco/annotations/captions_val2014.json'

I indeed don't have this file; I didn't seem to get it from the downloads you listed.

problem in evaluation

I have some problems with evaluation and need some help!

I can run the evaluation of VSE++(1C) following the instructions in the README, and it gives the results for 1k test images.

When I run the evaluation on my own trained VSE++(RC) model, it tests on 5k images by default, as below:

Computing results...
Test: [0/196] Le 62.9998 (62.9998) Time 3.004 (0.000)
Test: [10/196] Le 63.0371 (64.4791) Time 0.263 (0.000)
Test: [20/196] Le 64.0825 (64.2898) Time 0.259 (0.000)
Test: [30/196] Le 64.6391 (64.2518) Time 0.263 (0.000)
Test: [40/196] Le 64.9525 (64.1494) Time 0.260 (0.000)
Test: [50/196] Le 61.6837 (64.0605) Time 0.263 (0.000)
Test: [60/196] Le 62.2307 (64.0485) Time 0.262 (0.000)
Test: [70/196] Le 62.4721 (63.8480) Time 0.273 (0.000)
Test: [80/196] Le 64.6234 (63.8591) Time 0.268 (0.000)
Test: [90/196] Le 62.8796 (63.8349) Time 0.261 (0.000)
Test: [100/196] Le 61.5295 (63.8687) Time 0.262 (0.000)
Test: [110/196] Le 63.7632 (63.8381) Time 0.260 (0.000)
Test: [120/196] Le 66.5532 (63.8105) Time 0.263 (0.000)
Test: [130/196] Le 64.4752 (63.8554) Time 0.262 (0.000)
Test: [140/196] Le 67.8941 (63.8907) Time 0.261 (0.000)
Test: [150/196] Le 61.7584 (63.8451) Time 0.264 (0.000)
Test: [160/196] Le 66.8306 (63.8695) Time 0.264 (0.000)
Test: [170/196] Le 62.1828 (63.8233) Time 0.266 (0.000)
Test: [180/196] Le 64.8306 (63.7709) Time 0.263 (0.000)
Test: [190/196] Le 59.2744 (63.8022) Time 0.264 (0.000)
Images: 5000, Captions: 25000

How can I use 1k test images to test rather than 5k?
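In case it helps, a hedged guess (the fold5 argument is an assumption about evaluation.evalrank's signature, and the paths below are placeholders, so check evaluation.py) is that the 1k protocol is obtained by averaging over five 1k folds of the 5k test set:

import evaluation

# assumption: fold5=True reports results averaged over five 1k-image folds
# instead of evaluating on all 5k test images at once
evaluation.evalrank('runs/coco_vse++_rc/model_best.pth.tar',
                    data_path='data/', split='test', fold5=True)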
