
caffe-video_triplet's Introduction

caffe-video_triplet

This code is built on Caffe (project site).

It is the implementation for training the siamese-triplet network described in the paper:

Xiaolong Wang and Abhinav Gupta. Unsupervised Learning of Visual Representations using Videos. Proc. of IEEE International Conference on Computer Vision (ICCV), 2015. pdf

Codes

Training scripts are in rank_scripts/rank_alexnet:

Since the siamese networks share their weights, only one network is defined in the prototxt.

The input to the network is pairs of image patches. Each pair consists of similar patches taken from the same video track. A label specifies which video the patches come from: patches from different videos carry different labels (the actual value does not matter, as long as it is an integer). This way, the third (negative) patch can be drawn from pairs with a different label.
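As a sketch, a training list following this scheme could be generated as below. The file paths and the helper name are illustrative, not part of the release:

```python
# Build a training list: each video track contributes a pair of similar
# patches, and both patches of a pair carry that video's integer label.
# Patches with different labels can then serve as negatives for each other.
def build_train_list(video_patches):
    """video_patches: one (patch_a_path, patch_b_path) tuple per video."""
    lines = []
    for label, (patch_a, patch_b) in enumerate(video_patches):
        # Both patches of a pair share the same label (same video).
        lines.append(f"{patch_a} {label}")
        lines.append(f"{patch_b} {label}")
    return lines

example = build_train_list([
    ("vid0/patch_t0.jpg", "vid0/patch_t1.jpg"),
    ("vid1/patch_t0.jpg", "vid1/patch_t1.jpg"),
])
```

Any integer labeling works, as long as patches from the same video share a label and patches from different videos do not.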

In the loss, for each pair of patches, the layer tries to find a third, negative patch within the same batch. There are two ways to do this: random selection and hard negative mining.

In the prototxt:

layer {
  name: "loss"
  type: "RankHardLoss"
  rank_param {
    neg_num: 4
    pair_size: 2
    hard_ratio: 0.5
    rand_ratio: 0.5
    margin: 1
  }
  bottom: "norml2"
  bottom: "label"
}

neg_num is the number of negative patches sampled for each pair of patches; if it is 4, each pair yields 4 triplets. pair_size = 2 simply means the inputs come in pairs of patches. hard_ratio = 0.5 means half of the negative patches are hard examples; rand_ratio = 0.5 means the other half are randomly selected. To start, you can simply set rand_ratio = 1 and hard_ratio = 0. The margin of the ranking loss needs to be tuned per task; trying margin = 0.5 or 0.1 might make a difference for other tasks.
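A NumPy sketch of how these parameters interact is given below. The batch layout and function name are assumptions for illustration, not the RankHardLoss layer's actual implementation:

```python
import numpy as np

def rank_hard_loss(feats, labels, neg_num=4, hard_ratio=0.5,
                   margin=1.0, rng=None):
    """feats: (N, D) L2-normalized rows laid out as (anchor, positive)
    pairs; labels: per-row video ids (both rows of a pair share one)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_hard = int(neg_num * hard_ratio)
    n_rand = neg_num - n_hard          # rand_ratio fills the remainder
    total = 0.0
    for i in range(0, len(feats), 2):  # pair_size = 2
        anchor, pos = feats[i], feats[i + 1]
        cand = np.where(labels != labels[i])[0]   # other videos only
        sims = feats[cand] @ anchor
        hard = cand[np.argsort(-sims)[:n_hard]]   # most similar = hardest
        rand = rng.choice(cand, size=n_rand, replace=False)
        for j in np.concatenate([hard, rand]):
            neg = feats[j]
            # Hinge: similarity to the positive should beat similarity
            # to the negative by at least `margin`.
            total += max(0.0, margin - anchor @ pos + anchor @ neg)
    return float(total)
```

With neg_num = 4 and hard_ratio = rand_ratio = 0.5, each pair contributes two hard-mined and two random triplets, matching the prototxt above.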

Models

We offer two models trained with our method:

  • color model: trained with RGB images.
  • gray model: trained with grayscale images (3-channel inputs).
  • prototxt: the prototxt for both models.
  • mean: the mean file.

In case our server is down, the models can be downloaded from dropbox:

  • color model: trained with RGB images.
  • gray model: trained with grayscale images (3-channel inputs).

Training Patches

The unsupervised mined patches can be downloaded from here: https://www.dropbox.com/sh/vgp2k3mdi61sdgr/AAB9vwX140jppHjp33n4UoO7a?dl=0

Each tar file contains a different set of patches. Note that YouTube.tar.gz can be extracted with "tar xf" even though it is named as a ".tar.gz" file.

The example of the training list can be downloaded from here: https://www.dropbox.com/s/tnbu2myy7g0i6l6/trainlist.txt?dl=0

caffe-video_triplet's People

Contributors

xiaolonw


caffe-video_triplet's Issues

triplet loss doesn't converge, what can I do?

hi,
I used your code to train with the triplet loss, but the loss value does not converge: it stays equal to the margin the whole time when I train on face images with your model. Why? Looking forward to your reply, thank you very much!

about diff

@xiaolonw
if (tloss1 > 0)
{
    for (int k = 0; k < dim; k++)
    {
        fori_diff[k] += (fneg[k] - fpos[k]); // / (pairNum * 1.0 - 2.0);
        fpos_diff[k] += -fori[k];            // / (pairNum * 1.0 - 2.0);
        fneg_diff[k] += fori[k];
    }
}
if (tloss2 > 0)
{
    for (int k = 0; k < dim; k++)
    {
        fori_diff[k] += -fpos[k];            // / (pairNum * 1.0 - 2.0);
        fpos_diff[k] += fneg[k] - fori[k];   // / (pairNum * 1.0 - 2.0);
        fneg_diff[k] += fpos[k];
    }
}
Why is the gradient computed differently when loss1 > 0 versus loss2 > 0?
loss = f(a)·f(n) − f(a)·f(p)
Taking partial derivatives, the diff when loss1 > 0 matches this loss, but how is the formula for loss2 > 0 derived?
Thanks a lot!
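Working backwards from the diffs in the snippet above, the two hinge terms are consistent with the following (writing f(a), f(p), f(n) for fori, fpos, fneg and m for the margin; this is a reconstruction from the gradient code, not a formula quoted from the paper):

```latex
\ell_1 = \max\big(0,\; m + f(a)\cdot f(n) - f(a)\cdot f(p)\big), \qquad
\frac{\partial \ell_1}{\partial f(a)} = f(n) - f(p), \quad
\frac{\partial \ell_1}{\partial f(p)} = -f(a), \quad
\frac{\partial \ell_1}{\partial f(n)} = f(a)

\ell_2 = \max\big(0,\; m + f(p)\cdot f(n) - f(a)\cdot f(p)\big), \qquad
\frac{\partial \ell_2}{\partial f(a)} = -f(p), \quad
\frac{\partial \ell_2}{\partial f(p)} = f(n) - f(a), \quad
\frac{\partial \ell_2}{\partial f(n)} = f(p)
```

Under this reading, the first term anchors on the original patch and the second anchors on the positive patch, which would explain why the two branches compute different diffs.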

Corrupted files on dropbox

Dear Wang,

every time I download the files "collect.tar" and "collect2.tar.gz" I get corrupted files, while all other files work fine. Can you please upload the correct files again?

Thanks,
Biagio Brattoli

about input image list

Is the input image list ordered like 1 1 2 2 3 3 ...? Or randomly arranged? Will there be duplicated file names in the list?

about Precision

Hello Mr. Wang,
I have read your paper "Unsupervised Learning of Visual Representations using Videos" and I have some questions about precision in caffe-video_triplet.
On page 7 of your paper, you tabulate the mAP for many categories, and I want to clarify what it means. For example, for a bus: does it mean the bus appeared in the video 100 times but we only recognized it 50 times? Or does it mean we recognized the bus 100 times but it actually appeared only 50 times, so we were mistaken in 50 recognitions?

(If I haven't made my point clear, I can describe it in Chinese if that is easier to understand.)

Normalization

Hi,

Perhaps, too late to post, but anyway, thanks for sharing this project. Your paper and this project are really helpful.

Just wondering why the calculation in the Norm layer is done per channel. I understand it does a normalization to make |f(X)| = 1. Then the sum is supposed to run over the whole feature vector of an image input, but in the implementation it runs over the channels at each pixel location.

The difference may be negligible, but isn't this better?

      for (int i = 0; i < num; i++)
      {
          Dtype* features = bottom_data + i * dim;  // mutable: normalized in place
          Dtype sumdata = 0;
          for (int k = 0; k < dim; k++)
          {
              sumdata += features[k] * features[k];
          }
          // normalize the whole feature vector of this image
          sumdata = sqrt(sumdata) + 1e-6;
          for (int k = 0; k < dim; k++)
          {
              features[k] /= sumdata;
          }
      }
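In NumPy terms, the contrast between the per-image normalization proposed above and the per-location normalization in the repo can be sketched as follows (shapes and function names are illustrative):

```python
import numpy as np

def l2_normalize_whole(feat, eps=1e-6):
    """Normalize the whole feature vector of one image: |f(X)| = 1."""
    return feat / (np.sqrt(np.sum(feat ** 2)) + eps)

def l2_normalize_per_location(feat_chw, eps=1e-6):
    """Normalize across channels independently at each spatial
    location (axis 0 is the channel axis), as the Norm layer here
    is described to do."""
    norms = np.sqrt(np.sum(feat_chw ** 2, axis=0, keepdims=True)) + eps
    return feat_chw / norms
```

For a 1x1 spatial map the two coincide; they differ only when the feature has spatial extent.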

By the way, some people commented that the loss didn't decrease, but in my environment, with samples generated from my own tracker, it showed a good learning curve. Here's my branch.

tripletloss on nvidia caffe 0.15.14
https://github.com/ggsato/caffe/tree/tripletloss

How can I run a pre-trained set of weights?

I am a complete Caffe noob (although familiar with neural network theory and programming more generally). I assume the pre-trained weights are color_model.caffemodel and can somehow be used to run a model on unseen images and classify them? If so, could somebody share the steps required to do this, please?

Thanks!

build caffe

./include/caffe/util/mkl_alternate.hpp:6:17: fatal error: mkl.h: No such file or directory

I did not encounter this problem when using another version of Caffe.
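A missing mkl.h usually means the build is configured against Intel MKL. A common workaround, assuming ATLAS or OpenBLAS is installed, is to switch the BLAS setting in Makefile.config before rebuilding:

```makefile
# Makefile.config
# BLAS := mkl     # requires Intel MKL (provides mkl.h)
BLAS := open      # OpenBLAS; or use: BLAS := atlas
```

This is the standard Caffe build knob, not something specific to this repo.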

Why doesn't the loss (top[0] in RankHardLossLayer) backpropagate to the bottom?

Another question: is this a triplet loss?

        if (tloss1 > 0)
        {
            for (int k = 0; k < dim; k++)
            {
                fori_diff[k] += (fneg[k] - fpos[k]); // / (pairNum * 1.0 - 2.0);
                fpos_diff[k] += -fori[k];            // / (pairNum * 1.0 - 2.0);
                fneg_diff[k] += fori[k];
            }
        }
        if (tloss2 > 0)
        {
            for (int k = 0; k < dim; k++)
            {
                fori_diff[k] += -fpos[k];            // / (pairNum * 1.0 - 2.0);
                fpos_diff[k] += fneg[k] - fori[k];   // / (pairNum * 1.0 - 2.0);
                fneg_diff[k] += fpos[k];
            }
        }

The loss value does not decrease.

Hi, xiaolong,

I am trying to use your code to extract features from pedestrian images, which are generated from a surveillance video. The input images are composed of 3 color channels and one optical-flow channel. However, the loss value does not decrease at all. Could you give me some suggestions?

Description of files with mined patches

Hi, can you quickly tell me the source of the data in each file?
E.g. I could open the YouTube.tar.gz files and they contain patches in a directory structure. However, what do collect2, collect3, and collect_scale contain?

collect2.tar.gz
collect3.tar
collect_scale.tar.gz
collect.tar
YouTube.tar.gz

Thanks,
Akshay

rank_hard_loss_layer.cpp

How much does tloss2 contribute in the forward pass? I notice that you only have the term D(x, x−), rather than D(x+, x−), in your ICCV paper.
