
caffe-video_triplet's Introduction

caffe-video_triplet

This code is built on Caffe (project site).

It is the implementation for training the siamese-triplet network described in the paper:

Xiaolong Wang and Abhinav Gupta. Unsupervised Learning of Visual Representations using Videos. Proc. of IEEE International Conference on Computer Vision (ICCV), 2015. pdf

Codes

Training scripts are in rank_scripts/rank_alexnet:

Since the siamese networks share their weights, only one network is defined in the prototxt.

The input to the network is pairs of image patches. Each pair consists of similar patches taken from the same video track. A label specifies which video the patches come from: patches from different videos carry different labels (the actual value does not matter, as long as it is an integer). This way, the third (negative) patch can be drawn from pairs with a different label.
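As a sketch, a training list following this scheme could be generated as below. The file paths and the helper name are illustrative, not part of the release:

```python
# Build a training list: each video track contributes a pair of similar
# patches, and both patches of a pair carry that video's integer label.
# Patches with different labels can then serve as negatives for each other.
def build_train_list(video_patches):
    """video_patches: one (patch_a_path, patch_b_path) tuple per video."""
    lines = []
    for label, (patch_a, patch_b) in enumerate(video_patches):
        # Both patches of a pair share the same label (same video).
        lines.append(f"{patch_a} {label}")
        lines.append(f"{patch_b} {label}")
    return lines

example = build_train_list([
    ("vid0/patch_t0.jpg", "vid0/patch_t1.jpg"),
    ("vid1/patch_t0.jpg", "vid1/patch_t1.jpg"),
])
```

Any integer labeling works, as long as patches from the same video share a label and patches from different videos do not.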

In the loss, for each pair of patches, the layer tries to find a third, negative patch within the same batch. There are two ways to do this: random selection and hard negative mining.

In the prototxt:

layer {
  name: "loss"
  type: "RankHardLoss"
  rank_param {
    neg_num: 4
    pair_size: 2
    hard_ratio: 0.5
    rand_ratio: 0.5
    margin: 1
  }
  bottom: "norml2"
  bottom: "label"
}

neg_num is the number of negative patches sampled for each pair of patches; if it is 4, each pair yields 4 triplets. pair_size = 2 simply means the inputs come in pairs of patches. hard_ratio = 0.5 means half of the negative patches are hard examples; rand_ratio = 0.5 means the other half are randomly selected. To start, you can simply set rand_ratio = 1 and hard_ratio = 0. The margin of the ranking loss needs to be tuned per task; trying margin = 0.5 or 0.1 might make a difference for other tasks.
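A NumPy sketch of how these parameters interact is given below. The batch layout and function name are assumptions for illustration, not the RankHardLoss layer's actual implementation:

```python
import numpy as np

def rank_hard_loss(feats, labels, neg_num=4, hard_ratio=0.5,
                   margin=1.0, rng=None):
    """feats: (N, D) L2-normalized rows laid out as (anchor, positive)
    pairs; labels: per-row video ids (both rows of a pair share one)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_hard = int(neg_num * hard_ratio)
    n_rand = neg_num - n_hard          # rand_ratio fills the remainder
    total = 0.0
    for i in range(0, len(feats), 2):  # pair_size = 2
        anchor, pos = feats[i], feats[i + 1]
        cand = np.where(labels != labels[i])[0]   # other videos only
        sims = feats[cand] @ anchor
        hard = cand[np.argsort(-sims)[:n_hard]]   # most similar = hardest
        rand = rng.choice(cand, size=n_rand, replace=False)
        for j in np.concatenate([hard, rand]):
            neg = feats[j]
            # Hinge: similarity to the positive should beat similarity
            # to the negative by at least `margin`.
            total += max(0.0, margin - anchor @ pos + anchor @ neg)
    return float(total)
```

With neg_num = 4 and hard_ratio = rand_ratio = 0.5, each pair contributes two hard-mined and two random triplets, matching the prototxt above.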

Models

We offer two models trained with our method:

  • color model: trained with RGB images.
  • gray model: trained with grayscale images (3-channel inputs).
  • prototxt: the prototxt for both models.
  • mean: the mean file.

In case our server is down, the models can be downloaded from dropbox:

  • color model: trained with RGB images.
  • gray model: trained with grayscale images (3-channel inputs).

Training Patches

The unsupervised mined patches can be downloaded from here: https://www.dropbox.com/sh/vgp2k3mdi61sdgr/AAB9vwX140jppHjp33n4UoO7a?dl=0

Each tar file contains a different set of patches. Note that YouTube.tar.gz can be extracted with "tar xf" even though it is named as a ".tar.gz" file.

The example of the training list can be downloaded from here: https://www.dropbox.com/s/tnbu2myy7g0i6l6/trainlist.txt?dl=0

caffe-video_triplet's People

Contributors

xiaolonw


caffe-video_triplet's Issues

triplet loss doesn't converge, what can I do?

hi,
I used your code to train with the triplet loss, but the loss value does not converge: it stays equal to the margin the whole time when I train on face images with your model. Why? Looking forward to your reply, thank you very much!

about diff

@xiaolonw
if (tloss1 > 0)
{
    for (int k = 0; k < dim; k++)
    {
        fori_diff[k] += (fneg[k] - fpos[k]); // / (pairNum * 1.0 - 2.0);
        fpos_diff[k] += -fori[k];            // / (pairNum * 1.0 - 2.0);
        fneg_diff[k] += fori[k];
    }
}
if (tloss2 > 0)
{
    for (int k = 0; k < dim; k++)
    {
        fori_diff[k] += -fpos[k];            // / (pairNum * 1.0 - 2.0);
        fpos_diff[k] += fneg[k] - fori[k];   // / (pairNum * 1.0 - 2.0);
        fneg_diff[k] += fpos[k];
    }
}
Why is the gradient computed differently when loss1 > 0 versus loss2 > 0?
loss = f(a)·f(n) − f(a)·f(p)
Taking partial derivatives, the diff when loss1 > 0 matches this loss, but how is the formula for loss2 > 0 derived?
Thanks a lot!
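Working backwards from the diffs in the snippet above, the two hinge terms are consistent with the following (writing f(a), f(p), f(n) for fori, fpos, fneg and m for the margin; this is a reconstruction from the gradient code, not a formula quoted from the paper):

```latex
\ell_1 = \max\big(0,\; m + f(a)\cdot f(n) - f(a)\cdot f(p)\big), \qquad
\frac{\partial \ell_1}{\partial f(a)} = f(n) - f(p), \quad
\frac{\partial \ell_1}{\partial f(p)} = -f(a), \quad
\frac{\partial \ell_1}{\partial f(n)} = f(a)

\ell_2 = \max\big(0,\; m + f(p)\cdot f(n) - f(a)\cdot f(p)\big), \qquad
\frac{\partial \ell_2}{\partial f(a)} = -f(p), \quad
\frac{\partial \ell_2}{\partial f(p)} = f(n) - f(a), \quad
\frac{\partial \ell_2}{\partial f(n)} = f(p)
```

Under this reading, the first term anchors on the original patch and the second anchors on the positive patch, which would explain why the two branches compute different diffs.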

Corrupted files on dropbox

Dear Wang,

every time I download the files "collect.tar" and "collect2.tar.gz" I get corrupted files, while all other files work fine. Can you please upload the correct files again?

Thanks,
Biagio Brattoli

about input image list

Is the input image list ordered like 1 1 2 2 3 3 ...? Or randomly arranged? Will there be duplicated file names in the list?

about Precision

Hello Mr. Wang,
I have read your paper "Unsupervised Learning of Visual Representations using Videos" and I have some questions about precision in caffe-video_triplet.
On page 7 of your paper, you tabulate the mAP for many categories, and I want to clarify what it means. For example, for a bus: does it mean the bus appeared in the video 100 times but we only recognized it 50 times? Or does it mean we recognized the bus 100 times but it actually appeared only 50 times, so we were mistaken in 50 recognitions?

(If I haven't made my point clear, I can describe it in Chinese if that is easier to understand.)

Normalization

Hi,

Perhaps, too late to post, but anyway, thanks for sharing this project. Your paper and this project are really helpful.

Just wondering why the calculation in the Norm layer is done per channel. I understand it does a normalization to make |f(X)| = 1. Then the sum is supposed to run over the whole feature vector of an image input, but in the implementation it runs over the channels at each pixel location.

The difference may be negligible, but isn't this better?

      for (int i = 0; i < num; i++)
      {
          Dtype* features = bottom_data + i * dim;  // mutable: normalized in place
          Dtype sumdata = 0;
          for (int k = 0; k < dim; k++)
          {
              sumdata += features[k] * features[k];
          }
          // normalize the whole feature vector of this image
          sumdata = sqrt(sumdata) + 1e-6;
          for (int k = 0; k < dim; k++)
          {
              features[k] /= sumdata;
          }
      }
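In NumPy terms, the contrast between the per-image normalization proposed above and the per-location normalization in the repo can be sketched as follows (shapes and function names are illustrative):

```python
import numpy as np

def l2_normalize_whole(feat, eps=1e-6):
    """Normalize the whole feature vector of one image: |f(X)| = 1."""
    return feat / (np.sqrt(np.sum(feat ** 2)) + eps)

def l2_normalize_per_location(feat_chw, eps=1e-6):
    """Normalize across channels independently at each spatial
    location (axis 0 is the channel axis), as the Norm layer here
    is described to do."""
    norms = np.sqrt(np.sum(feat_chw ** 2, axis=0, keepdims=True)) + eps
    return feat_chw / norms
```

For a 1x1 spatial map the two coincide; they differ only when the feature has spatial extent.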

By the way, some people commented that the loss didn't decrease, but in my environment, with samples generated from my own tracker, it showed a good learning curve. Here's my branch.

tripletloss on nvidia caffe 0.15.14
https://github.com/ggsato/caffe/tree/tripletloss

How can I run a pre-trained set of weights?

I am a complete Caffe noob (although familiar with neural network theory and programming more generally). I assume the pre-trained weights are color_model.caffemodel and can somehow be used to run a model on unseen images and classify them? If so, could somebody share the steps required to do this, please?

Thanks!

build caffe

./include/caffe/util/mkl_alternate.hpp:6:17: fatal error: mkl.h: No such file or directory

I did not encounter this problem when using another version of Caffe.
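A missing mkl.h usually means the build is configured against Intel MKL. A common workaround, assuming ATLAS or OpenBLAS is installed, is to switch the BLAS setting in Makefile.config before rebuilding:

```makefile
# Makefile.config
# BLAS := mkl     # requires Intel MKL (provides mkl.h)
BLAS := open      # OpenBLAS; or use: BLAS := atlas
```

This is the standard Caffe build knob, not something specific to this repo.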

Why doesn't the loss (top[0] in RankHardLossLayer) backpropagate to the bottom?

Another question: is this a triplet loss?

        if (tloss1 > 0)
        {
            for (int k = 0; k < dim; k++)
            {
                fori_diff[k] += (fneg[k] - fpos[k]); // / (pairNum * 1.0 - 2.0);
                fpos_diff[k] += -fori[k];            // / (pairNum * 1.0 - 2.0);
                fneg_diff[k] += fori[k];
            }
        }
        if (tloss2 > 0)
        {
            for (int k = 0; k < dim; k++)
            {
                fori_diff[k] += -fpos[k];            // / (pairNum * 1.0 - 2.0);
                fpos_diff[k] += fneg[k] - fori[k];   // / (pairNum * 1.0 - 2.0);
                fneg_diff[k] += fpos[k];
            }
        }

The loss value does not decrease.

Hi, xiaolong,

I am trying to use your code to extract features from pedestrian images, which are generated from a surveillance video. The input images are composed of 3 color channels and one optical-flow channel. However, the loss value does not decrease at all. Could you give me some suggestions?

Description of files with mined patches

Hi, can you quickly tell me the source of the data in each file?
E.g. I could open the YouTube.tar.gz files and they contain patches in a directory structure. However, what do collect2, collect3, and collect_scale contain?

collect2.tar.gz
collect3.tar
collect_scale.tar.gz
collect.tar
YouTube.tar.gz

Thanks,
Akshay

rank_hard_loss_layer.cpp

How much does tloss2 contribute in the forward pass? I notice that you only have the term D(x, x−), rather than D(x+, x−), in your ICCV paper.
