alfredxiangwu / lightcnn

A Light CNN for Deep Face Representation with Noisy Labels, TIFS 2018

Home Page: https://arxiv.org/abs/1511.02683

License: MIT License

Python 100.00%

pytorch face-recognition lightcnn

lightcnn's Introduction

Light CNN for Deep Face Recognition, in PyTorch

A PyTorch implementation of A Light CNN for Deep Face Representation with Noisy Labels from the paper by Xiang Wu, Ran He, Zhenan Sun and Tieniu Tan. The official and original Caffe code can be found here.

Updates
Installation
Datasets
Training
Evaluation
Performance
Citation
References

Updates

Feb 9, 2022
- Light CNN v4 pretrained model is released.
Jan 17, 2018
- Light CNN-29 v2 model and training code are released. The 100% - EER on LFW achieves 99.43%.
- The performance of set 1 on MegaFace achieves 76.021% for rank-1 accuracy and 89.740% for TPR@FAR=10^-6.
Sep 12, 2017
- Light CNN-29 model and training code are released. The 100% - EER on LFW achieves 99.40%.
- The performance of set 1 on MegaFace achieves 72.704% for rank-1 accuracy and 85.891% for TPR@FAR=10^-6.
Jul 12, 2017
- Light CNN-9 model and training code are released. The 100% - EER on LFW obtains 98.70%.
- The performance of set 1 on MegaFace achieves 65.782% for rank-1 accuracy and 76.288% for TPR@FAR=10^-6.
Jul 4, 2017
- The repository was built.

Installation

Install PyTorch following the instructions on its website.
Clone this repository.
- Note: We have so far only run it on Python 2.7.

Datasets

Download a face dataset such as CASIA-WebFace, VGG-Face, or MS-Celeb-1M.
- The MS-Celeb-1M clean list is uploaded: Baidu Yun, Google Drive.
All face images are converted to gray-scale and normalized to 144x144 according to landmarks.
Using the five facial landmarks, we rotate the face so that the two eye points are horizontal, then fix the distance between the midpoint of the eyes and the midpoint of the mouth (ec_mc_y) and the y coordinate of the midpoint of the eyes (ec_y).
The aligned LFW images are uploaded on Baidu Yun.

Dataset	size	ec_mc_y	ec_y
Training set	144x144	48	48
Testing set	128x128	48	40
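
The alignment described above amounts to a similarity transform (rotation, uniform scale, translation). The following is a minimal sketch of how such a transform could be derived, not the authors' alignment script. It assumes landmarks are (x, y) pixel coordinates, that ec_mc_y is measured vertically after the eyes have been made horizontal, and that the eye midpoint is centered horizontally in the crop (the text does not state the x position). The names `alignment_matrix` and `apply_affine` are illustrative only.

```python
import math

def alignment_matrix(left_eye, right_eye, mouth_center, size, ec_mc_y, ec_y):
    """Return a 2x3 affine matrix [[a, b, tx], [c, d, ty]] that maps source
    pixel coordinates into an aligned size x size crop."""
    # Rotate by -angle so that the line through both eyes becomes horizontal.
    angle = math.atan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0])
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)

    eye_c = ((left_eye[0] + right_eye[0]) / 2.0,
             (left_eye[1] + right_eye[1]) / 2.0)

    # Vertical eye-to-mouth distance after rotation (assumed meaning of ec_mc_y).
    dx = mouth_center[0] - eye_c[0]
    dy = mouth_center[1] - eye_c[1]
    dy_rot = sin_a * dx + cos_a * dy
    scale = ec_mc_y / dy_rot

    a, b = scale * cos_a, -scale * sin_a
    c, d = scale * sin_a, scale * cos_a

    # Translate so the eye midpoint lands at (size / 2, ec_y).
    # Horizontal centering is an assumption of this sketch.
    tx = size / 2.0 - (a * eye_c[0] + b * eye_c[1])
    ty = ec_y - (c * eye_c[0] + d * eye_c[1])
    return [[a, b, tx], [c, d, ty]]

def apply_affine(m, point):
    """Apply a 2x3 affine matrix to a single (x, y) point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

With the testing-set values (size 128, ec_mc_y 48, ec_y 40), the eye midpoint maps to y = 40 and the mouth midpoint to y = 88. The resulting matrix has the 2x3 layout that common image-warping routines accept.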

Training

To train Light CNN with the train script, specify the parameters listed in train.py as flags, or change them manually in the script.

python train.py --root_path=/path/to/your/datasets/ \
		--train_list=/path/to/your/train/list.txt \
		--val_list=/path/to/your/val/list.txt \
		--save_path=/path/to/your/save/path/ \
		--model="LightCNN-9/LightCNN-29" --num_classes=n

Tips:
- The train and val lists follow the Caffe format. Details of the data loader are in load_imglist.py. Alternatively, you can use torchvision.datasets.ImageFolder to load your datasets.
- The num_classes denotes the number of identities in your training dataset.
- When training with PyTorch, you can set a larger learning rate than with Caffe, and Light CNN converges faster in PyTorch than in Caffe.
- We enlarge the learning rate for the parameters of fc2, which may lead to better performance. If training collapses on your own datasets, you can decrease it.
- We modify the implementation of SGD with momentum, since the official PyTorch implementation differs from Sutskever et al. The details are shown here.
- The training datasets for LightCNN-29v2 are CASIA-WebFace and MS-Celeb-1M; therefore, num_classes is 80013.
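
On the momentum tip above: as the PyTorch documentation describes it, torch.optim.SGD accumulates raw gradients in its velocity and multiplies by the learning rate at step time, whereas the Sutskever et al. form folds the learning rate into the velocity. The two agree while the learning rate is constant and diverge once it changes. The scalar sketch below illustrates the difference in plain Python. It is not the repository's optimizer, and both function names are illustrative.

```python
def sgd_pytorch_style(grads, lrs, momentum=0.9, p=0.0):
    """v <- mu * v + g ; p <- p - lr * v (form described in the PyTorch docs)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = momentum * v + g
        p = p - lr * v
    return p

def sgd_sutskever_style(grads, lrs, momentum=0.9, p=0.0):
    """v <- mu * v - lr * g ; p <- p + v (Sutskever et al. form)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = momentum * v - lr * g
        p = p + v
    return p

grads = [1.0, 1.0, 1.0]
constant_lr = [0.1, 0.1, 0.1]
decayed_lr = [0.1, 0.01, 0.001]

# Same trajectory while the learning rate stays fixed.
same = abs(sgd_pytorch_style(grads, constant_lr)
           - sgd_sutskever_style(grads, constant_lr))

# Different trajectories once the learning rate is decayed between steps.
diff = abs(sgd_pytorch_style(grads, decayed_lr)
           - sgd_sutskever_style(grads, decayed_lr))
```

In the PyTorch-style update, a learning-rate drop immediately rescales the whole accumulated velocity; in the Sutskever-style update, past contributions keep the learning rate they were added with. This matters for step schedules like the one train.py uses.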

Evaluation

To evaluate a trained network:

python extract_features.py --resume=/path/to/your/model \
			   --root_path=/path/to/your/datasets/ \
			   --img_list=/path/to/your/list.txt \
			   --save_path=/path/to/your/save/path/ \
			   --model="LightCNN-9/LightCNN-29/LightCNN-29v2" \
			   --num_classes=n  # 79077 for LightCNN-9/LightCNN-29, 80013 for LightCNN-29v2

You can use vlfeat or sklearn to evaluate the features on ROC and obtain EER and TPR@FPR for your testing datasets.
The model of LightCNN-9 is released on Google Drive.
- Note that the released checkpoint contains the full state of the Light CNN module and the optimizer. Details on loading the model can be found in train.py.
The model of LightCNN-29 is released on Google Drive.
The model of LightCNN-29 v2 is released on Google Drive.
The features of lfw and megaface of LightCNN-9 are released.
The model of LightCNN v4 is released on Google Drive.
- The detailed structure of LightCNN v4 is shown in light_cnn_v4.py
- The input is an aligned 128*128 BGR face image.
- The input pixel value is normalized by mean ([0.0, 0.0, 0.0]) and std ([255.0, 255.0, 255.0]).
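
For the ROC-based evaluation mentioned above, the sketch below shows what EER and TPR@FAR computations look like using only the standard library. It is an illustration, not a replacement for vlfeat or sklearn. It assumes one similarity score per pair (higher means more similar) and a label of 1 for same-identity pairs and 0 otherwise. All function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (far, tpr) pairs, one per candidate threshold 'score >= t'.

    Quadratic in the number of pairs; adequate for a few thousand pairs.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / float(n_neg), tp / float(n_pos)))
    # Threshold above every score: nothing is accepted.
    points.append((0.0, 0.0))
    return points

def equal_error_rate(scores, labels):
    """Error rate at the threshold where FAR and FRR (1 - TPR) are closest."""
    far, tpr = min(roc_points(scores, labels),
                   key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (far + (1.0 - tpr)) / 2.0

def tpr_at_far(scores, labels, max_far):
    """Highest TPR among thresholds whose FAR does not exceed max_far."""
    return max(tpr for far, tpr in roc_points(scores, labels) if far <= max_far)
```

The "100% - EER" column in the tables below corresponds to `1 - equal_error_rate(...)` expressed as a percentage.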

Performance

The Light CNN performance on lfw 6,000 pairs.

Model	100% - EER	TPR@FAR=1%	TPR@FAR=0.1%	TPR@FAR=0
LightCNN-9	98.70%	98.47%	95.13%	89.53%
LightCNN-29	99.40%	99.43%	98.67%	95.70%
LightCNN-29v2	99.43%	99.53%	99.30%	96.77%
LightCNN v4	99.67%	99.67%	99.57%	99.27%

The Light CNN performance on lfw BLUFR protocols

Model	VR@FAR=0.1%	DIR@FAR=1%
LightCNN-9	96.80%	83.06%
LightCNN-29	98.95%	91.33%
LightCNN-29v2	99.41%	94.43%

The Light CNN performance on MegaFace

Model	Rank-1	TPR@FAR=1e-6
LightCNN-9	65.782%	76.288%
LightCNN-29	72.704%	85.891%
LightCNN-29v2	76.021%	89.740%

Citation

If you use our models, please cite the following paper:

@article{wu2018light,
  title={A light CNN for deep face representation with noisy labels},
  author={Wu, Xiang and He, Ran and Sun, Zhenan and Tan, Tieniu},
  journal={IEEE Transactions on Information Forensics and Security},
  volume={13},
  number={11},
  pages={2884--2896},
  year={2018},
  publisher={IEEE}
}

References

lightcnn's Issues

LightCNN_9Layers_checkpoint.pth.tar and 29Layer are wrong tar file

Sorry, I treat it as a tar file

the 29layer converge problem

Hello Wu Xiang. From your results, the 29-layer model does a good job at face recognition. From the Python script, it seems that your 29-layer Light CNN model is made by inserting several ResNet blocks into the original 9-layer model. I used it to train a face model, but the Caffe loss converges to about 10, so it seems the model did not train well. My question is: when using Caffe, how small can the loss get?

high memory usage low GPU util

Hello,

I run the "train.py". But I found that my GPU has a high memory usage with a low GPU util. Could you please help me about that? My worker number is 32. Thank you!

loss changes to nan when training on a new dataset

Hi, @AlfredXiangWu !

When I try to train LightCNN on my dataset, the loss changes to nan. The logs are below:

lr: 0.01
Epoch: [0][0/2402]      Time 10.607 (10.607)    Data 0.372 (0.372)      Loss 8.5419 (8.5419)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
Epoch: [0][100/2402]    Time 0.086 (0.191)      Data 0.000 (0.004)      Loss 8.6579 (8.5740)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.232)
Epoch: [0][200/2402]    Time 0.087 (0.139)      Data 0.000 (0.002)      Loss 8.4459 (8.6040)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.117)
Epoch: [0][300/2402]    Time 0.086 (0.122)      Data 0.000 (0.002)      Loss 8.7730 (8.6280)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.078)
Epoch: [0][400/2402]    Time 0.087 (0.113)      Data 0.000 (0.001)      Loss 8.6854 (8.6500)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.058)
Epoch: [0][500/2402]    Time 0.088 (0.108)      Data 0.000 (0.001)      Loss 8.6538 (8.6725)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.047)
Epoch: [0][600/2402]    Time 0.093 (0.105)      Data 0.000 (0.001)      Loss 8.6870 (8.6952)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.039)
Epoch: [0][700/2402]    Time 0.086 (0.103)      Data 0.000 (0.001)      Loss 8.8851 (8.7174)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.033)
Epoch: [0][800/2402]    Time 0.092 (0.101)      Data 0.000 (0.001)      Loss 9.0064 (8.7384)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.029)
Epoch: [0][900/2402]    Time 0.090 (0.100)      Data 0.000 (0.001)      Loss 8.7320 (8.7574)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.026)
Epoch: [0][1000/2402]   Time 0.088 (0.099)      Data 0.000 (0.001)      Loss 8.8482 (8.7760)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.023)
Epoch: [0][1100/2402]   Time 0.088 (0.098)      Data 0.000 (0.001)      Loss 8.9686 (8.7938)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.021)
Epoch: [0][1200/2402]   Time 0.092 (0.098)      Data 0.000 (0.001)      Loss 9.0300 (8.8087)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.020)
Epoch: [0][1300/2402]   Time 0.085 (0.097)      Data 0.000 (0.001)      Loss 9.0934 (8.8245)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.018)
Epoch: [0][1400/2402]   Time 0.090 (0.097)      Data 0.000 (0.001)      Loss 9.0064 (8.8394)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.017)
Epoch: [0][1500/2402]   Time 0.089 (0.096)      Data 0.000 (0.001)      Loss 9.0079 (8.8531)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.016)
Epoch: [0][1600/2402]   Time 0.089 (0.096)      Data 0.000 (0.000)      Loss 9.0939 (8.8666)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.015)
Epoch: [0][1700/2402]   Time 0.088 (0.095)      Data 0.000 (0.000)      Loss 9.2418 (8.8800)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.014)
Epoch: [0][1800/2402]   Time 0.108 (0.095)      Data 0.000 (0.000)      Loss 9.1652 (8.8923)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.013)
Epoch: [0][1900/2402]   Time 0.089 (0.095)      Data 0.000 (0.000)      Loss 9.1364 (8.9050)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.012)
Epoch: [0][2000/2402]   Time 0.091 (0.095)      Data 0.000 (0.000)      Loss 9.1717 (8.9169)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.012)
Epoch: [0][2100/2402]   Time 0.092 (0.095)      Data 0.000 (0.000)      Loss 9.0747 (8.9283)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.011)
Epoch: [0][2200/2402]   Time 0.093 (0.095)      Data 0.000 (0.000)      Loss 9.2118 (8.9398)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.011)
Epoch: [0][2300/2402]   Time 0.093 (0.094)      Data 0.000 (0.000)      Loss 9.2229 (8.9508)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.010)
Epoch: [0][2400/2402]   Time 0.095 (0.094)      Data 0.000 (0.000)      Loss 9.2780 (8.9618)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.010)
lr: 0.01
Epoch: [1][0/2402]      Time 0.493 (0.493)      Data 0.447 (0.447)      Loss 8.8280 (8.8280)    Prec@1 0.000 (0.000)    Prec@5 0.000 (0.000)
Epoch: [1][100/2402]    Time 0.092 (0.095)      Data 0.000 (0.005)      Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (2.498)
Epoch: [1][200/2402]    Time 0.092 (0.094)      Data 0.000 (0.002)      Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (1.255)
Epoch: [1][300/2402]    Time 0.094 (0.094)      Data 0.000 (0.002)      Loss nan (nan)  Prec@1 0.000 (0.000)    Prec@5 0.000 (0.838)

...

I tried a few times but always hit this problem. Do you have any ideas on how to solve it?

Thanks!

pytorch version

Hi AlfredXiangWu, could you tell me which PyTorch version you used for this work? I tested several versions, but none of them worked.

What if image is RGB?

If I need to extract face representations from RGB images, how should I change the network structure to fit this task, and how should I preprocess the training and test data?

Why always use SGD with momentum instead of Adam?

Hi:
Could you please tell me why you always use SGD with momentum instead of Adam when training a network? Many thanks.

which MS Celeb 1M dataset to use

Which MS-Celeb-1M dataset should I use: aligned, cropped, or thumbnails? For which of these datasets did you upload the MS-Celeb-1M clean list?

Checkpoint Loading Problem on CPU

I am using the LightCNN for feature extraction on CPU.
Initially, I used,
model_path = 'LightCNN/LightCNN_9Layers_checkpoint.pth.tar'
model = LightCNN_9Layers(num_classes=79077)
#model.eval()
checkpoint = torch.load(model_path, map_location=lambda storage, loc: storage)
model.load_state_dict(checkpoint['state_dict'])

I got the error below
RuntimeError: Error(s) in loading state_dict for network_9layers:
Missing key(s) in state_dict: "features.0.filter.bias", "features.0.filter.weight", ...

Unexpected key(s) in state_dict: "module.features.0.filter.weight", "module.features.0.filter.bias", ...

So I changed the last part of the code to
#load the model
checkpoint = torch.load(model_path, map_location=lambda storage, loc: storage)
model.load_state_dict(checkpoint['state_dict'], strict=False)

After that, everything runs without errors, and the features I extract have the expected dimensionality. However, the result I get from a simple cosine similarity test on two identical images is far below expectation. So I began to wonder whether calling load_state_dict this way leaves the weights randomly initialized.

Help will be appreciated. Thanks
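
The error message above shows checkpoint keys prefixed with "module.", which is the prefix nn.DataParallel adds when a wrapped model is saved. With strict=False, load_state_dict skips keys that do not match, so in this situation the weights probably stay at their random initialization. One possible workaround, sketched here under the assumption that the prefix is the only mismatch, is to strip it before loading. The helper name `strip_module_prefix` is illustrative and not part of this repository.

```python
from collections import OrderedDict

def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned

# Hypothetical usage with the names from the snippet above:
# checkpoint = torch.load(model_path, map_location=lambda storage, loc: storage)
# model.load_state_dict(strip_module_prefix(checkpoint['state_dict']))
# model.eval()
```

Keeping the default strict loading after stripping the prefix means any remaining mismatch raises an error instead of being silently ignored.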

What is the val_list of CASIA-WebFace when you pre-trained the LightCNN-29v2?

Hi, could you please provide the val_list of CASIA-WebFace when you pre-trained the LightCNN-29v2?

Thanks.

Pytorch code training problem

Hi!
We are trying to train LightCNN-9 or LightCNN-29 using your PyTorch code and your default parameters.
But the result is always NaN, from the first iteration. We tried it on the CASIA and Celeb datasets with the same results. Did you use this code for training in your paper? Or are there some tricks?

forward pass example

any forward pass example of trained model in python ?

Resume from checkpoint

Sairam.
Hi, I trained LightCNN-29 v2 on the MS-Celeb database for 14 epochs. The learning rate was reduced from 0.001 to 0.0004575 with a step size of 10. Validation accuracy improved from 86 to 95.95 and the average loss dropped from 11 to 0.28 after 14 epochs.
Now, when I resume from the saved model, it starts well, as shown below:
Test set: Average loss: 0.28508767582333855, Accuracy: (95.95295308032168)
But then the loss increases and precision does not improve, as shown below:
Epoch: [14][0/38671] Loss 0.1796 (0.1796) Prec@1 94.531 (94.531) Prec@5 97.656 (97.656)
Epoch: [14][100/38671] Loss 0.2190 (0.2752) Prec@1 93.750 (93.379) Prec@5 99.219 (98.337)
Epoch: [14][200/38671] Loss 0.4000 (0.3149) Prec@1 89.062 (92.405) Prec@5 96.875 (98.084)
......
Epoch: [14][6100/38671] Loss 0.8118 (0.5597) Prec@1 85.938 (86.939) Prec@5 92.188 (95.961)
Epoch: [14][6200/38671] Loss 0.7565 (0.5611) Prec@1 80.469 (86.915) Prec@5 93.750 (95.945)
Epoch: [14][6300/38671] Loss 0.6364 (0.5625) Prec@1 84.375 (86.893) Prec@5 96.875 (95.927)

Did you face the above issue? Could you help me what could be issue?
Thanks.
Darshan,SSSIHL.

Light CNN question

MFM operation implement?

In your paper, you write mfm operation like this:

But in your code, you only implement equation (1). Why not equation (4)? Does equation (4) perform poorly, or is there some other reason?
If you have implemented equation (4), could you upload the training and evaluation results?
Thanks

In the phase of training/valing, Prec@1 and Prec@5

Validation images in batches.

Hi,

I am trying to send a batch of 64 images and trying to extract the features all at once from the model. However, after a certain number of images I get erroneous image features. Any idea why could that be?
I modified the extract_features.py file to do it.

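Is PyTorch accuracy far lower than Caffe?

Comparing the results here with your paper, these appear to be the latest results trained on the Microsoft dataset (MS-Celeb-1M), and they show that the PyTorch accuracy on LFW is slightly lower than the Caffe accuracy.

Using only CASIA-WebFace as the training set with your PyTorch training code (LightCNN-9), I only reach 100% - EER of 97.2 and TPR@FAR=1% of 94.333 on the LFW 6,000-pair evaluation, and the BLUFR results are 82.67 and 51.72. These are far below the results in your paper, yet with the same training data and the same network, Caffe reproduces results close to your paper.

Have you evaluated on LFW using the PyTorch framework with only CASIA-WebFace as the training set? If so, could you share the results, or do you have any suggestions for improving the PyTorch accuracy?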

How can I set train.list and val.list for training?

I appreciate your code.

I downloaded the MS-Celeb-1M clean list.
How do you set train_list and val_list?
My guess is below. Is it all right?
I will split the MS-Celeb-1M clean list into train_list and val_list manually.
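
One possible way to do such a split is sketched below. It is a hypothetical helper, not the procedure the authors used. It assumes every line of the clean list has the form "path label" and holds out a fraction of the images of each identity, so every identity stays represented in the training list. The name `split_list` and its parameters are illustrative.

```python
import random
from collections import defaultdict

def split_list(lines, val_fraction=0.1, seed=0):
    """Split 'path label' lines into (train, val) lists, per identity."""
    by_label = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The label is the last whitespace-separated field on the line.
        _, label = line.rsplit(None, 1)
        by_label[label].append(line)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label, key=int):
        items = by_label[label]
        rng.shuffle(items)
        n_val = int(len(items) * val_fraction)
        # Keep at least one image of every identity in the training split.
        n_val = min(n_val, len(items) - 1)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```

Writing the two returned lists to separate text files, one entry per line, gives files that can be passed as --train_list and --val_list.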

How do you deal with this situation?

In LFW evaluation of the 6,000 test images, how do you deal with images where faces cannot be detected by the dlib face detector?

Can you please provide the face alignment script to reproduce the evaluation results?

What is the typical training accuracy while training on MS-celeb-1m 70k?

Hi, I wonder what your typical accuracy is during training. I get a top-1 accuracy of 95.5% with a learning rate of 1e-5. In your experiments, did you get an obviously higher accuracy on the training set? In other words, should I keep decreasing the learning rate?
Thank you!

Can not extract the checkpoint file from the LightCNN_checkpoint.pth.tar file.

Hi,

It seems that I can not extract the checkpoint file from the LightCNN_checkpoint.pth.tar file. The LightCNN_checkpoint.pth.tar in the google drive is broken. Can you update a new one @AlfredXiangWu ?

When did loss begin to decrease?

We are experimenting with the mfm29 architecture.
We implemented the paper in Caffe, and we also ran your PyTorch code as-is.
In both experiments, the loss stays at about 10.7 and does not decrease. I trained for more than 60 epochs, but the situation is the same.

In your experiment, I wonder when the loss starts to decrease.
I would also like to know how long it took until training completed.

thanks.

how to choose learning rate

Hi @AlfredXiangWu !
Bother you again Orz
I want to train on the MS-Celeb-1M dataset. Can I just use the default scale and step (0.457305 and 5, respectively) and change the number of epochs to 250? I am not sure whether it will converge.
Or should I change the parameters?

Thanks a lot!

Some questions.

Is image normalization not needed?

I know that many face recognition frameworks normalize the positions of the face landmarks. Does LightCNN require such a process?

In addition, what size should the image set be? I would like to use an image set consisting of 700,000 images and 8,500 labels. Is that possible?

Sorry for my limited English. Thanks for reading.

The loss value does not converge

Hi @AlfredXiangWu ,

Thank you for your great work. I have tried training with your code on your MS-Celeb list from scratch. At first I used a learning rate of 0.001 and got NaN values after a few iterations. So I tried a smaller learning rate of 0.0001, but the loss did not decrease and was still fluctuating after 80 epochs.

Here is my configuration:

for name, value in model.named_parameters():
    if 'bias' in name:
        if 'fc2' in name:
            params += [{'params':value, 'lr': 10 * args.lr, 'weight_decay': 0}]
        else:
            params += [{'params':value, 'lr': 2 * args.lr, 'weight_decay': 0}]
    else:
        if 'fc2' in name:
            params += [{'params':value, 'lr': 10 * args.lr}]
        else:
            params += [{'params':value, 'lr': 1 * args.lr}]

Thanks,
Hai

What's the differences between /train/list.txt and /val/list.txt in content?

Hi, Alfred:
Could you please show me how the contents of `/train/list.txt` and `/val/list.txt` differ? Looking forward to your reply.

from

python train.py --root_path=/path/to/your/datasets/ \
		--train_list=/path/to/your/train/list.txt \
		--val_list=/path/to/your/val/list.txt \
		--save_path=/path/to/your/save/path/ \
		--model="LightCNN-9/LightCNN-29" --num_classes=n

Different performance under BLUFR protocol

Hi, @AlfredXiangWu
I used the features and MATLAB code you provided to test the ROC results and got exactly the same results.
But when I test under the LFW BLUFR protocol, I get different results.
First, I found that the LFW list file you provided differs from the official list provided by BLUFR, so I extracted your features following the official list.
Then I tested VR and DIR by following the example MATLAB code provided by BLUFR and got the following results:
VR@FAR=0.1%: 89.54%
DIR@FAR=1%: 57.03%
which are lower than yours.

May I ask what method you used to get the results?
In my approach, I just load your features and delete the PCA training part of the "demo_pca.m" provided by BLUFR.

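Cannot decompress the model file; is it corrupted?

Hello, after downloading the model (LightCNN_29Layers_checkpoint.pth.tar), I cannot decompress it. Am I decompressing it the wrong way? I have tried on both Windows and Linux.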

on the training

Hi, Alfred,

Regarding training, is there a limit on the number of classes or the number of images per class when using your default settings? For example, can I use 100 classes?

Data preprocessing

Hi
How can I align and crop the images? there is no part of code doing the image preprocessing

Thanks

Which training set did you use to train the released model of LightCNN-29?

Hi, @AlfredXiangWu !

I found that you had released the model of LightCNN-29 in Google Drive.
But i am wondering that which datasets did you use while training?
CASIA-WebFace, VGG-Face or MS-Celeb-1M??

Could you please give me more details?
Many thanks!!

Valentina

How to accelerate the training process？

Training a batch needs 1.5 minutes. I want to know if this is normal speed.

Model file issue?

I have downloaded the LightCNN_9Layers_checkpoint.pth.tar and LightCNN_29Layers_checkpoint.pth.tar files, but when I decompress them, an error occurs. Could you send these files to me? My email is [email protected]. Thank you very much!

About val dataset?

Hi!
When you train the model with MS-Celeb-1M, what is the validation dataset, how to generate the val dataset?

duplicate between CASIA-WebFace and MS-Celeb-1M

Hi,AlfredXiangWu.
In your latest training code the number of classes is 80013, in the previous version it was 79077, and CASIA-WebFace has 10575 identities. Do CASIA-WebFace and MS-Celeb-1M share about 10000 duplicate identities?

Is this normal?

lr: 0.0005
Epoch: [5][0/3907] Time 0.543 (0.543) Data 0.466 (0.466) Loss 6.8968 (6.8968) Prec@1 0.000 (0.000) Prec@5 1.562 (1.562)
Epoch: [5][100/3907] Time 0.182 (0.186) Data 0.000 (0.005) Loss 6.1796 (6.3176) Prec@1 0.000 (0.348) Prec@5 0.000 (2.088)
Epoch: [5][200/3907] Time 0.182 (0.185) Data 0.000 (0.003) Loss 5.9996 (6.1570) Prec@1 0.000 (0.253) Prec@5 0.000 (2.037)
Epoch: [5][300/3907] Time 0.186 (0.185) Data 0.000 (0.002) Loss 7.6430 (6.2538) Prec@1 0.000 (0.363) Prec@5 0.000 (2.370)
Epoch: [5][400/3907] Time 0.179 (0.184) Data 0.000 (0.001) Loss 5.9914 (6.2375) Prec@1 0.000 (0.273) Prec@5 0.000 (1.847)
Epoch: [5][500/3907] Time 0.184 (0.184) Data 0.000 (0.001) Loss 6.0924 (6.2430) Prec@1 0.000 (0.253) Prec@5 0.000 (1.695)
Epoch: [5][600/3907] Time 0.181 (0.184) Data 0.000 (0.001) Loss 5.2746 (6.2718) Prec@1 0.000 (0.220) Prec@5 0.000 (1.687)
Epoch: [5][700/3907] Time 0.183 (0.184) Data 0.000 (0.001) Loss 4.4259 (6.3011) Prec@1 0.000 (0.188) Prec@5 0.781 (1.493)
Epoch: [5][800/3907] Time 0.185 (0.184) Data 0.000 (0.001) Loss 6.1079 (6.3245) Prec@1 0.000 (0.165) Prec@5 0.000 (1.332)
Epoch: [5][900/3907] Time 0.188 (0.184) Data 0.000 (0.001) Loss 3.8808 (6.3316) Prec@1 0.000 (0.147) Prec@5 28.906 (1.234)
Epoch: [5][1000/3907] Time 0.186 (0.184) Data 0.000 (0.001) Loss 6.4368 (6.3144) Prec@1 0.000 (0.142) Prec@5 0.000 (1.218)
Epoch: [5][1100/3907] Time 0.182 (0.184) Data 0.000 (0.001) Loss 7.2492 (6.3498) Prec@1 0.000 (0.129) Prec@5 0.000 (1.113)
Epoch: [5][1200/3907] Time 0.183 (0.184) Data 0.000 (0.001) Loss 5.3767 (6.3607) Prec@1 0.000 (0.120) Prec@5 0.000 (1.043)
Epoch: [5][1300/3907] Time 0.182 (0.184) Data 0.000 (0.001) Loss 5.2591 (6.3527) Prec@1 0.000 (0.112) Prec@5 0.000 (0.984)
Epoch: [5][1400/3907] Time 0.182 (0.184) Data 0.000 (0.001) Loss 6.1982 (6.3612) Prec@1 0.000 (0.107) Prec@5 0.000 (0.960)
Epoch: [5][1500/3907] Time 0.183 (0.184) Data 0.000 (0.001) Loss 7.2118 (6.4134) Prec@1 0.000 (0.100) Prec@5 0.000 (0.908)

After five epoch, training data is shown above. I don't know if this is normal?

Garbled output problem

Hello. When I read in an extracted .feat feature file, what I get is a mass of garbled characters:

L�A��6�sYcAЋ���W�A�i�A�����Ni�9��A�wFA��;�P�

Could this be caused by a Python 2 encoding problem?
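
Output like the above is what raw binary data looks like when printed as text. The sketch below decodes such a file under the assumption that a .feat file holds nothing but consecutive little-endian float32 values. That assumption should be checked against save_feature in extract_features.py before relying on it. The name `read_feat` is illustrative.

```python
import struct

def read_feat(path):
    """Read a file of raw little-endian float32 values into a list of floats."""
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) % 4 != 0:
        raise ValueError("file size is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack("<%df" % count, raw))
```

If the assumption holds, the length of the returned list equals the feature dimension, which gives a quick sanity check on the decoding.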

Can not extract the checkpoint file from the LightCNN_checkpoint.pth.tar file.

When I load LightCNN_checkpoint.pth.tar with extract_features.py, I get an error such as 'unexpected key "module.features.0.filter.weight" in state_dict'. What can I do? @AlfredXiangWu

light 29V2 FC2 layer no bias

Hi, Xiang,

I noticed that you set the bias of the FC2 layer to 'False' (LightCNN-29 v2). Is there any special reason for that?

Best regards,
Ming

Issue loading the model

Hi!

Could you please assist me with the following issue.

I try to load the LightCNN_9Layers checkpoint model for inference using the following code(following the code from extract_features.py)

from LightCNN import light_cnn
model = nn.DataParallel(light_cnn.LightCNN_9Layers(num_classes=79077)).cuda()
model.load_state_dict(torch.load('LightCNN_9Layers_checkpoint.pth.tar')['state_dict'])

I get the following error

	While copying the parameter named "module.features.0.filter.weight", whose dimensions in the model are torch.Size([96, 1, 3, 3]) and whose dimensions in the checkpoint are torch.Size([96, 1, 5, 5]).

What am I doing wrong?
My guess is that kernel size is wrong in the model definition. Could it be the case?

I use torch 0.4

Thanks in advance.

optimizer error

Thanks for your work. I'm new to PyTorch, and an error comes up when I run the train.py script:

Traceback (most recent call last):
  File "train.py", line 275, in <module>
    main()
  File "train.py", line 85, in main
    weight_decay=args.weight_decay)
  File "/usr/local/lib/python2.7/dist-packages/torch/optim/sgd.py", line 56, in __init__
    super(SGD, self).__init__(params, defaults)
  File "/usr/local/lib/python2.7/dist-packages/torch/optim/optimizer.py", line 61, in __init__
    raise ValueError("can't optimize a non-leaf Variable")
ValueError: can't optimize a non-leaf Variable

It looks like the problem is related to the modified implementation of SGD. Would you mind giving me a hint on how to deal with this? Thanks~

THCudaCheck FAIL

Hi @AlfredXiangWu ,

my training datafile like this:
/home/jonanza/datasets/casia/CASIA__144/3599667/037.png 0
/home/jonanza/datasets/casia/CASIA__144/1466221/082.png 1
/home/jonanza/datasets/casia/CASIA__144/1466221/044.png 1
...

and training ...
it broke here:
Epoch: [0][3500/3568] Time 0.108 (0.112) Data 0.000 (0.000) Loss 8.8549 (9.2667) Prec@1 0.000 (0.000) Prec@5 0.000 (0.008)

/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [15,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [16,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [18,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [20,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [21,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [22,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [23,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [26,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [27,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [28,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [29,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 279, in <module>
    main()
  File "train.py", line 136, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "train.py", line 175, in train
    print(loss.data[0], input.size(0))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1502006348621/work/torch/lib/THC/generic/THCStorage.c:32

I searched for the problem, and people said an index was out of range. I have no idea about this Orz
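
The assertion `t >= 0 && t < n_classes` in the log above fires when a target label falls outside the range [0, num_classes). A quick way to check a list file for this is sketched below, assuming each line ends with an integer label as in the sample above. The helper name `find_bad_labels` is illustrative and not part of this repository.

```python
def find_bad_labels(lines, num_classes):
    """Return (line_number, label) pairs whose label is outside [0, num_classes)."""
    bad = []
    for number, line in enumerate(lines, 1):
        parts = line.split()
        if not parts:
            continue
        # The label is assumed to be the last whitespace-separated field.
        label = int(parts[-1])
        if label < 0 or label >= num_classes:
            bad.append((number, label))
    return bad
```

Because labels start at 0, --num_classes has to be at least the largest label plus one. An empty result from this check rules out the list file as the source of the assertion.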

lfw evaluation code

Thanks for your source code.

could you give me the information that you use lfw evaluation code?

restore with worse performance

I have a problem with restoring a model. When I restore a Light CNN model, the loss and accuracy in the first epoch are still good and match the performance of the saved model. But from the second epoch on, performance gets much worse.

The PyTorch website describes two ways to save and restore a model. I checked the code that saves and restores the model, and it uses the first way (torch.save(the_model.state_dict(), PATH)). Do you think the problem comes from this? Thank you!

Best regards,
Ming

Example of lfw training and validation list

Hi:

Could you please provide the list file for lfw training and validation. Many thanks.

pytorch model can't convert to caffe model

I used the pytorch2caffe tool to convert the model, but it failed.

m = LightCNN_29Layers_v2(num_classes=80013)
checkpoint = torch.load('LightCNN_29Layers_V2_checkpoint.pth.tar')
m.load_state_dict(checkpoint['state_dict'])
m.eval()

input_var = Variable(torch.rand(1, 1, 128, 128))
#input = torch.zeros(1, 1, 128, 128)
#input_var = torch.autograd.Variable(input, volatile=True)
output_var = m(input_var)
plot_graph(output_var, os.path.join(os.getcwd(), 'lightcnn.png'))

pytorch2caffe(input_var, output_var,
              os.path.join(os.getcwd(), 'lightcnn_v2.prototxt'),
              os.path.join(os.getcwd(), 'lightcnn_v2.caffemodel'))

# plot graph to png
plot_graph(output_var, os.path.join(os.getcwd(), 'lightcnn.png'))

The error is

    plot_graph(output_var, os.path.join(os.getcwd(), 'lightcnn.png'))
  File "~/my_test/pytorch2caffe/pytorch2caffe.py", line 379, in plot_graph
    add_nodes(top_var.grad_fn)
AttributeError: 'tuple' object has no attribute 'grad_fn'

thanks for your attention :)

Validation Set

Where can I get or generate the validation list required for the training? I'm trying to train the network on MS-Celeb-1M.
Also, is the pretrained model trained on the Full Image Thumbnails, FaceCropped, or FaceAligned dataset of MS-Celeb-1M?

Can not extract tarred pretrained model

Hi,
I downloaded your pre-trained models from Google Drive and ran the following command to extract them. I am not able to extract the models and get the following errors:

$ tar -xvf LightCNN_9Layers_checkpoint.pth.tar 
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors

Could you check this issue?

How do you achieve 67ms for lightcnn-9 on i7-4790?

How can I test lightcnn-9 in CPU mode?