
transfer_learning_tutorial's Introduction

Transfer Learning Tutorial

A guide to train the inception-resnet-v2 model in TensorFlow. Visit here for more information.

FAQ:

Q: Why does my evaluation code give such poor performance although my training seems to be fine?

A: This could be due to an issue with how batch_norm is updated during training in newer versions of TF, although I've not had the chance to investigate it properly. However, some users have mentioned that setting is_training=True in the eval code makes the model work exactly as expected. You should try this and see if it works for you.

For more information, please see this thread: #11
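To see why stale batch_norm statistics hurt evaluation: during training, batch_norm accumulates moving averages of the batch mean and variance, and at eval time (is_training=False) it normalizes with those accumulated statistics instead of the batch's own. If the update ops never run, the moving statistics stay near their initial values and eval accuracy collapses. A minimal sketch of the moving-average update rule (plain Python, illustrative only, not the TF implementation):

```python
def update_moving_stats(moving_mean, moving_var, batch_mean, batch_var, decay=0.997):
    """One batch-norm moving-average update step, as performed during training.

    At eval time, batch norm normalizes with moving_mean/moving_var instead of
    the batch statistics, so these must have been properly updated.
    """
    new_mean = decay * moving_mean + (1 - decay) * batch_mean
    new_var = decay * moving_var + (1 - decay) * batch_var
    return new_mean, new_var

# After many updates, the moving stats converge toward the data statistics:
m, v = 0.0, 1.0  # typical initial values (zero mean, unit variance)
for _ in range(5000):
    m, v = update_moving_stats(m, v, batch_mean=2.0, batch_var=4.0)
```

With too few updates (or none at all), `m` and `v` remain far from the true statistics, which is exactly the situation the is_training=True workaround papers over.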

Q: How do I only choose to fine-tune certain layers instead of all the layers?

A: By default, if you do not specify an argument for variables_to_train in the function create_train_op (as seen in the train_flowers.py file), this argument is set to None and all layers are trained. If you want to fine-tune only certain layers, you have to pass a list of variables to the variables_to_train argument. But you may ask, "how do I know the variable names of the model?" One simple way is to run this code within the graph context:

with tf.Graph().as_default() as graph:
    ...  # after you have constructed the model in the graph
    for var in tf.trainable_variables():
        print(var)

You will see the exact variable names that you can choose to fine-tune.

For more information, you should visit the documentation.
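Once you know the variable names, selecting which ones to fine-tune usually reduces to filtering by scope prefix before passing the result to variables_to_train. The filtering itself is just string matching; a self-contained sketch (plain Python, with hypothetical variable names standing in for tf.trainable_variables()):

```python
def select_by_scopes(variable_names, scopes):
    """Return only the variables whose name starts with one of the given scopes.

    In the real code, you would apply this to tf.trainable_variables() and pass
    the resulting list as variables_to_train to create_train_op.
    """
    return [v for v in variable_names if any(v.startswith(s) for s in scopes)]

# Hypothetical names, mimicking the InceptionResnetV2 variable layout:
names = [
    "InceptionResnetV2/Conv2d_1a_3x3/weights",
    "InceptionResnetV2/Logits/Logits/weights",
    "InceptionResnetV2/Logits/Logits/biases",
]
# Fine-tune only the final logits layer:
to_train = select_by_scopes(names, ["InceptionResnetV2/Logits"])
```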


Q: Why is my code trying to restore variables like InceptionResnetV2/Repeat_1/block17_20/Conv2d_1x1/weights/Adam_1 when they are not found in the .ckpt file?

A: The code is no longer trying to restore variables from the .ckpt file, but rather from the log directory where the checkpoints of your previous training are stored. This error happens when you have changed the code but did not remove the previous log directory, so the Supervisor attempts to restore a checkpoint from your previous training, resulting in a mismatch of variables.

Solution: Simply remove your previous log directory and run the code again. This applies to both your training file and your evaluation file. See this issue for more information.


Q: Why is my loss performing so poorly after I updated the loss function from slim.losses.softmax_cross_entropy to tf.losses.softmax_cross_entropy?

A: The positions of the arguments for the one-hot labels and the predictions have changed, resulting in the wrong loss being computed. This happens if you're using an older version of the repo; I have since updated the losses to tf.losses and accounted for the change in argument positions.

Solution: git pull the master branch of the repository to get the updates.
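To see why swapped arguments silently corrupt the loss: softmax cross-entropy is not symmetric in its two inputs. One argument is treated as one-hot labels and the other as raw logits that get softmaxed, so exchanging them produces a finite but meaningless value rather than an error. A minimal reimplementation (plain Python, illustrative only) shows the asymmetry:

```python
import math

def softmax_cross_entropy(onehot_labels, logits):
    """Cross-entropy of softmax(logits) against one-hot labels."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(y * math.log(p) for y, p in zip(onehot_labels, probs))

labels = [0.0, 1.0, 0.0]
logits = [1.0, 4.0, 1.0]
right = softmax_cross_entropy(labels, logits)  # small loss: confident, correct
wrong = softmax_cross_entropy(logits, labels)  # swapped: inflated, meaningless
```

Because the swapped call still returns a number, the only symptom is a loss curve that refuses to behave, which is what the question above describes.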


Q: Why does the evaluation code fail to restore the checkpoint variables I trained and saved? My training works correctly but the evaluation code crashes.

A: There was an error in the code that mistakenly used the restore saver (the one that excludes certain variables from being restored) to also save the model variables after training completed. Because this saver excludes some variables, those excluded variables are not saved when it is used to save the model at the end of training. Instead, the code should have used the Supervisor's internal saver to save the model variables at the end, since that saver covers all trained variables.

Usually, this does not occur if you have trained your model for more than 10 minutes, since the Supervisor's saver saves the variables every 10 minutes. However, if you end your training before 10 minutes, the wrong saver would have saved only some of the trained variables, rather than all of them (which is what we want).

Solution: git pull the master branch of the repository to get the updates. I have changed the training code to make the supervisor save the variables at the end of the training instead.

transfer_learning_tutorial's People

Contributors: kwotsin

transfer_learning_tutorial's Issues

Number of Steps per Epoch not int

Hi,
using your train_flowers.py script with my dataset, an error occurred: num_steps_per_epoch could not be converted from float to int.

I therefore explicitly cast num_batches_per_epoch to int, so that line 186 looked like this:
num_batches_per_epoch = int(dataset.num_samples / batch_size)
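The integer cast above works but silently drops the final partial batch when the dataset size is not a multiple of the batch size. If you want every sample visited once per epoch, round up instead. A small sketch of both options (plain Python; the sample counts are illustrative):

```python
import math

def batches_per_epoch(num_samples, batch_size, drop_remainder=True):
    """Number of training steps per epoch.

    With drop_remainder=False, the last partial batch is counted as a step
    (this requires the input pipeline to allow smaller final batches).
    """
    if drop_remainder:
        return num_samples // batch_size  # floor division: skips the remainder
    return math.ceil(num_samples / batch_size)  # round up: covers every sample

steps = batches_per_epoch(3320, 32)                            # floor
steps_all = batches_per_epoch(3320, 32, drop_remainder=False)  # ceil
```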

tf.train.import_meta_graph

Hello, I'm trying to use the same tutorial on my own dataset with the MobileNet network. The checkpoint files are named mobilenet_v1_0.75_128.ckpt.data-00000-of-00001, mobilenet_v1_0.75_128.ckpt.index, and mobilenet_v1_0.75_128.ckpt.meta, and when I pass one of these files to saver.restore, I get an error. How can I use tf.train.import_meta_graph in this code to load the meta graph? Thanks.

Not found: Key InceptionResnetV2/Repeat_1/block17_20/Conv2d_1x1/weights/Adam_1

Hey, thanks for your nice work.

Your code is very clear, and I think there is no problem running it.
However, when restoring weights from the checkpoint file, I encountered:

2017-05-12 21:15:45.857973: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key InceptionResnetV2/Repeat_1/block17_20/Conv2d_1x1/weights/Adam_1 not found in checkpoint
         [[Node: save_1/RestoreV2_1287 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2_1287/tensor_names, save_1/RestoreV2_1287/shape_and_slices)]]
2017-05-12 21:15:45.858128: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key InceptionResnetV2/Repeat/block35_10/Branch_2/Conv2d_0a_1x1/weights/Adam_1 not found in checkpoint
2017-05-12 21:15:45.858508: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key InceptionResnetV2/Repeat_1/block17_20/Conv2d_1x1/weights/Adam_1 not found in checkpoint
         [[Node: save_1/RestoreV2_1287 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2_1287/tensor_names, save_1/RestoreV2_1287/shape_and_slices)]]
2017-05-12 21:15:45.858674: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key InceptionResnetV2/Repeat_1/block17_18/Branch_1/Conv2d_0a_1x1/BatchNorm/beta/Adam_1 not found in checkpoint
2017-05-12 21:15:45.858837: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key InceptionResnetV2/Repeat/block35_10/Branch_2/Conv2d_0b_3x3/BatchNorm/beta/Adam not found in checkpoint
2017-05-12 21:15:45.858861: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key InceptionResnetV2/Repeat/block35_10/Branch_2/Conv2d_0b_3x3/BatchNorm/beta/Adam_1 not found in checkpoint
2017-05-12 21:15:45.861648: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key InceptionResnetV2/Repeat_1/block17_20/Conv2d_1x1/weights/Adam_1 not found in checkpoint
         [[Node: save_1/RestoreV2_1287 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/RestoreV2_1287/tensor_names, save_1/RestoreV2_1287/shape_and_slices)]]

I also used the tools to display all weights in the checkpoint file and did not find layers such as XXX/weights/Adam_1, nor are they in the network definition file inception_resnet_v2.py. So I guess the weights XXX/weights/Adam_1 may be created by the learning algorithm in train_flowers.py:

#Now we can define the optimizer that takes on the learning rate
optimizer = tf.train.AdamOptimizer(learning_rate = lr)

Thanks for help!

Steven

Inconstant Performance with Adam

Dear GIT owner, thanks for the excellent tutorial.

  1. My issue is that I got relatively low performance (23.2%) on the validation set with exactly your code. However, if I switch from Adam to the GradientDescent optimiser, this problem vanishes. Do you have any clue why this is happening?
  2. If I set is_training=True during testing in:
     logits, end_points = inception.inception_resnet_v2(
         images,
         num_classes = dataset.num_classes,
         is_training = True)
     then the problem vanishes. But I don't think that is appropriate.
  3. My system is Ubuntu 14.04 with TF 1.1.0

After training my newly added layers, how do I fine-tune them together with InceptionResnetV2?

I added some layers after InceptionResnetV2 and put them in variables_to_train. Then I trained only the new layers, and the weights in InceptionResnetV2 did not change. Everything is fine so far.
However, for the next step I want to turn down the learning rate and train the InceptionResnetV2 layers together with my newly added layers. At this point I got the NOT FOUND error, like "InceptionResnetV2/Repeat_1/block17_20/Conv2d_1x1/weights/Adam_1 when they are not found in the .ckpt file".
I have read the Q&A, but I need to fine-tune in the way I described. What should I do next?

UnknownError: Input/Output Error

Hi! I was following your guide on transfer learning since I wanted to implement it on my own dataset.
When I ran it, after some steps, I persistently get this error.
Kindly let me know how to resolve it.

/usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

WARNING:tensorflow:From train_alz.py:207: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From train_alz.py:226: streaming_accuracy (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.metrics.accuracy. Note that the order of the labels and predictions arguments has been switched.
WARNING:tensorflow:From train_alz.py:257: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-03-13 12:58:03.471697: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-13 12:58:03.472282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2018-03-13 12:58:03.472315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-13 12:58:03.749993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10774 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from /content/drive/app/log/model.ckpt-202
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Saving checkpoint to path /content/drive/app/log/model.ckpt
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Epoch 1.0/1
INFO:tensorflow:Current Learning Rate: 0.0002
INFO:tensorflow:Current Streaming Accuracy: 0.0
logits:
[[ 0.390966 -0.45528865]
[ 0.19611153 -0.2771586 ]
[ 0.3723789 -0.26325732]
[ 0.25104412 -0.26455274]
[ 0.18112303 -0.13101362]
[ 0.29722977 -0.26156056]
[ 0.36001077 -0.2920575 ]
[ 0.19284382 -0.29659665]
[ 0.3931404 -0.3488116 ]
[ 0.2608421 -0.16697197]
[ 0.09929308 -0.10392085]
[ 0.34856573 -0.14461401]
[ 0.43742967 -0.34236282]
[ 0.24543124 -0.3380828 ]
[ 0.30201262 -0.35735092]
[ 0.452847 -0.3696723 ]
[ 0.26588187 -0.32089466]
[ 0.2634747 -0.24695566]
[ 0.3682 -0.3491458 ]
[ 0.32025513 -0.31040746]
[ 0.2997362 -0.2824477 ]
[ 0.13588724 -0.07790426]
[ 0.15063035 -0.09512474]
[ 0.24009264 -0.2981124 ]
[ 0.13652855 -0.20672432]
[ 0.3162898 -0.32216516]
[ 1.388223 -1.8755943 ]
[ 0.2017616 -0.23009525]
[ 0.16159095 -0.3134655 ]
[ 0.33364242 -0.274227 ]
[ 0.3692241 -0.3140025 ]
[ 0.37961924 -0.29509106]]
Probabilities:
[[0.6997809 0.30021912]
[0.6161575 0.38384253]
[0.65376633 0.34623367]
[0.6261176 0.37388238]
[0.5774067 0.4225933 ]
[0.63617265 0.3638274 ]
[0.65747637 0.3425236 ]
[0.6199746 0.3800254 ]
[0.6774225 0.32257745]
[0.60535157 0.39464843]
[0.5506294 0.44937062]
[0.6208552 0.37914476]
[0.6856354 0.31436458]
[0.6418756 0.3581244 ]
[0.6591174 0.34088257]
[0.6947709 0.30522916]
[0.64262515 0.35737482]
[0.6249073 0.37509266]
[0.6720223 0.3279777 ]
[0.6526397 0.34736034]
[0.64156973 0.35843024]
[0.55324525 0.44675475]
[0.5611314 0.43886855]
[0.63139474 0.36860523]
[0.5849804 0.4150195 ]
[0.6544041 0.3455959 ]
[0.9631665 0.03683354]
[0.606317 0.39368302]
[0.6165799 0.38342014]
[0.6474546 0.35254538]
[0.66445845 0.33554155]
[0.662557 0.33744293]]
predictions:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Labels:
: [0 0 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 0]
INFO:tensorflow:global step 212: loss: 1.1660 (10.36 sec/step)
INFO:tensorflow:global step 213: loss: 1.0437 (36.21 sec/step)
INFO:tensorflow:global step 214: loss: 1.0660 (2.66 sec/step)
INFO:tensorflow:global step 215: loss: 1.1729 (2.72 sec/step)
INFO:tensorflow:global step 216: loss: 1.0920 (2.65 sec/step)
INFO:tensorflow:global step 217: loss: 1.0246 (2.77 sec/step)
(... per-step log lines for steps 218-409 omitted: loss fluctuates between ~0.86 and ~1.31 at ~2.6 sec/step ...)
INFO:tensorflow:Saving checkpoint to path /content/drive/app/log/model.ckpt
INFO:tensorflow:global_step/sec: 0.36667
INFO:tensorflow:global step 410: loss: 1.0421 (2.69 sec/step)
INFO:tensorflow:global step 411: loss: 1.0558 (2.64 sec/step)
INFO:tensorflow:global step 412: loss: 1.1632 (2.63 sec/step)
INFO:tensorflow:global step 413: loss: 1.0283 (3.06 sec/step)
INFO:tensorflow:global step 414: loss: 1.0561 (2.69 sec/step)
INFO:tensorflow:global step 415: loss: 0.9693 (2.69 sec/step)
INFO:tensorflow:global step 416: loss: 1.0304 (2.68 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnknownError'>, /content/drive/app/images/train/fmri_train_00001-of-00001.tfrecord; Input/output error
[[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
INFO:tensorflow:global step 417: loss: 0.9838 (2.76 sec/step)
INFO:tensorflow:global step 418: loss: 0.7267 (1.44 sec/step)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
target_list, status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_3_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_UINT8, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch/fifo_queue, batch/n)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 990, in managed_session
yield sess
File "train_alz.py", line 285, in run
loss, _ = train_step(sess, train_op, sv.global_step)
File "train_alz.py", line 243, in train_step
total_loss, global_step_count, _ = sess.run([train_op, global_step, metrics_op])
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_3_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_UINT8, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch/fifo_queue, batch/n)]]

Caused by op 'batch', defined at:
File "train_alz.py", line 298, in <module>
run()
File "train_alz.py", line 184, in run
images, _, labels = load_batch(dataset, batch_size=batch_size)
File "train_alz.py", line 168, in load_batch
allow_smaller_final_batch = True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/input.py", line 989, in batch
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/input.py", line 761, in _batch
dequeued = queue.dequeue_up_to(batch_size, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 527, in dequeue_up_to
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2557, in _queue_dequeue_up_to_v2
component_types=component_types, timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): FIFOQueue '_3_batch/fifo_queue' is closed and has insufficient elements (requested 32, current size 0)
[[Node: batch = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_UINT8, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch/fifo_queue, batch/n)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train_alz.py", line 298, in <module>
run()
File "train_alz.py", line 294, in run
sv.saver.save(sess, sv.save_path, global_step = sv.global_step)
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
enqueue_callable()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1259, in _single_operation_run
None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: /content/drive/app/images/train/fmri_train_00001-of-00001.tfrecord; Input/output error
[[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]

I followed your transfer learning guide, so I would like to know why this error is occurring. Your help is appreciated!

loss

Hi,
are you actually using `loss`? It is never used after this declaration:
loss = tf.losses.softmax_cross_entropy(onehot_labels = one_hot_labels, logits = logits)

Do we need to normalize the image range to [0,1]?

Thanks for your great project; it is very useful for me. I would like to ask you three questions about it:

  1. Do we need to normalize the input image to [0,1] (or zero mean, unit variance) as a pre-processing step? I checked your project and it does not do this.

  2. In the evaluation phase, why do we need to run so many epochs? I think we could just take the last checkpoint, feed all images in the validation set, and compute the accuracy.

  3. Could you provide simple inference code that takes an image path as input and shows the prediction, like the evaluation code does?

Great job
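Regarding question 1: if the tutorial relies on slim's Inception-style preprocessing, pixels are typically mapped into [-1, 1] rather than [0, 1] (scale to [0, 1], subtract 0.5, multiply by 2). A minimal, framework-free sketch of that arithmetic; the function name is my own:

```python
def inception_normalize(pixel):
    """Map a [0, 255] pixel value into [-1, 1], as Inception-style
    preprocessing commonly does (scale, center, rescale)."""
    value = pixel / 255.0   # now in [0, 1]
    value = value - 0.5     # now in [-0.5, 0.5]
    return value * 2.0      # now in [-1, 1]

print(inception_normalize(0))    # -1.0
print(inception_normalize(255))  # 1.0
```

So explicit [0,1] normalization is usually unnecessary as long as the same preprocessing function is applied at both training and inference time.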

Use my own data to finetune a model to predict?

Hello, I'm a beginner in deep learning. I have fine-tuned Inception-v3 on my own data, and now I'd like to use the fine-tuned model for prediction. What should I do? The input would be a .jpg image rather than .tfrecord files, with no label attached (unlike the .tfrecord files), and the output should be the predicted class of the image. Could you give me some suggestions?

Where is the operation that fetches batch data from the queue in the training and evaluation code?

Hey, your code helped me a lot, and thanks again.

I understand that the load_batch function in train_flowers.py creates a queue to produce batch data for training.

However, in the loop:

for step in xrange(num_steps_per_epoch * num_epochs):

I cannot find any operation that explicitly fetches another batch. I guess it happens inside train_op automatically. Is that right?

For eval_flowers.py, I have the same question.

Can you give any explanation?

Thanks!

Steven
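The usual answer to the question above: train_op depends, through the dataflow graph, on the batch tensors, so every sess.run(train_op) implicitly dequeues a fresh batch; no explicit fetch appears in the loop. A rough, framework-free analogy (my own toy code, not TF's mechanism), where evaluating the "train step" pulls from the queue as a side effect of the dependency:

```python
import collections

# A toy stand-in for the input queue: each evaluation of the "train step"
# pulls the next batch because the step *depends on* the batch source,
# not because the loop fetches it explicitly.
queue = collections.deque([[1, 2], [3, 4], [5, 6]])

def load_batch():
    return queue.popleft()   # analogue of the QueueDequeue op

def train_step():
    batch = load_batch()     # the dependency triggers the dequeue
    return sum(batch)        # stand-in for the actual training update

for step in range(3):
    print(train_step())      # each iteration consumes a new batch
```

In real TF 1.x the same thing happens lazily inside the session: fetching any tensor downstream of the dequeue op runs the dequeue.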

Training Error

Hello man,
I tried to train on the flowers dataset based on your code, but I ran into this problem:
`phong@Storm:~/TransferLearning/transfer_learning_tutorial-master$ python train_flowers.py
2017-06-16 13:25:07.406129: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-16 13:25:07.406999: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-16 13:25:07.407158: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-16 13:25:07.587690: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-06-16 13:25:07.588069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7085
pciBusID 0000:02:00.0
Total memory: 7.92GiB
Free memory: 7.36GiB
2017-06-16 13:25:07.588087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-06-16 13:25:07.588094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-06-16 13:25:07.588103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0)
INFO:tensorflow:Restoring parameters from /home/phong/inception_resnet_v2_2016_08_30.ckpt
2017-06-16 13:25:13.491820: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/phong/inception_resnet_v2_2016_08_30.ckpt

`

In the end, the error is:
NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./inception_resnet_v2_2016_08_30.ckpt

Would you mind helping me solve this?

Input and output of the trained model

Hello guys, I followed your code to train my model and this is the result.
screenshot from 2018-02-06 18-41-22
I have one question though: what are the input and output nodes of the trained model? I need those two pieces of information to convert it to another format in order to run the trained model on a mobile device.
Can anyone help me?

Negative Dimension Size

Hey, I successfully created the tfrecords dataset. My dataset consists of 80 x 80 x 3 images and 2 classes. I also edited train_flowers.py to reflect those changes, i.e. setting image_size = 80 and num_classes = 2.

When I run the code, I get the following error:

I completed dataset information.
Successfully split the dataset
Successfully loaded the batchset
Traceback (most recent call last):
  File "train_flowers.py", line 300, in <module>
    run()
  File "train_flowers.py", line 195, in run
    logits, end_points = inception_resnet_v2(images, num_classes = dataset.num_classes, is_training = True)
  File "/path/to/inception_resnet_v2.py", line 197, in inception_resnet_v2
    scope='Conv2d_1a_3x3')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 181, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 131, in avg_pool2d
    outputs = layer.apply(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 492, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 441, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/pooling.py", line 276, in call
    data_format=utils.convert_data_format(self.data_format, 4))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 1741, in avg_pool
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 48, in _avg_pool
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2508, in create_op
    set_shapes_for_outputs(ret)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1873, in set_shapes_for_outputs
    shapes = shape_func(op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1823, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 610, in call_cpp_shape_fn
    debug_python_shape_fn, require_shape_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 676, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Negative dimension size caused by subtracting 5 from 3 for 'InceptionResnetV2/AuxLogits/Conv2d_1a_3x3/AvgPool' (op: 'AvgPool') with input shapes: [?,3,3,1088].

It looks like a dimension issue, but I can't figure out where it's creeping in from. Have you experienced this before?

Karan
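In case it helps anyone hitting this: the auxiliary-logits branch applies a 5x5 average pool, and with 80x80 inputs the feature map at that point has shrunk to 3x3, which is exactly the "subtracting 5 from 3" in the error. A back-of-the-envelope calculation of the VALID-padded downsampling (the layer list is reconstructed from the Inception-ResNet-v2 stem and should be treated as an approximation; SAME-padded layers that preserve size are omitted):

```python
def valid_out(size, kernel, stride):
    """Spatial output size of a VALID-padded conv/pool layer."""
    return (size - kernel) // stride + 1

def size_at_aux_branch(image_size):
    """Approximate feature-map size where AuxLogits' 5x5 AvgPool runs,
    following the size-reducing VALID layers of the stem."""
    s = image_size
    for kernel, stride in [(3, 2),   # Conv2d_1a_3x3
                           (3, 1),   # Conv2d_2a_3x3
                           (3, 2),   # MaxPool_3a_3x3
                           (3, 1),   # Conv2d_4a_3x3
                           (3, 2),   # MaxPool_5a_3x3
                           (3, 2)]:  # Mixed_6a reduction
        s = valid_out(s, kernel, stride)
    return s

print(size_at_aux_branch(299))  # 17 -> the 5x5 aux pool fits
print(size_at_aux_branch(80))   # 3  -> 5x5 pool fails: negative dimension
```

The usual workarounds are to use larger inputs (e.g. the default 299) or, if your copy of the model function supports it, to disable the auxiliary logits branch (some slim versions expose a create_aux_logits argument for this).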

Try to do Validation while Training

Hello,
It would be beneficial to evaluate the model on a validation set while training, so that we can get a better idea of whether the model is overfitting. I modified the code as below: I reused all the variables while only changing the input to the validation images, and then ran sess.run on both the training and validation accuracies in the training-step function. However, in the output the validation accuracy is always 0 while the training accuracy looks fine:
INFO:tensorflow:global step 18: loss: 1.1491 Train accuracy: 0.5481 Validation accuracy: 0.0000 (24.39 sec/step)

I might have chosen a clumsy approach; is there a simpler or better way to do it? I was thinking about using a validation monitor, but there is no good example of how to integrate it into this code. Any help is appreciated; the following is how I modified the code.

predictions = tf.argmax(end_points['Predictions'], 1)
probabilities = end_points['Predictions']
accuracy, accuracy_update = tf.contrib.metrics.streaming_accuracy(predictions, labels)
metrics_op = tf.group(accuracy_update, probabilities)
tf.get_variable_scope().reuse_variables()
with slim.arg_scope(inception_arg_scope()):
    logits_validation, end_points_validation = inception_v3(raw_images_validation, num_classes = dataset_validation.num_classes, reuse = True, is_training = False)
predictions_validation = tf.argmax(end_points_validation['Predictions'], 1)
probabilities_validation = end_points_validation['Predictions']
accuracy_validation, accuracy_update_validation = tf.contrib.metrics.streaming_accuracy(predictions_validation, labels_validation)

In the training step function:
total_loss, global_step_count, accuracy_value_train, accuracy_value_validation, _ = sess.run([train_op, global_step, accuracy, accuracy_validation, metrics_op])
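One thing worth checking in the snippet above: metrics_op groups only the training accuracy_update, and the sess.run call fetches the validation accuracy *value* but never runs accuracy_update_validation. A streaming metric whose update op is never executed stays at 0. A framework-free sketch of that value/update-op split (my own toy class, not TF's implementation):

```python
class StreamingAccuracy:
    """Toy streaming accuracy: `value` only changes when `update` is
    explicitly run, mirroring TF's (value, update_op) metric pair."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, labels):
        self.correct += sum(p == l for p, l in zip(predictions, labels))
        self.total += len(labels)

    def value(self):
        return self.correct / self.total if self.total else 0.0

acc = StreamingAccuracy()
print(acc.value())            # 0.0 -- the update op never ran, as in the issue
acc.update([1, 0, 1], [1, 0, 0])
print(acc.value())            # ~0.667 once the update op is actually executed
```

So adding accuracy_update_validation to the fetched ops (or to metrics_op) would likely make the validation accuracy move.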

Different learning rate of different layers

Question 1: In the fine-tuning process, it seems that the excluded layers or newly added custom layers should have faster learning rates than the layers with restored parameters. How can we choose different learning rates for different layers?
Question 2: I am trying to modify this tutorial to use the Inception model. However, the Inception model downloaded from the Google Research blog does not have a function similar to "inception_resnet_v2_arg_scope". That function seems to handle normalization and regularization, but there is no such part in Inception. So the following code needs to be changed, but I am not sure how:
with slim.arg_scope(inception_resnet_v2_arg_scope()):
    logits, end_points = inception_resnet_v2(images, num_classes = dataset.num_classes, is_training = True)

Thanks a lot!
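On question 1: slim's create_train_op accepts a gradient_multipliers argument (a dict mapping variable names to scaling factors) in the versions I have seen, which effectively gives different layers different learning rates; check your slim version before relying on it. The underlying idea is just per-group scaling of the update, sketched framework-free here with hypothetical variable names:

```python
def sgd_step(params, grads, lr_by_group, group_of):
    """One SGD update where each parameter group gets its own learning
    rate (e.g. new layers learn faster than restored ones)."""
    return {name: params[name] - lr_by_group[group_of[name]] * grads[name]
            for name in params}

# Illustrative names: one restored backbone weight, one new logits weight.
params = {'conv1/w': 1.0, 'logits/w': 1.0}
grads = {'conv1/w': 0.1, 'logits/w': 0.1}
group_of = {'conv1/w': 'restored', 'logits/w': 'new'}
lr_by_group = {'restored': 0.001, 'new': 0.01}

print(sgd_step(params, grads, lr_by_group, group_of))
# the restored weight barely moves; the new weight moves 10x as far
```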

ckpt file

It is generating only .meta, .data and .index files.
How can I get just one .ckpt file?

NotFoundError (see above for traceback): Key BatchNormalization/moving_mean not found in checkpoint

root@kali:~/Downloads/Chat-Bot-Emotion-Recognition-History-Recollection-master# python chatbot.py
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
/usr/local/lib/python2.7/dist-packages/sklearn/base.py:251: UserWarning: Trying to unpickle estimator TfidfTransformer from version 0.18.1 when using version 0.20.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
/usr/local/lib/python2.7/dist-packages/sklearn/base.py:251: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.18.1 when using version 0.20.2. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tflearn/initializations.py:119: init (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tflearn/objectives.py:66: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-01-02 00:37:07.265117: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-01-02 00:37:12.451769: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BatchNormalization/moving_mean not found in checkpoint
Traceback (most recent call last):
File "chatbot.py", line 77, in <module>
emotion.start()
File "/root/Downloads/Chat-Bot-Emotion-Recognition-History-Recollection-master/emotion_recognition.py", line 84, in start
self.model.load('current_models/model_resnet_emotion-42000')
File "/usr/local/lib/python2.7/dist-packages/tflearn/models/dnn.py", line 308, in load
self.trainer.restore(model_file, weights_only, **optargs)
File "/usr/local/lib/python2.7/dist-packages/tflearn/helpers/trainer.py", line 490, in restore
self.restorer.restore(self.session, model_file)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1775, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key BatchNormalization/moving_mean not found in checkpoint
[[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_BOOL, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]

Caused by op u'save_1/RestoreV2', defined at:
File "chatbot.py", line 77, in <module>
emotion.start()
File "/root/Downloads/Chat-Bot-Emotion-Recognition-History-Recollection-master/emotion_recognition.py", line 82, in start
clip_gradients=0.)
File "/usr/local/lib/python2.7/dist-packages/tflearn/models/dnn.py", line 65, in __init__
best_val_accuracy=best_val_accuracy)
File "/usr/local/lib/python2.7/dist-packages/tflearn/helpers/trainer.py", line 147, in __init__
allow_empty=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1311, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1320, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1357, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 809, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 448, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 860, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1458, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): Key BatchNormalization/moving_mean not found in checkpoint
[[Node: save_1/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_BOOL, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2/tensor_names, save_1/RestoreV2/shape_and_slices)]]

root@kali:~/Downloads/Chat-Bot-Emotion-Recognition-History-Recollection-master#

Why is the result zero on another dataset? Final Accuracy: 0.0

I converted the flowers dataset to tfrecords as your GitHub shows, and training works correctly.
However, when I change the dataset to another one (17flowers), with the following structure:

flowers/
    flower_photos/
        0/
            ....jpg
            ....jpg
            ....jpg
        1/
            ....jpg
        2/
            ....jpg
        3/
            ....jpg
        ...
        16/
            ....jpg

the tfrecords are generated correctly.
Then I modified the relevant directories to adapt the code, and also changed num_classes = 17.
However, the result is as follows:

/usr/bin/python2.7 /home/cr/PycharmProjects/transferLearning/train_flowers.py
2017-07-19 22:25:03.216673: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 22:25:03.216740: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 22:25:03.216760: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-19 22:25:03.581012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:03:00.0
Total memory: 11.90GiB
Free memory: 11.41GiB
2017-07-19 22:25:03.581046: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-07-19 22:25:03.581054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-07-19 22:25:03.581067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
INFO:tensorflow:Restoring parameters from ./preTrainModels/inception_resnet_v2_2016_08_30.ckpt
2017-07-19 22:25:27.233384: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1901 get requests, put_count=1100 evicted_count=1000 eviction_rate=0.909091 and unsatisfied allocation rate=1
2017-07-19 22:25:27.233436: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Saving checkpoint to path ./log/model.ckpt
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Final Loss: Tensor("softmax_cross_entropy_loss/value:0", shape=(), dtype=float32)
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Final Accuracy: 0.0
INFO:tensorflow:Finished training! Saving model to disk now.
INFO:tensorflow:global_step/sec: 0
Process finished with exit code 0

How can I resolve this problem? Thank you very much!

How to write the labels.txt?

When I run train_flowers.py,
there is an error:
Traceback (most recent call last):
File "train_flowers.py", line 33, in <module>
label, string_name = line.split(':')
ValueError: not enough values to unpack (expected 2, got 1)

I set up labels.txt like this:
1 : Yellowcroaker
2 : notOurFishCropper
3 : pomfret-cropped
4 : thunnus
What is wrong with it?
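For reference, the failing line splits each row on ':', so any line without a colon, most commonly a trailing blank line, raises exactly this "expected 2, got 1" error; the rows shown above themselves split fine. A slightly more forgiving parser (my own sketch, not the tutorial's code):

```python
def parse_labels(text):
    """Parse 'id:name' rows into a dict, skipping blank lines and
    tolerating spaces around the colon."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # a trailing blank line breaks the naive split(':')
        label, name = line.split(':', 1)
        labels[int(label)] = name.strip()
    return labels

print(parse_labels("1 : Yellowcroaker\n2 : pomfret-cropped\n\n"))
# {1: 'Yellowcroaker', 2: 'pomfret-cropped'}
```

It is also worth confirming that the label ids in labels.txt match how the tfrecord conversion numbered the classes (slim's flowers example starts at 0).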

labels of tfrecords parsed not correctly every now and then

Hi,
I built the tfrecords, created the new checkpoint, and evaluated it with the evaluation code. In the evaluation code I added some lines for saving labels, predictions, probabilities and so on, to be able to construct confusion matrices and other statistics. In doing so, I noticed that the labels (i.e. the ground truth) are sometimes misread (sometimes 3, sometimes 4 or 5 out of 176).

I used a custom dataset, and the evaluation set contains 176 images (44 for each class, since I have 4 classes). I know the tfrecords are built correctly, since I printed the labels from the tfrecords this way:

for example in tf.python_io.tf_record_iterator("path to .tfrecord"):
    result = tf.train.Example.FromString(example)
    a=result.features.feature['image/class/label'].int64_list.value
    print(a) 

and there are 44 labels for each class.

I found this quite similar issue https://github.com/tensorflow/tensorflow/issues/11363 but I didn't manage to understand what to change in the code.

Note that in the evaluation code I use batch_size=176 so that 1 step corresponds to 1 epoch, and that I set is_training=True as others suggested in a past issue in order to get reasonable accuracy (I also checked the labels with is_training=False, just to rule out a batch normalization problem, but the issue still occurs).

Thanks in advance to anyone who will help me.

Evaluating with is_training=True

Thank you for your FAQ about this! It confused me for several days. By the way, do you know why is_training must be set to True during evaluation?
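As I understand it (hedged, since the FAQ above says the root cause was never fully investigated): with is_training=False, batch norm normalizes with its stored moving mean/variance, and if those moving statistics were never properly updated during training they stay near their initial values (mean 0, variance 1), so eval activations get normalized with the wrong statistics. With is_training=True the batch's own statistics are used instead, which is why eval "works". A tiny numeric illustration:

```python
def batch_norm(xs, mean, var, eps=1e-5):
    """Normalize a list of activations with the given statistics."""
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]

xs = [10.0, 12.0, 14.0]  # activations with mean 12, far from 0

# is_training=True: normalize with the batch's own statistics
batch_mean = sum(xs) / len(xs)
batch_var = sum((x - batch_mean) ** 2 for x in xs) / len(xs)
normalized = batch_norm(xs, batch_mean, batch_var)
print(normalized)        # centered around 0, as the next layers expect

# is_training=False with *stale* moving statistics (initial mean=0, var=1):
stale = batch_norm(xs, 0.0, 1.0)
print(stale)             # badly off-scale values feed the next layers
```

The proper fix in TF 1.x slim is usually to make sure the UPDATE_OPS collection (which holds the moving-average updates) is run as part of the train op, rather than evaluating with is_training=True.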
