Hi, I am new to TensorFlow and am confused about the training loss. At the beginning of training, d_loss is almost always much larger than g_loss, as shown below, which differs from the results in dcgan.torch. Is this normal?
Epoch: [ 0] [ 0/3165] time: 2.6742, d_loss: 7.06172514, g_loss: 0.00106246
Epoch: [ 0] [ 1/3165] time: 4.7950, d_loss: 6.95885229, g_loss: 0.00125823
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 2487 get requests, put_count=2449 evicted_count=1000 eviction_rate=0.40833 and unsatisfied allocation rate=0.457579
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
Epoch: [ 0] [ 2/3165] time: 6.1750, d_loss: 7.25680256, g_loss: 0.00104984
Epoch: [ 0] [ 3/3165] time: 7.5547, d_loss: 6.89718437, g_loss: 0.00364288
Epoch: [ 0] [ 4/3165] time: 8.9374, d_loss: 5.45913506, g_loss: 0.02250301
Epoch: [ 0] [ 5/3165] time: 10.3308, d_loss: 7.75127983, g_loss: 0.00146924
Epoch: [ 0] [ 6/3165] time: 11.7197, d_loss: 4.75904989, g_loss: 0.04762752
Epoch: [ 0] [ 7/3165] time: 13.1084, d_loss: 5.15711403, g_loss: 0.03134135
Epoch: [ 0] [ 8/3165] time: 14.4916, d_loss: 5.35569286, g_loss: 0.04407354
Epoch: [ 0] [ 9/3165] time: 15.8697, d_loss: 4.73206615, g_loss: 0.05500766
Epoch: [ 0] [ 10/3165] time: 17.2533, d_loss: 3.20903492, g_loss: 0.34848747
Epoch: [ 0] [ 11/3165] time: 18.6306, d_loss: 8.54726505, g_loss: 0.00069389
Epoch: [ 0] [ 12/3165] time: 20.0077, d_loss: 1.97646499, g_loss: 2.43928814
Epoch: [ 0] [ 13/3165] time: 21.3865, d_loss: 8.04584980, g_loss: 0.00098451
Epoch: [ 0] [ 14/3165] time: 22.7670, d_loss: 1.93407261, g_loss: 2.32639980
Epoch: [ 0] [ 15/3165] time: 24.1502, d_loss: 8.04065609, g_loss: 0.00085019
Epoch: [ 0] [ 16/3165] time: 25.5354, d_loss: 2.01121569, g_loss: 4.33200264
Epoch: [ 0] [ 17/3165] time: 26.9249, d_loss: 5.53398705, g_loss: 0.01694447
Epoch: [ 0] [ 18/3165] time: 28.3111, d_loss: 1.31883585, g_loss: 5.00018692
Epoch: [ 0] [ 19/3165] time: 29.6976, d_loss: 5.65370369, g_loss: 0.01064641
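To check how often d_loss actually exceeds g_loss in output like the above, the log lines can be parsed with a short script. This is only a sketch: `parse_losses` and `count_d_larger` are hypothetical helper names, and the regular expression assumes the exact line format shown above.

```python
import re

# Matches the d_loss / g_loss fields of the progress lines shown above.
LINE_RE = re.compile(r"d_loss:\s*([0-9.eE+-]+),\s*g_loss:\s*([0-9.eE+-]+)")


def parse_losses(log_text):
    """Return a list of (d_loss, g_loss) pairs found in log_text."""
    pairs = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:  # allocator messages and tracebacks are skipped
            pairs.append((float(match.group(1)), float(match.group(2))))
    return pairs


def count_d_larger(pairs):
    """Count the steps where d_loss exceeds g_loss."""
    return sum(1 for d, g in pairs if d > g)


# Three lines copied from the log above.
sample = """\
Epoch: [ 0] [ 10/3165] time: 17.2533, d_loss: 3.20903492, g_loss: 0.34848747
Epoch: [ 0] [ 11/3165] time: 18.6306, d_loss: 8.54726505, g_loss: 0.00069389
Epoch: [ 0] [ 12/3165] time: 20.0077, d_loss: 1.97646499, g_loss: 2.43928814
"""
pairs = parse_losses(sample)
print(count_d_larger(pairs), "of", len(pairs), "steps have d_loss > g_loss")
```

Run over the first 20 steps, this kind of count shows that g_loss is larger at steps 12, 14, 16 and 18, so d_loss dominates on most steps but not all of them.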
In addition, training seems to keep going only while d_loss stays larger than g_loss and remains stable. However, a NaN error then suddenly appears partway through training:
Epoch: [ 0] [2344/3165] time: 3322.8894, d_loss: 1.39191413, g_loss: 0.75624585
Epoch: [ 0] [2345/3165] time: 3324.2772, d_loss: 1.60122275, g_loss: 0.36871552
Epoch: [ 0] [2346/3165] time: 3325.6905, d_loss: 1.57876384, g_loss: 0.70225191
Epoch: [ 0] [2347/3165] time: 3327.0963, d_loss: 1.39167571, g_loss: 0.59910935
Epoch: [ 0] [2348/3165] time: 3328.4929, d_loss: 1.43457556, g_loss: 0.60285681
Epoch: [ 0] [2349/3165] time: 3329.8979, d_loss: 1.47647548, g_loss: 0.66025651
Traceback (most recent call last):
File "main.py", line 59, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "main.py", line 43, in main
dcgan.train(FLAGS)
File "/data/project/DCGAN-tensorflow-master/model.py", line 204, in train
feed_dict={ self.z: batch_z })
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 340, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 564, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 637, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 659, in _do_call
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary_2
[[Node: HistogramSummary_2 = HistogramSummary[T=DT_FLOAT, device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary_2/tag, Sigmoid_1/126)]]
Caused by op u'HistogramSummary_2', defined at:
File "main.py", line 59, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "main.py", line 40, in main
dataset_name=FLAGS.dataset, is_crop=FLAGS.is_crop, checkpoint_dir=FLAGS.checkpoint_dir)
File "/data/project/DCGAN-tensorflow-master/model.py", line 69, in __init__
self.build_model()
File "/data/project/DCGAN-tensorflow-master/model.py", line 99, in build_model
self.d__sum = tf.histogram_summary("d", self.D)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py", line 113, in histogram_summary
tag=tag, values=values, name=scope)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 55, in _histogram_summary
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
self._traceback = _extract_stack()
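One thing I could try in order to find the first step at which the values go bad is a check on the fetched discriminator outputs before the summary is written. The snippet below is only a sketch of that idea: `find_nonfinite` is a hypothetical helper, not part of TensorFlow or DCGAN-tensorflow, and `d_outputs` is placeholder data standing in for the values that would be fetched for `self.D`.

```python
import math


def find_nonfinite(values):
    """Return the indices of NaN or infinite entries in a flat sequence of floats."""
    return [i for i, v in enumerate(values) if math.isnan(v) or math.isinf(v)]


# Placeholder values standing in for one batch of discriminator outputs.
d_outputs = [0.52, 0.47, float("nan"), 0.61]

bad = find_nonfinite(d_outputs)
if bad:
    # In a real training loop this would be the point to stop and inspect the step.
    print("non-finite discriminator outputs at indices:", bad)
```

This would not fix the NaN, but it would show whether the discriminator output is already non-finite before the histogram summary op runs.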
Any advice?
Thanks and best regards!