The error below is with TensorFlow 1.9. I also tried 1.8 and got a different error, which seems unrelated; I pasted that log after the first one (it starts at 04:13:10) for posterity.
W0727 03:45:33.424566 140200254113536 __init__.py:44] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
from . import file_cache
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
W0727 03:45:33.622759 140200254113536 tf_logging.py:125] Estimator's model_fn (<function resnet_model_fn at 0x7f82cd0518c0>) includes params argument, but params are not passed to Estimator.
I0727 03:45:33.623796 140200254113536 tf_logging.py:115] Using config: {'_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
job {
name: "worker"
tasks {
value: "10.240.1.2:8470"
}
}
}
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f82cc71dbd0>, '_model_dir': 'gs://basilica/tst', '_save_checkpoints_steps': 600, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=8, computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_cluster': <tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7f82cd0536d0>, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': None, '_evaluation_master': u'grpc://10.240.1.2:8470', '_global_id_in_cluster': 0, '_master': u'grpc://10.240.1.2:8470'}
I0727 03:45:33.624113 140200254113536 tf_logging.py:115] _TPUContext: eval_on_tpu True
I0727 03:45:33.624320 140200254113536 tf_logging.py:115] Precision: bfloat16
I0727 03:45:34.045315 140200254113536 tf_logging.py:115] Training for 112603 steps (90.00 epochs in total). Current step 0.
I0727 03:45:34.196645 140200254113536 tf_logging.py:115] Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
2018-07-27 03:45:34.203501: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
I0727 03:45:34.290231 140200254113536 tf_logging.py:115] Found TPU system:
I0727 03:45:34.290575 140200254113536 tf_logging.py:115] *** Num TPU Cores: 8
I0727 03:45:34.290924 140200254113536 tf_logging.py:115] *** Num TPU Workers: 1
I0727 03:45:34.290985 140200254113536 tf_logging.py:115] *** Num TPU Cores Per Worker: 8
I0727 03:45:34.291043 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1)
I0727 03:45:34.291238 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184)
I0727 03:45:34.291296 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184)
I0727 03:45:34.291357 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184)
I0727 03:45:34.291414 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184)
I0727 03:45:34.291465 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184)
I0727 03:45:34.291536 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184)
I0727 03:45:34.291625 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184)
I0727 03:45:34.291692 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184)
I0727 03:45:34.291764 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184)
I0727 03:45:34.291825 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184)
I0727 03:45:34.291878 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184)
I0727 03:45:34.306799 140200254113536 tf_logging.py:115] Calling model_fn.
I0727 03:45:45.303158 140200254113536 tf_logging.py:115] Create CheckpointSaverHook.
I0727 03:45:45.567073 140200254113536 tf_logging.py:115] Done calling model_fn.
I0727 03:45:48.642914 140200254113536 tf_logging.py:115] TPU job name worker
I0727 03:45:49.806272 140200254113536 tf_logging.py:115] Graph was finalized.
I0727 03:45:52.023736 140200254113536 tf_logging.py:115] Running local_init_op.
I0727 03:45:52.160645 140200254113536 tf_logging.py:115] Done running local_init_op.
I0727 03:45:59.174904 140200254113536 tf_logging.py:115] Saving checkpoints for 0 into gs://basilica/tst/model.ckpt.
I0727 03:46:06.400580 140200254113536 tf_logging.py:115] Installing graceful shutdown hook.
2018-07-27 03:46:06.401149: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
I0727 03:46:06.429027 140200254113536 tf_logging.py:115] Creating heartbeat manager for ['/job:tpu_worker/replica:0/task:0/device:CPU:0', '/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0']
W0727 03:46:06.478003 140200254113536 tf_logging.py:120] Worker heartbeats not supported by all workers. No failure handling will be enabled.
I0727 03:46:06.478421 140200254113536 tf_logging.py:115] Init TPU system
I0727 03:46:12.805318 140199236466432 tf_logging.py:115] Starting infeed thread controller.
I0727 03:46:12.806582 140199227746048 tf_logging.py:115] Starting outfeed thread controller.
I0727 03:46:12.936597 140200254113536 tf_logging.py:115] Enqueue next (100) batch(es) of data to infeed.
I0727 03:46:12.937171 140200254113536 tf_logging.py:115] Dequeue next (100) batch(es) of data from outfeed.
W0727 03:51:35.339397 140199236466432 tf_logging.py:125]
Error occurred during infeed/outfeed. This may be due to a compile error in the main session. Waiting for a short time for the main session to come back.
End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/enqueue/0"], output_shapes=[[224,224,3,128], [128]], output_types=[DT_BFLOAT16, DT_INT32], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
Caused by op u'input_pipeline_task0/while/IteratorGetNext', defined at:
File "resnet_main.py", line 506, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "resnet_main.py", line 480, in main
input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1132, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1992, in _call_model_fn
features, labels, mode, config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1107, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2212, in _model_fn
input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1001, in generate_infeed_enqueue_ops_and_dequeue_fn
self._invoke_input_fn_and_record_structure())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1087, in _invoke_input_fn_and_record_structure
wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2586, in _wrap_computation_in_while_loop
parallel_iterations=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3209, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2941, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2878, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2575, in computation
with ops.control_dependencies(op_fn()):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 794, in enqueue_ops_fn
features, labels = inputs.features_and_labels() # Calls get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2766, in features_and_labels
return _Inputs._parse_inputs(self._iterator.get_next())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1745, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/enqueue/0"], output_shapes=[[224,224,3,128], [128]], output_types=[DT_BFLOAT16, DT_INT32], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
E0727 03:51:40.460005 140199219025664 tf_logging.py:105] Feed error: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 434, in _run_infeed
session.run(self._enqueue_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
OutOfRangeError: End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/enqueue/0"], output_shapes=[[224,224,3,128], [128]], output_types=[DT_BFLOAT16, DT_INT32], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
Caused by op u'input_pipeline_task0/while/IteratorGetNext', defined at:
File "resnet_main.py", line 506, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "resnet_main.py", line 480, in main
input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1132, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1992, in _call_model_fn
features, labels, mode, config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1107, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2212, in _model_fn
input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1001, in generate_infeed_enqueue_ops_and_dequeue_fn
self._invoke_input_fn_and_record_structure())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1087, in _invoke_input_fn_and_record_structure
wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2586, in _wrap_computation_in_while_loop
parallel_iterations=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3209, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2941, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2878, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2575, in computation
with ops.control_dependencies(op_fn()):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 794, in enqueue_ops_fn
features, labels = inputs.features_and_labels() # Calls get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2766, in features_and_labels
return _Inputs._parse_inputs(self._iterator.get_next())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1745, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): End of sequence
[[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/enqueue/0"], output_shapes=[[224,224,3,128], [128]], output_types=[DT_BFLOAT16, DT_INT32], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]
E0727 03:51:40.460639 140199219025664 tf_logging.py:105] Closing session. A RuntimeError should follow.
W0727 03:52:01.983748 140199227746048 tf_logging.py:125]
Error occurred during infeed/outfeed. This may be due to a compile error in the main session. Waiting for a short time for the main session to come back.
Step was cancelled by an explicit call to `Session::Close()`.
Traceback (most recent call last):
File "resnet_main.py", line 506, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "resnet_main.py", line 480, in main
input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1135, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1336, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 577, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1053, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1144, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1129, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 981, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to `Session::Close()`.
W0727 04:13:10.758173 140606586783552 tf_logging.py:125] Estimator's model_fn (<function resnet_model_fn at 0x7fe174596050>) includes params argument, but params are not passed to Estimator.
I0727 04:13:10.759254 140606586783552 tf_logging.py:115] Using config: {'_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
job {
name: "worker"
tasks {
value: "10.0.16.122:8470"
}
}
}
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe174597210>, '_model_dir': 'gs://basilica/tst2', '_save_checkpoints_steps': 600, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=8, computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_cluster': <tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7fe174597090>, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': None, '_evaluation_master': 'grpc://10.0.16.122:8470', '_global_id_in_cluster': 0, '_master': 'grpc://10.0.16.122:8470'}
I0727 04:13:10.759718 140606586783552 tf_logging.py:115] _TPUContext: eval_on_tpu True
I0727 04:13:10.759957 140606586783552 tf_logging.py:115] Precision: bfloat16
I0727 04:13:10.900378 140606586783552 tf_logging.py:115] Training for 112603 steps (90.00 epochs in total). Current step 0.
I0727 04:13:11.008833 140606586783552 tf_logging.py:115] Querying Tensorflow master (grpc://10.0.16.122:8470) for TPU system metadata.
2018-07-27 04:13:11.010382: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
I0727 04:13:11.118653 140606586783552 tf_logging.py:115] Found TPU system:
I0727 04:13:11.119147 140606586783552 tf_logging.py:115] *** Num TPU Cores: 8
I0727 04:13:11.119513 140606586783552 tf_logging.py:115] *** Num TPU Workers: 1
I0727 04:13:11.119638 140606586783552 tf_logging.py:115] *** Num TPU Cores Per Worker: 8
I0727 04:13:11.119782 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1)
I0727 04:13:11.120002 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184)
I0727 04:13:11.120161 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184)
I0727 04:13:11.120342 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184)
I0727 04:13:11.120484 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184)
I0727 04:13:11.120628 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184)
I0727 04:13:11.120737 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184)
I0727 04:13:11.120873 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184)
I0727 04:13:11.120979 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184)
I0727 04:13:11.121112 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184)
I0727 04:13:11.121259 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184)
I0727 04:13:11.121380 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184)
I0727 04:13:11.136253 140606586783552 tf_logging.py:115] Calling model_fn.
I0727 04:13:20.833169 140606586783552 tf_logging.py:115] Create CheckpointSaverHook.
I0727 04:13:21.049246 140606586783552 tf_logging.py:115] Done calling model_fn.
I0727 04:13:24.118320 140606586783552 tf_logging.py:115] TPU job name worker
I0727 04:13:25.411591 140606586783552 tf_logging.py:115] Graph was finalized.
Traceback (most recent call last):
File "/tensorflow_tpu_models/models/official/resnet/resnet_main.py", line 506, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/tensorflow_tpu_models/models/official/resnet/resnet_main.py", line 480, in main
input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1135, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1333, in _train_with_estimator_spec
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 415, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 826, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 549, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1012, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1017, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 706, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 287, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'MapAndBatchDatasetV2' in binary running on n-a432ad87-w-0. Make sure the Op and Kernel are registered in the binary running in this process.
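
For reference, a rough back-of-the-envelope check of how much data the first infeed loop in the first log asks for before the `OutOfRangeError: End of sequence` appears. This is only a sketch under stated assumptions: it assumes each `IteratorGetNext` call yields one per-shard batch of 128 (read off `output_shapes=[[224,224,3,128], [128]]`) and that the host input pipeline calls it once per shard per iteration. The function name is hypothetical and not part of TensorFlow or `resnet_main.py`.

```python
def examples_needed_for_first_loop(iterations_per_loop, num_shards, per_shard_batch):
    """Examples the input pipeline must yield to fill one infeed loop.

    Assumption (not confirmed by the log): one get_next() call per shard
    per iteration, each returning `per_shard_batch` examples.
    """
    return iterations_per_loop * num_shards * per_shard_batch


# Values read from the log: TPUConfig(iterations_per_loop=100, num_shards=8)
# and a batch dimension of 128 in the IteratorGetNext output shapes.
needed = examples_needed_for_first_loop(100, 8, 128)
print(needed)
```

Under those assumptions, a dataset behind `imagenet_train.input_fn` that yields fewer examples than this without repeating would be one possible explanation for the iterator running out during the first "Enqueue next (100) batch(es)" step. The log alone does not confirm that this is the cause.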