tpu's Introduction

Cloud TPUs

This repository is a collection of reference models and tools used with Cloud TPUs.

The fastest way to get started training a model on a Cloud TPU is by following the tutorial. Click the button below to launch the tutorial using Google Cloud Shell.

Open in Cloud Shell

Note: This repository is a public mirror; pull requests will not be accepted. Please file an issue if you have a feature request or bug report.

Running Models

To run models in the models subdirectory, you may need to add the top-level /models folder to the Python path with the command:

export PYTHONPATH="$PYTHONPATH:/path/to/models"

tpu's Issues

Unable to train deeplab on tpu

I have been testing TPUs for the past month and noticed that tf.image.resize_images cannot be used in the training phase.

I am trying to train tpu/models/experimental/deeplab, which uses tf.image.resize_bilinear at the final logits layer, but training appears to get stuck as follows
(there is no further output for 4 hours):

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:TPU job name worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from gs://${BUCKET_NAME}/pretrained/resnet_v1_101/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Start infeed thread controller
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Start outfeed thread controller
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (2000) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (2000) batch(es) of data from outfeed.

I have tested the same code on the TPU with both TensorFlow 1.8 and 1.9.

If I change it to tf.layers.conv2d_transpose, it works well (e.g. the loss and other debug messages are shown continuously), although the final result might be wrong.
Is there a way to use tf.image.resize_images on a TPU?
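
One workaround that avoids both tf.image.resize_images and a learned deconvolution is to express the bilinear upsampling as a tf.nn.conv2d_transpose with a fixed bilinear kernel; the author already observed that conv2d_transpose runs on the TPU. Below is a minimal sketch of that idea, not the deeplab code: the helper names, the upsampling factor, and the assumption of fully static shapes (which TPU graphs require anyway) are all illustrative.

import numpy as np
import tensorflow as tf

def bilinear_kernel(factor, num_channels):
  # Builds a [k, k, C, C] kernel whose transposed convolution is
  # bilinear upsampling by `factor`.
  size = 2 * factor - factor % 2
  center = factor - 1 if size % 2 == 1 else factor - 0.5
  og = np.ogrid[:size, :size]
  filt = ((1 - abs(og[0] - center) / factor) *
          (1 - abs(og[1] - center) / factor))
  weights = np.zeros((size, size, num_channels, num_channels), np.float32)
  for c in range(num_channels):
    weights[:, :, c, c] = filt
  return weights

def bilinear_upsample(logits, factor):
  # Upsamples NHWC logits by `factor` with a fixed (non-trainable) kernel.
  batch, height, width, channels = logits.shape.as_list()
  kernel = tf.constant(bilinear_kernel(factor, channels))
  return tf.nn.conv2d_transpose(
      logits, kernel,
      output_shape=[batch, height * factor, width * factor, channels],
      strides=[1, factor, factor, 1], padding='SAME')

Alternatively, the resize can be moved off the TPU entirely by computing the loss at the logits resolution and downsampling the labels in the input pipeline on the host CPU.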

[Retinanet Training] TypeError: batch() got an unexpected keyword argument 'drop_remainder'

I ran this code on a TPU using Python 2.7 and TensorFlow 1.9.
I was using the RetinaNet TPU tutorial as a template, substituting in my own data.

RESNET_CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603
MODEL_DIR=${GCS_BUCKET}/retinanet/retinanet-model

python tpu/models/official/retinanet/retinanet_main.py \
  --tpu=${TPU_NAME} \
  --train_batch_size=64 \
  --training_file_pattern=${GCS_BUCKET}/data/coco/train-* \
  --resnet_checkpoint=${RESNET_CHECKPOINT} \
  --model_dir=${MODEL_DIR} \
  --hparams=image_size=640 \
  --num_examples_per_epoch=6400 \
  --num_epochs=1

This is my error:
Traceback (most recent call last):
File "tpu/models/official/retinanet/retinanet_main.py", line 295, in
tf.app.run(main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "tpu/models/official/retinanet/retinanet_main.py", line 171, in main
FLAGS.train_batch_size))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1132, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1992, in _call_model_fn
features, labels, mode, config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1107, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2212, in _model_fn
input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1001, in generate_infeed_enqueue_ops_and_dequeue_fn
self._invoke_input_fn_and_record_structure())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1059, in _invoke_input_fn_and_record_structure
self._inputs_structure_recorder, host_device, host_id))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 771, in generate_per_host_v2_enqueue_ops_fn_for_host
inputs = _Inputs.from_input_fn(input_fn(user_context))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2163, in _input_fn
return input_fn(**kwargs)
File "/home/aaflier/tpu/models/official/retinanet/dataloader.py", line 353, in call
dataset = dataset.batch(batch_size, drop_remainder=True)
TypeError: batch() got an unexpected keyword argument 'drop_remainder'

Has anyone encountered this issue? It looks like a version-incompatibility problem, but I'm not sure.
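
This does look like a version mismatch: the drop_remainder keyword for Dataset.batch() was only added in TensorFlow 1.10, so with TF 1.9 the dataloader hits this TypeError. A minimal sketch of the pre-1.10 equivalent (the dataset here is just a stand-in for the real input pipeline):

import tensorflow as tf

dataset = tf.data.Dataset.range(100)  # stand-in for the RetinaNet dataset
batch_size = 64

# TF >= 1.10:
#   dataset = dataset.batch(batch_size, drop_remainder=True)
# TF 1.9 equivalent:
dataset = dataset.apply(
    tf.contrib.data.batch_and_drop_remainder(batch_size))

Alternatively, running a TensorFlow version >= 1.10 on both the VM and the TPU should make the original line work unchanged.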

Prediction fails

https://github.com/tensorflow/tpu/tree/master/models/experimental/resnet_bfloat16

The link above says "To run the same code on CPU/GPU, set the flag --use_tpu=False", but after training on the TPU, evaluation and prediction fail with the following error:

InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'Conv2D' with these attrs.  Registered devices: [CPU], Registered kernels:
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_HALF]

         [[Node: bfloat16/conv2d/Conv2D = Conv2D[T=DT_BFLOAT16, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true](bfloat16/Pad, Cast)]]
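
The error shows that no bfloat16 Conv2D kernel is registered on CPU, so a graph built inside the bfloat16 scope cannot run there. A common workaround is to enter the bfloat16 scope only when actually running on the TPU. This is a minimal sketch under that assumption, not the exact resnet_bfloat16 code; network_fn is a stand-in for the real model body.

import tensorflow as tf

def network_fn(features):
  # Stand-in for the real ResNet body.
  return tf.layers.conv2d(features, filters=10, kernel_size=3)

def build_logits(features, use_tpu):
  if use_tpu:
    with tf.contrib.tpu.bfloat16_scope():
      logits = network_fn(features)
    # Cast back so the loss/metrics (and CPU/GPU kernels) see float32.
    return tf.cast(logits, tf.float32)
  return network_fn(features)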

`ctpu status` crashes

I'm using ctpu version 1.3 downloaded from https://dl.google.com/cloud_tpu/ctpu/latest/linux/ctpu .

I get the following stack trace:

mlucy@eve:~$ ctpu status
Your cluster is running!
        Compute Engine VM:  RUNNING
        Cloud TPU:          RUNNING
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x671b7e]

goroutine 1 [running]:
github.com/tensorflow/tpu/tools/ctpu/commands.(*statusCmd).Execute(0xc4200915f0, 0x770040, 0xc4200a0000, 0xc42009e780, 0x0, 0x0, 0x0, 0x6dddc0)
        /tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/commands/status.go:214 +0x5ce
github.com/google/subcommands.(*Commander).Execute(0xc4200a4000, 0x770040, 0xc4200a0000, 0x0, 0x0, 0x0, 0x5)
        /tmp/ctpu-release/src/github.com/google/subcommands/subcommands.go:141 +0x29f
github.com/google/subcommands.Execute(0x770040, 0xc4200a0000, 0x0, 0x0, 0x0, 0xc4200b0540)
        /tmp/ctpu-release/src/github.com/google/subcommands/subcommands.go:385 +0x5f
main.main()
        /tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/main.go:87 +0xd5e

trying the google cloud shell tutorial ...

Solved: I think I've "solved" my problem by just "upgrading" my account, so please ignore.

Was:
While trying the Google Cloud Shell tutorial, having set up a Google Cloud account, I get:

2018/05/14 15:13:25 googleapi: Error 403: cannot create a TPU Node on a project linked to a free-trial billing account, forbidden

Can someone fix this please and/or tell me what else I can do to make it work? Thanks.

ctpu not working when username has an underscore character

When attempting to allocate a new flock with ctpu (e.g. ctpu up) using a username that contains an underscore character (as in first_last), ctpu fails with the following error:

2018/03/13 14:35:08 googleapi: Error 400: Invalid value for field 'resource.name': 'first_last'. Must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)', invalid

This can be avoided by using ctpu -name=first-last up; however, it would be nice if ctpu handled this more gracefully.

Very slow training when following resnet tutorial, possibly because of file_cache error

I'm following the instructions in https://cloud.google.com/tpu/docs/tutorials/resnet , and I get about 40 examples/sec of throughput:

I0724 03:44:09.470874 140435012744960 tf_logging.py:115] loss = 1.5828536, step = 200 (2471.960 sec)
I0724 03:44:09.472132 140435012744960 tf_logging.py:115] global_step/sec: 0.0404537
I0724 03:44:09.942822 140435012744960 tf_logging.py:115] examples/sec: 41.4246

The following error occurs earlier in the output:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth

Here's the whole log, if it's helpful:

mlucy@mlucy:/usr/share/tpu/models/official/resnet$ python resnet_main.py   --tpu=$TPU_NAME   --data_dir=$DATA_DIR   --model_dir=${STORAGE_BUCKET}/images/generic/resnet
W0724 02:39:43.722238 140435012744960 __init__.py:44] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
W0724 02:39:43.877512 140435012744960 tf_logging.py:125] Estimator's model_fn (<function resnet_model_fn at 0x7fb975b9ec08>) includes params argument, but params are not passed to Estimator.
I0724 02:39:43.878710 140435012744960 tf_logging.py:115] Using config: {'_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      value: "10.240.1.2:8470"
    }
  }
}
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb97528d450>, '_model_di
r': 'gs://basilica/images/generic/resnet', '_save_checkpoints_steps': 600, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tpu_config': TPUConfig(iterations_per_loop=10
0, num_shards=8, computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_clust
er': <tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7fb9758ce290>, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': None, '_eval
uation_master': u'grpc://10.240.1.2:8470', '_global_id_in_cluster': 0, '_master': u'grpc://10.240.1.2:8470'}
I0724 02:39:43.878993 140435012744960 tf_logging.py:115] _TPUContext: eval_on_tpu True
I0724 02:39:43.879216 140435012744960 tf_logging.py:115] Precision: bfloat16
I0724 02:39:44.232106 140435012744960 tf_logging.py:115] Training for 112603 steps (90.00 epochs in total). Current step 0.
I0724 02:39:44.529622 140435012744960 tf_logging.py:115] Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
2018-07-24 02:39:44.531309: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
I0724 02:39:44.601207 140435012744960 tf_logging.py:115] Found TPU system:
I0724 02:39:44.601444 140435012744960 tf_logging.py:115] *** Num TPU Cores: 8
I0724 02:39:44.601916 140435012744960 tf_logging.py:115] *** Num TPU Workers: 1
I0724 02:39:44.602014 140435012744960 tf_logging.py:115] *** Num TPU Cores Per Worker: 8
I0724 02:39:44.602096 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1)
I0724 02:39:44.602328 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184)
I0724 02:39:44.602428 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184)
I0724 02:39:44.602513 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184)
I0724 02:39:44.602597 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184)
I0724 02:39:44.602674 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184)
I0724 02:39:44.602754 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184)
I0724 02:39:44.602828 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184)
I0724 02:39:44.602900 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184)
I0724 02:39:44.602974 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184)
I0724 02:39:44.603046 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184)
I0724 02:39:44.603120 140435012744960 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184)
I0724 02:39:44.616926 140435012744960 tf_logging.py:115] Calling model_fn.
I0724 02:39:53.176857 140435012744960 tf_logging.py:115] Create CheckpointSaverHook.
I0724 02:39:53.386686 140435012744960 tf_logging.py:115] Done calling model_fn.
I0724 02:39:58.610001 140435012744960 tf_logging.py:115] TPU job name worker
I0724 02:39:59.607289 140435012744960 tf_logging.py:115] Graph was finalized.
I0724 02:40:02.081937 140435012744960 tf_logging.py:115] Running local_init_op.
I0724 02:40:02.220175 140435012744960 tf_logging.py:115] Done running local_init_op.
I0724 02:40:09.082227 140435012744960 tf_logging.py:115] Saving checkpoints for 0 into gs://basilica/images/generic/resnet/model.ckpt.
I0724 02:40:16.276730 140435012744960 tf_logging.py:115] Installing graceful shutdown hook.
2018-07-24 02:40:16.277159: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
I0724 02:40:16.281248 140435012744960 tf_logging.py:115] Creating heartbeat manager for ['/job:tpu_worker/replica:0/task:0/device:CPU:0', '/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0']
W0724 02:40:16.290678 140435012744960 tf_logging.py:120] Worker heartbeats not supported by all workers.  No failure handling will be enabled.
I0724 02:40:16.290915 140435012744960 tf_logging.py:115] Init TPU system
I0724 02:40:24.871825 140433930057472 tf_logging.py:115] Starting infeed thread controller.
I0724 02:40:24.872430 140433921664768 tf_logging.py:115] Starting outfeed thread controller.
I0724 02:40:25.006061 140435012744960 tf_logging.py:115] Enqueue next (100) batch(es) of data to infeed.
I0724 02:40:25.006603 140435012744960 tf_logging.py:115] Dequeue next (100) batch(es) of data from outfeed.
I0724 03:02:57.510490 140435012744960 tf_logging.py:115] loss = 1.4244831, step = 100
I0724 03:02:57.512698 140435012744960 tf_logging.py:115] Enqueue next (100) batch(es) of data to infeed.
I0724 03:02:57.512928 140435012744960 tf_logging.py:115] Dequeue next (100) batch(es) of data from outfeed.
I0724 03:44:09.470874 140435012744960 tf_logging.py:115] loss = 1.5828536, step = 200 (2471.960 sec)
I0724 03:44:09.472132 140435012744960 tf_logging.py:115] global_step/sec: 0.0404537
I0724 03:44:09.942822 140435012744960 tf_logging.py:115] examples/sec: 41.4246
I0724 03:44:09.944439 140435012744960 tf_logging.py:115] Enqueue next (100) batch(es) of data to infeed.
I0724 03:44:09.944626 140435012744960 tf_logging.py:115] Dequeue next (100) batch(es) of data from outfeed.
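
For what it's worth, the file_cache ImportError only affects the googleapiclient discovery cache and is almost certainly not the cause of the slowdown; throughput this low usually points at the input pipeline or at data that is not in a nearby GCS bucket. Below is a minimal sketch, not the resnet_main.py code, of the tf.data settings that usually matter for TPU throughput; the names and constants are placeholders.

import tensorflow as tf

def make_dataset(file_pattern, batch_size, parse_fn):
  # Read many shards in parallel, parse in parallel, and keep the
  # accelerator fed with prefetched batches.
  files = tf.data.Dataset.list_files(file_pattern)
  dataset = files.apply(
      tf.contrib.data.parallel_interleave(
          tf.data.TFRecordDataset, cycle_length=16, sloppy=True))
  dataset = dataset.map(parse_fn, num_parallel_calls=64)
  dataset = dataset.shuffle(1024).repeat()
  dataset = dataset.apply(
      tf.contrib.data.batch_and_drop_remainder(batch_size))
  return dataset.prefetch(2)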

UnavailableError: Socket closed : amoeba_net

Using:

code: experimental/amoeba_net/amoeba_net.py
branch: r1.8

I am able to train the model using amoeba_net_x (x = a, b, c, d). However, when I change the cell to a custom cell (keeping all other parameters the same), I am unable to train the model. Training starts normally, but I keep getting a Socket closed error during training (most of the cycles fail with this error; some don't).

Complete Logs:

I0706 19:04:54.077939 140690244269824 tf_logging.py:116] Graph was finalized.
I0706 19:04:54.352663 140690244269824 tf_logging.py:116] Restoring parameters from <data_dir>/model.ckpt-6000
I0706 19:07:55.526499 140690244269824 tf_logging.py:116] Running local_init_op.
I0706 19:08:00.698863 140690244269824 tf_logging.py:116] Done running local_init_op.
I0706 19:08:27.570785 140690244269824 tf_logging.py:116] Init TPU system
I0706 19:08:44.705221 140690244269824 tf_logging.py:116] Start infeed thread controller
I0706 19:08:44.705890 140687278339840 tf_logging.py:116] Starting infeed thread controller.
I0706 19:08:44.706137 140690244269824 tf_logging.py:116] Start outfeed thread controller
I0706 19:08:44.706412 140687290922752 tf_logging.py:116] Starting outfeed thread controller.
I0706 19:08:55.460019 140690244269824 tf_logging.py:116] Enqueue next (500) batch(es) of data to infeed.
I0706 19:08:55.460402 140690244269824 tf_logging.py:116] Dequeue next (500) batch(es) of data from outfeed.
W0706 19:21:49.613796 140687278339840 tf_logging.py:126] 

Error occurred during infeed/outfeed.  This may be due to a compile error in the main session.  Waiting for a short time for the main session to come back.

Socket closed
I0706 19:21:49.615422 140690244269824 tf_logging.py:116] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: Socket closed
W0706 19:21:49.616436 140687290922752 tf_logging.py:126] 

Error occurred during infeed/outfeed.  This may be due to a compile error in the main session.  Waiting for a short time for the main session to come back.

Socket closed
E0706 19:21:54.622838 140687519512320 tf_logging.py:106] Feed error: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 447, in _run_outfeed
    session.run(self._dequeue_ops)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: Socket closed

E0706 19:21:54.623138 140687519512320 tf_logging.py:106] Closing session.  A RuntimeError should follow.
I0706 19:21:55.618206 140690244269824 tf_logging.py:116] Graph was finalized.
I0706 19:21:55.947227 140690244269824 tf_logging.py:116] Restoring parameters from <data_dir>/model.ckpt-6000

and it repeats like this indefinitely.

`tf.Print` doesn't print on CPU.

How does one get tf.Print working on the TPU? I tried running on the CPU with use_tpu set to False and added a few tf.Print calls inside the model_fn. Even though it runs without errors, I am not able to see the output of the tf.Print calls.

Using r1.6 branch and official/resnet code.
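
One subtle point worth checking: tf.Print only prints when the tensor it returns is actually evaluated; if the return value is discarded, the op is pruned from the graph and nothing appears. A minimal standalone sketch (not the resnet code):

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Discarded result: this print op is never part of anything that is fetched.
tf.Print(x, [x], message='never printed: ')

# Kept result: the message is written to stderr when `y` is evaluated.
x = tf.Print(x, [tf.reduce_mean(x)], message='mean of x: ')
y = x * 2.0

with tf.Session() as sess:
  sess.run(y)

Also note that tf.Print writes to stderr, which some logging setups swallow, and inside an actual TPU computation it is generally not supported; host_call-based summaries are the usual substitute there.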

ctpu version issues

Currently, when I query the available versions with ctpu, I get the following output:

$ ctpu tf-versions
2018/03/13 14:24:49 WARNING: Setting zone to "us-central1-c"
Cloud TPU TensorFlow Versions:
        1.7     (default version)
        1.6

If I then attempt to start a flock, the following error is generated:

$ ctpu up
2018/03/13 14:34:07 WARNING: Setting zone to "us-central1-c"
2018/03/13 14:34:08 Creating GCE VM first-last (this may take a minute)...
2018/03/13 14:34:08 Creating TPU first-last (this may take a few minutes)...
2018/03/13 14:34:08 could not create GCE Instance without a base image

Also, when starting a new flock with $ ctpu up -tf-version=1.6 the version of TF on the resulting VM is 1.6.0-rc0, not the latest release version.

Resnet model doesn't work with no TPU and `use_tpu=false` set.

https://github.com/tensorflow/tpu/tree/master/models/official/resnet doesn't appear to work correctly when --use_tpu=false is passed.

When I do so (and no TPU is attached to the machine), I get this error:

Traceback (most recent call last):
  File "/tensorflow_tpu_models/models/official/resnet/resnet_main.py", line 506, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tensorflow_tpu_models/models/official/resnet/resnet_main.py", line 395, in main
    project=FLAGS.gcp_project)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 150, in __init__
    self._tpu = compat.as_bytes(tpu)  # self._tpu is always bytes
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/compat.py", line 61, in as_bytes
    (bytes_or_text,))
TypeError: Expected binary or unicode string, got None

(I'm using TensorFlow 1.9.)
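
The traceback shows TPUClusterResolver being constructed with tpu=None, which is what raises the TypeError. A common fix is to build the resolver only when use_tpu is set; a minimal sketch under that assumption (flag names are simply mirrored from the traceback, not the repository's exact code):

import tensorflow as tf

def make_run_config(use_tpu, tpu=None, tpu_zone=None, gcp_project=None,
                    model_dir=None):
  if use_tpu:
    resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
        tpu, zone=tpu_zone, project=gcp_project)
  else:
    # No TPU attached: skip the resolver entirely so we never pass
    # tpu=None into it; RunConfig then falls back to local devices.
    resolver = None
  return tf.contrib.tpu.RunConfig(
      cluster=resolver,
      model_dir=model_dir,
      tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100,
                                          num_shards=8))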

Do TPUs run all possible tensorflow computation graphs?

I want to create only computation graphs that are compatible with TPUs, but I cannot find documentation on how to do that. Compounding the issue is the fact that I do not intend to ever do TensorFlow development with billing enabled.

Please, Google, publish a detailed breakdown of what is and is not supported on TPUs, or create a stripped-down Cloud TPU environment that does not return results but exists only to test whether the code generates computation graphs that can run on a TPU.

AttributeError when running resnet_benchmark.py

Hi!

I am getting the following problem when running the resnet benchmark on a TPU. Any idea how to fix this? I googled already but couldn't find any pointers. The relevant part is the AttributeError. I am running TF 1.9 on my machine.

cezary@cezary:/usr/share/tpu/models/official/resnet/benchmark$ python resnet_benchmark.py --tpu=$TPU_NAME --mode=train --data_dir=$DATA_DIR --model_dir=gs://cezary-bucket/resnet --train_batch_size=1024 --train_steps=112590 --iterations_per_loop=1251
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._conv import register_converters as _register_converters
[the same "numpy.dtype size changed" RuntimeWarning is repeated for many further h5py and scipy imports]
W0808 14:05:08.016391 140579100919552 __init__.py:44] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
W0808 14:05:08.232148 140579100919552 tf_logging.py:125] Estimator's model_fn (<function resnet_model_fn at 0x7fdafd451b90>) includes params argument, but params are not passed to Estimator.
I0808 14:05:08.233087 140579100919552 tf_logging.py:115] Using config: {'_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true, cluster_def { job { name: "worker" tasks { value: "10.240.1.2:8470" } } }, '_keep_checkpoint_max': None, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fdafcdf39d0>, '_model_dir': 'gs://cezary-bucket/resnet', '_save_checkpoints_steps': 1251, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tpu_config': TPUConfig(iterations_per_loop=1251, num_shards=8, computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_cluster': <tensorflow.contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7fdafd457a50>, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': None, '_evaluation_master': u'grpc://10.240.1.2:8470', '_global_id_in_cluster': 0, '_master': u'grpc://10.240.1.2:8470'}
I0808 14:05:08.233470 140579100919552 tf_logging.py:115] _TPUContext: eval_on_tpu True
Traceback (most recent call last):
  File "resnet_benchmark.py", line 152, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "resnet_benchmark.py", line 82, in main
    batches_per_epoch = resnet_main.NUM_TRAIN_IMAGES / FLAGS.train_batch_size
AttributeError: 'module' object has no attribute 'NUM_TRAIN_IMAGES'
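
The benchmark script expects a NUM_TRAIN_IMAGES constant in resnet_main.py that the installed copy evidently does not define, which suggests a branch mismatch between the two files. A minimal local workaround sketch, assuming the standard ImageNet-2012 training-set size of 1,281,167 images that the ResNet code targets:

import resnet_main  # assumes resnet_main.py is importable, as in the script

# Fall back to the ImageNet-2012 training-set size if the constant is
# missing from this copy of resnet_main.py.
num_train_images = getattr(resnet_main, 'NUM_TRAIN_IMAGES', 1281167)

train_batch_size = 1024  # matches --train_batch_size above
batches_per_epoch = num_train_images // train_batch_size

Checking out matching versions of resnet_benchmark.py and resnet_main.py is the cleaner fix.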

unable to use ctpu again

@saeta My computer suddenly shut down yesterday while my VM was running. Now I am unable to get into the VM with $ ctpu up; I get the following error:
2018/07/11 13:03:47 googleapi: Error 403: Read access to project 'test' was denied, forbidden

Is it possible to train multiple models on the same TPU instance?

Trying to train two models on the same TPU instance gives the following error:

  File "resnet_main.py", line 430, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
    _sys.exit(main(argv))
  File "resnet_main.py", line 371, in main
    input_fn=ImageNetInput(True), max_steps=next_checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 314, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 378, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 785, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 509, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 970, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 975, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 672, in create_session
    hook.after_create_session(self.tf_sess, self.coord)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 548, in after_create_session
    options=config_pb2.RunOptions(timeout_in_ms=5*60*1000))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DeadlineExceededError: Deadline Exceeded

resnet_main: use_tpu=False fails on eval

When testing with use_tpu=False and data_format=channels_first on a machine with a GPU, I get the following error:

Traceback (most recent call last):
  File "tinyin_main.py", line 428, in <module>
    tf.app.run()
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "tinyin_main.py", line 414, in main
    steps=NUM_EVAL_IMAGES // FLAGS.eval_batch_size)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 416, in evaluate
    name=name)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 930, in _evaluate_model
    features, labels, model_fn_lib.ModeKeys.EVAL, self.config)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 804, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1810, in _model_fn
    return model_fn_wrapper.call_without_tpu(features, labels)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1040, in call_without_tpu
    return self._call_model_fn(features, labels)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1229, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "tinyin_main.py", line 203, in resnet_model_fn
    inputs=features, is_training=(mode == tf.estimator.ModeKeys.TRAIN))
  File "/home/karan/metalearning-project/tpu/models/official/resnet/tinyin_model.py", line 277, in model
    data_format=data_format)
  File "/home/karan/metalearning-project/tpu/models/official/resnet/tinyin_model.py", line 127, in conv2d_fixed_padding
    data_format=data_format)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/python/layers/convolutional.py", line 619, in conv2d
    return layer.apply(inputs)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 815, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 688, in __call__
    self.build(input_shapes)
  File "/home/karan/anaconda2/envs/tf/lib/python2.7/site-packages/tensorflow/python/layers/convolutional.py", line 133, in build
    raise ValueError('The channel dimension of the inputs '
ValueError: The channel dimension of the inputs should be defined. Found `None`.
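
The conv layer cannot build because the channel dimension is statically unknown at eval time. One common fix is to give the image tensor an explicit static shape in the input_fn (or right after any NHWC-to-NCHW transpose). This is a minimal sketch under that assumption; the image size (64, as for Tiny ImageNet) and the helper name are guesses, not the project's code.

import tensorflow as tf

def set_static_shape(images, image_size=64, channels=3,
                     data_format='channels_first'):
  # Make the channel dimension statically known so tf.layers.conv2d
  # can build its kernel; the batch dimension may remain unknown.
  if data_format == 'channels_first':
    images.set_shape([None, channels, image_size, image_size])
  else:
    images.set_shape([None, image_size, image_size, channels])
  return images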

ctpu go issue

Hi, I followed the instructions for downloading the ctpu tool to use locally but came across this error:

./ctpu status
Error encountered while creating configuration: user: Current not implemented on linux/amd64

ctpu up failed. Maybe the recently code .

In the past several days, I used 'ctpu up' to create a VM & TPU.
It worked well.

But today it has failed every time I have tried.
This is the output:

aaflier@cloudshell:~ (ck-augment-208602)$ ctpu up
ctpu will use the following configuration:

  Name:                 aaflier
  Zone:                 us-central1-b
  GCP Project:          ck-augment-208602
  TensorFlow Version:   1.9
  VM:
      Machine Type:     n1-standard-2
      Disk Size:        250 GB
      Preemptible:      false
  Cloud TPU:
      Size:             v2-8
      Preemptible:      false

OK to create your Cloud TPU resources with the above configuration? [Yn]:                                         Y
2018/08/01 16:20:02 Creating Compute Engine VM aaflier (this may take a minute)...
2018/08/01 16:20:02 Creating TPU aaflier (this may take a few minutes)...
2018/08/01 16:20:09 Created Compute Engine VM aaflier!
2018/08/01 16:20:10 TPU operation still running...
2018/08/01 16:20:32 TPU operation still running...
2018/08/01 16:20:54 TPU operation still running...
2018/08/01 16:21:17 TPU operation still running...
2018/08/01 16:21:39 TPU operation still running...
2018/08/01 16:22:01 TPU operation still running...
2018/08/01 16:22:24 TPU operation still running...
2018/08/01 16:22:46 TPU operation still running...
2018/08/01 16:23:08 TPU operation still running...
2018/08/01 16:23:30 TPU operation still running...
2018/08/01 16:23:52 TPU operation still running...
2018/08/01 16:24:13 Created TPU aaflier!
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x6677b3]
goroutine 1 [running]:
github.com/tensorflow/tpu/tools/ctpu/ctrl.(*ResourceManagementCP).IsProjectInGoogleOrg(0xc4200511b0, 0x773340, 0xc420052dc0, 0xc42000c798)
        /tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/ctrl/resourcemgmt.go:119 +0xe3
github.com/tensorflow/tpu/tools/ctpu/commands.(*upCmd).Execute(0xc4200795e0, 0x773380, 0xc4200160f0, 0xc4200568a0, 0x0, 0x0, 0x0, 0x6e05a0)
        /tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/commands/up.go:449 +0x2c3
github.com/google/subcommands.(*Commander).Execute(0xc420070000, 0x773380, 0xc4200160f0, 0x0, 0x0, 0x0, 0x5)
        /tmp/ctpu-release/src/github.com/google/subcommands/subcommands.go:141 +0x29f
github.com/google/subcommands.Execute(0x773380, 0xc4200160f0, 0x0, 0x0, 0x0, 0xc420052700)
        /tmp/ctpu-release/src/github.com/google/subcommands/subcommands.go:385 +0x5f
main.main()
        /tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/main.go:87 +0xd5e
aaflier@cloudshell:~ (ck-augment-208602)$ ctpu delete

This error, which is the Go equivalent of a null pointer exception, appears to be thrown from this[1] goroutine, which checks whether your project belongs to the google.com organization (such projects are subject to some different restrictions). I can see that your project does not belong to any organization, which could explain the nil pointer. Also, considering that this piece of code was added recently (the corresponding merge seems to be from two days ago[2]), it would make sense that you didn't face this problem earlier.

If that is indeed the case, this is an issue on our side. However, I would like to communicate this to my colleagues to confirm my thoughts. I will follow up with you later within the day.

[1] https://github.com/tensorflow/tpu/blob/master/tools/ctpu/ctrl/resourcemgmt.go#L111
[2] 9f66664#diff-44649e4ff88feeb0adeb5df4b70076ee

Error occurred during infeed/outfeed when training ResNet on custom data

I0530 14:40:12.307559 140105226974976 tf_logging.py:116] Enqueue next (100) batch(es) of data to infeed.
I0530 14:40:12.308238 140105226974976 tf_logging.py:116] Dequeue next (100) batch(es) of data from outfeed.
W0530 14:40:14.046279 140104498669312 tf_logging.py:126]

Error occurred during infeed/outfeed. This may be due to a compile error in the main session. Waiting for a short time for the main session to come back.

End of sequence
[[Node: input_pipeline_task0/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/InfeedQueue/split/1"], output_shapes=[[1024,224,224,3], [1024]], output_types=[DT_FLOAT, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]

Caused by op u'input_pipeline_task0/IteratorGetNext', defined at:
File "resnet_main_webface_V01.py", line 508, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "resnet_main_webface_V01.py", line 440, in main
input_fn=ImageNetInput(True), max_steps=next_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 793, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2065, in _model_fn
input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1149, in generate_infeed_enqueue_ops_and_dequeue_fn
self._invoke_input_fn_and_record_structure())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1202, in _invoke_input_fn_and_record_structure
self._batch_axis, host_device))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 918, in generate_per_host_enqueue_ops_fn_for_host
inputs = _Inputs.from_input_fn(input_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2036, in _input_fn
return input_fn(**kwargs)
File "resnet_main_webface_V01.py", line 255, in call
images, labels = dataset.make_one_shot_iterator().get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 330, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 866, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): End of sequence
[[Node: input_pipeline_task0/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/InfeedQueue/split/1"], output_shapes=[[1024,224,224,3], [1024]], output_types=[DT_FLOAT, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]

E0530 14:40:19.053837 140104481883904 tf_logging.py:106] Feed error: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 666, in _run_infeed
session.run(self._enqueue_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
OutOfRangeError: End of sequence
[[Node: input_pipeline_task0/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/InfeedQueue/split/1"], output_shapes=[[1024,224,224,3], [1024]], output_types=[DT_FLOAT, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]

Caused by op u'input_pipeline_task0/IteratorGetNext', defined at:
File "resnet_main_webface_V01.py", line 508, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "resnet_main_webface_V01.py", line 440, in main
input_fn=ImageNetInput(True), max_steps=next_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 793, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2065, in _model_fn
input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1149, in generate_infeed_enqueue_ops_and_dequeue_fn
self._invoke_input_fn_and_record_structure())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1202, in _invoke_input_fn_and_record_structure
self._batch_axis, host_device))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 918, in generate_per_host_enqueue_ops_fn_for_host
inputs = _Inputs.from_input_fn(input_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2036, in _input_fn
return input_fn(**kwargs)
File "resnet_main_webface_V01.py", line 255, in call
images, labels = dataset.make_one_shot_iterator().get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 330, in get_next
name=name)), self._output_types,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 866, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): End of sequence
[[Node: input_pipeline_task0/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/InfeedQueue/split/1"], output_shapes=[[1024,224,224,3], [1024]], output_types=[DT_FLOAT, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]]

E0530 14:40:19.054299 140104481883904 tf_logging.py:106] Closing session. A RuntimeError should follow.
I0530 14:40:25.495778 140105226974976 tf_logging.py:116] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: Socket closed
I0530 14:40:25.496537 140105226974976 tf_logging.py:116] Graph was finalized.
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 647, in _cancel_session
session.close()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 702, in close
tf_session.TF_CloseDeprecatedSession(self._session, status)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in exit
c_api.TF_GetCode(self.status.status))
UnavailableError: Socket closed
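
The "End of sequence" OutOfRangeError during infeed usually means the training dataset is exhausted before the requested number of batches has been enqueued; the stock ImageNet pipeline repeats indefinitely for training, and a custom dataset needs to do the same. Below is a minimal sketch with a placeholder path and a hypothetical parser, not the resnet_main_webface code.

import tensorflow as tf

def parse_fn(serialized):
  # Hypothetical parser; the real one decodes the WebFace records.
  features = tf.parse_single_example(
      serialized,
      {'image/encoded': tf.FixedLenFeature([], tf.string),
       'image/class/label': tf.FixedLenFeature([], tf.int64)})
  image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
  image = tf.image.resize_images(image, [224, 224])
  return image, tf.cast(features['image/class/label'], tf.int32)

def train_input_fn(params):
  batch_size = params['batch_size']
  files = tf.data.Dataset.list_files('gs://my-bucket/webface/train-*')
  dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
  dataset = dataset.map(parse_fn, num_parallel_calls=32)
  dataset = dataset.shuffle(1024)
  dataset = dataset.repeat()  # without this, the 100-batch infeed loops
                              # eventually hit "End of sequence"
  dataset = dataset.apply(
      tf.contrib.data.batch_and_drop_remainder(batch_size))
  return dataset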

GCP Auth issues with ctpu

Hi, I am unable to use ctpu through Cloud Shell; it worked the first time, then it stopped working.

akshayubhat@cloudshell:~ (dvatfrc)$ ctpu ls
2018/06/22 23:58:36 Error listing Cloud TPUs: googleapi: Error 403: Read access to project 'dvatfrc' was denied, forbidden
akshayubhat@cloudshell:~ (dvatfrc)$ ctpu --zone us-central-f ls
2018/06/22 23:58:50 Error listing Cloud TPUs: googleapi: Error 403: Read access to project 'dvatfrc' was denied, forbidden
akshayubhat@cloudshell:~ (dvatfrc)$ ctpu --zone us-central-f ls
2018/06/22 23:58:55 Error listing Cloud TPUs: googleapi: Error 403: Read access to project 'dvatfrc' was denied, forbidden
akshayubhat@cloudshell:~ (dvatfrc)$

It also won't use the gcloud config: e.g. I tried changing the region/zone, but it does not change when I view it using ctpu cfg. It's also not clear how to reset ctpu; restarting the VM did not fix this.

Attempt to access beyond input size: 4 >= 4

https://github.com/tensorflow/tpu-demos/blob/cb18fe2a4bacf4c8ef7685aebfbffb4550d5e938/cloud_tpu/models/resnet_garden/resnet_main.py#L204

I can't get parallel_interleave to work here. It gives this error:

InvalidArgumentError (see above for traceback): Attempt to access beyond input size: 4 >= 4
	In ParallelInterleaveDataset = ParallelInterleaveDataset[Targuments=[], f=tf_map_func_c72e772a[], output_shapes=[[]], output_types=[DT_STRING]](RepeatDataset:handle:0, ParallelInterleaveDataset/input_pipeline_task0/cycle_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/block_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/sloppy:output:0)
	 [[Node: input_pipeline_task0/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_737011ca[], output_shapes=[[1024,224,224,3], [1024,1001]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]()]]

The ImageNet data is coming from GCS and is definitely accessible.
It works fine if I replace the line with interleave (no parallel) and remove the apply, but I'm afraid that might slow things down significantly.

Note: my GPU version of this code is happy with parallel_interleave.
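
One likely cause of "Attempt to access beyond input size: 4 >= 4" is that the file pattern matches only four shards while parallel_interleave tries to open more cycles than a non-repeating file dataset can supply. A minimal sketch of the usual workaround, with placeholder paths and an assumed shard count:

import tensorflow as tf

file_pattern = 'gs://my-bucket/imagenet/train-*'  # placeholder path
num_files = 4                                     # assumed shard count

files = tf.data.Dataset.list_files(file_pattern)
files = files.shuffle(buffer_size=num_files).repeat()
dataset = files.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=min(num_files, 8),  # keep cycle_length <= shard count
        sloppy=True))

Resharding the data into more files (the standard ImageNet preprocessing produces 1024 training shards) is the other common remedy.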

Need Russell's contact info

Hi, Russell,
sorry to contact you this way.
I am the program committee chair of R/Pharma conference. You are highly recommended for giving a talk because of your work for openFDA. We have been trying to send you the invitation letter but failed with your idione email. Please could you email me at [email protected] or [email protected] to get the conversation started?

thank you,
Bella

Error occurred during infeed/outfeed when training Mobilenet on ImageNet

Hi, I get the following error when I try to run training of the TPU MobileNet model:

INFO:tensorflow:Using config: {'_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=1, per_host_input_for_training=True, tpu_job_name=None, initial_infeed_sleep_secs=None), '_save_checkpoints_secs': 1000, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc4d0afa1d0>, '_model_dir': '/home/simon_lee/imagenet_train', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': u'grpc://10.240.1.2:8470', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': u'grpc://10.240.1.2:8470', '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
INFO:tensorflow:Starting training cycle 0.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/losses/losses_impl.py:731: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

INFO:tensorflow:Using RMS optimizer
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Start infeed thread controller
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Start outfeed thread controller
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
WARNING:tensorflow:

Error occurred during infeed/outfeed.  This may be due to a compile error in the main session.  Waiting for a short time for the main session to come back.

File system scheme '[local]' not implemented (file: 'train-*')
         [[Node: TensorSliceDataset/input_pipeline_task0/MatchingFiles = MatchingFiles[](TensorSliceDataset/input_pipeline_task0/MatchingFiles/pattern)]]
         [[Node: input_pipeline_task0/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_a4fcf249[], output_shapes=[[160,160,160,3], [160]], output_types=[DT_FLOAT, DT_INT32], shared_name="", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]()]]
ERROR:tensorflow:Feed error: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 666, in _run_infeed
    session.run(self._enqueue_ops)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
UnimplementedError: File system scheme '[local]' not implemented (file: 'train-*')
         [[Node: TensorSliceDataset/input_pipeline_task0/MatchingFiles = MatchingFiles[](TensorSliceDataset/input_pipeline_task0/MatchingFiles/pattern)]]
         [[Node: input_pipeline_task0/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_a4fcf249[], output_shapes=[[160,160,160,3], [160]], output_types=[DT_FLOAT, DT_INT32], shared_name="", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]()]]

ERROR:tensorflow:Closing session.  A RuntimeError should follow.
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. Error: Socket closed
INFO:tensorflow:Graph was finalized.
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 647, in _cancel_session
    session.close()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 702, in close
    tf_session.TF_CloseDeprecatedSession(self._session, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
UnavailableError: Socket closed

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Start infeed thread controller
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Start outfeed thread controller
INFO:tensorflow:Starting outfeed thread controller.
WARNING:tensorflow:Feed error occurred, terminating session.

The command I ran:

python mobilenet.py \
--dataset_dir gs://imagenet \
--model_dir gs://imagenet_train \
--train_batch_size 160 \
--mode train_and_eval \
--depth_multiplier 0.50 \
--width 160 \
--height 160 \
--use_annotated_bbox True \
--num_classes=1001 \
--learning_rate 0.01 \
--tpu_name demo-tpu

I am also sure that the TPU can reach GCS, since the logs can be seen in gs://imagenet_train;
this was set up by adding the TPU service account as an IAM member with the roles "Storage Object Admin", "Log Writer", and "Viewer", as the documentation describes.

Here are the components I've installed:

Your current Cloud SDK version is: 194.0.0
The latest available version is: 194.0.0

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  Components                                                  │
├───────────────┬──────────────────────────────────────────────────────┬──────────────────────────┬───────────┤
│     Status    │                         Name                         │            ID            │    Size   │
├───────────────┼──────────────────────────────────────────────────────┼──────────────────────────┼───────────┤
│ Not Installed │ App Engine Go Extensions                             │ app-engine-go            │ 151.9 MiB │
│ Not Installed │ Cloud Bigtable Command Line Tool                     │ cbt                      │   4.5 MiB │
│ Not Installed │ Cloud Bigtable Emulator                              │ bigtable                 │   3.7 MiB │
│ Not Installed │ Cloud Datalab Command Line Tool                      │ datalab                  │   < 1 MiB │
│ Not Installed │ Cloud Datastore Emulator                             │ cloud-datastore-emulator │  17.9 MiB │
│ Not Installed │ Cloud Datastore Emulator (Legacy)                    │ gcd-emulator             │  38.1 MiB │
│ Not Installed │ Cloud Pub/Sub Emulator                               │ pubsub-emulator          │  33.4 MiB │
│ Not Installed │ Emulator Reverse Proxy                               │ emulator-reverse-proxy   │  14.5 MiB │
│ Not Installed │ Google Container Local Builder                       │ container-builder-local  │   3.8 MiB │
│ Not Installed │ Google Container Registry's Docker credential helper │ docker-credential-gcr    │   3.3 MiB │
│ Not Installed │ gcloud app Java Extensions                           │ app-engine-java          │ 118.9 MiB │
│ Not Installed │ gcloud app PHP Extensions                            │ app-engine-php           │           │
│ Not Installed │ gcloud app Python Extensions                         │ app-engine-python        │   6.2 MiB │
│ Not Installed │ gcloud app Python Extensions (Extra Libraries)       │ app-engine-python-extras │  27.8 MiB │
│ Not Installed │ kubectl                                              │ kubectl                  │  12.3 MiB │
│ Installed     │ BigQuery Command Line Tool                           │ bq                       │   < 1 MiB │
│ Installed     │ Cloud SDK Core Libraries                             │ core                     │   7.4 MiB │
│ Installed     │ Cloud Storage Command Line Tool                      │ gsutil                   │   3.4 MiB │
│ Installed     │ gcloud Alpha Commands                                │ alpha                    │   < 1 MiB │
│ Installed     │ gcloud Beta Commands                                 │ beta                     │   < 1 MiB │
└───────────────┴──────────────────────────────────────────────────────┴──────────────────────────┴───────────┘
To install or remove components at your current SDK version [194.0.0], run:
  $ gcloud components install COMPONENT_ID
  $ gcloud components remove COMPONENT_ID

To update your SDK installation to the latest version [194.0.0], run:
  $ gcloud components update

The ImageNet TFRecord files were produced with the TF-Slim script and work smoothly with Slim.
Any suggestions on this problem? Thank you.

Tensorboard not refreshing when checkpoints are saved

I came across a minor issue while using the TPU: I store checkpoints in a GCS bucket and use TensorBoard to visualise the training scalars, but the loss graphs are not updated automatically. I have to manually kill TensorBoard and restart it to see the new checkpoint information.

Can this be fixed?

Resnet-50 example not working

Hi, I've encountered some issues in the Resnet-50 example on a cloud TPU. I've already corrected them and will file a pull request for them in a few minutes. This issue is intended to describe the problem.

For reference I was running the resnet-50 example with a command:
ipython2 --pdb resnet_main.py -- --tpu_name demo-tpu-3 --data_dir gs://cloud-tpu-test-datasets/fake_imagenet --model_dir gs://cloud-tpu-checkpoint-bucket/resnet50

The first problem was in the handling of the tpu_name parameter: the script crashed because the code mistook the tpu_name value (a string) for an iterable of available TPU names and iterated over it.
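
A sketch of the kind of fix for this first problem (illustrative only, not the exact patch; it assumes FLAGS.tpu_name comes from the script's flag definitions and a TF version whose TPUClusterResolver accepts the tpu argument):

import tensorflow as tf

# Make sure a single TPU name reaches the resolver as a list,
# so a plain string is not iterated over character by character.
tpu_names = [FLAGS.tpu_name] if isinstance(FLAGS.tpu_name, str) else list(FLAGS.tpu_name)
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=tpu_names)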

The second problem was that the reshape operation on the global_step variable was operating on the int64 dtype, which is not supported on TPUs, so graph compilation for the TPU failed. I fixed it by casting the variable to int32 and then back to int64 on the CPU so that the summaries still work.
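
A minimal sketch of that second fix (variable and op names are illustrative, not the exact code from resnet_main.py): do the reshape on int32 inside the TPU graph and cast back to int64 on the host for the summaries.

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()  # int64 by default
gs_int32 = tf.cast(global_step, tf.int32)            # int32 is supported on the TPU
gs_reshaped = tf.reshape(gs_int32, [1])              # the reshape that failed on int64
with tf.device('/cpu:0'):
  gs_for_summary = tf.cast(gs_reshaped, tf.int64)    # back to int64 for host-side summaries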

After these corrections I'm able to successfully run the example.

Why does the ResNet code downloaded from here not work on the VM?

First I downloaded the tpu package, then transferred it to the VM.

The code that is already on the Cloud VM runs well, but the code from this package does not run on the server.

There are a lot of bugs; when I solve one, another one appears.

I know the code on the Cloud machine is different from GitHub. Why are they inconsistent?

xrange() was removed in Python 3

flake8 testing of https://github.com/tensorflow/tpu on Python 3.6.3

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./models/experimental/inception/inception_v3_old.py:256:18: F821 undefined name 'xrange'
        for i in xrange(0, 984)
                 ^
./models/official/densenet/densenet_model.py:160:16: F821 undefined name 'xrange'
      for j in xrange(depth):
               ^
2       F821 undefined name 'xrange'
2
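
One common way to address this (a sketch of the general pattern, not necessarily the patch the maintainers chose) is to fall back to the built-in range on Python 3, or to import a compatible name from six:

try:
  xrange          # Python 2: the name already exists
except NameError:
  xrange = range  # Python 3: alias the built-in range

# Alternatively: from six.moves import range as xrange

for j in xrange(5):
  print(j)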

`ctpu delete -tpu_only` doesn't seem to work

From reading the documentation, it seems like passing -tpu_only to delete should cause it to bring down the TPU but not the attached VM. This doesn't seem to happen; instead the VM is also brought down.

I'm using ctpu version 1.3, downloaded from https://dl.google.com/cloud_tpu/ctpu/latest/linux/ctpu.

mlucy@eve:~$ ctpu help delete
ctpu delete [--dry-run] [--tpu-only] [--wait-for-async-ops]
  -dry_run
        Do not make changes; print only what would have happened.
  -name string
        Override the name to use for VMs and TPUs (defaults to your username). (default "mlucy")
  -noconf
        Skip confirmation about deleting resources.
  -nowait
        Don't wait for asynchronous operations to complete (e.g. TPU deletion, Compute Engine VM halting)
  -project string
        Override the GCP project name to use when allocating VMs and TPUs.
               By default, ctpu picks a reasonable value from either your gcloud
               configuration, or the Compute Engine metadata. If a good value cannot be found, you
               will be required to provide a value on the command line.) (default "basilica-211201")
  -tpu_only
        Do not pause the Compute Engine VM, only pause the TPU (useful if you want to edit code on the VM without paying for the TPU).
  -zone string
        Override the Compute Engine zone to use when allocating & deallocating resources.
                By default, it picks a reasonable value from either your gcloud
                configuration, or the Compute Engine metadata. If a good value cannot be found, you
                will be required to provide a value on the command line.) (default "us-central1-b")
mlucy@eve:~$ ctpu delete -name basilica -tpu_only
ctpu will use the following configuration values:
        Name:          basilica
        Zone:          us-central1-b
        GCP Project:   basilica-211201
About to permanently delete your resources. Ok? [Yn]:
2018/07/30 21:15:33 Deleting Compute Engine VM "basilica"...
2018/07/30 21:15:33 Deleting TPU basilica...
All "delete" operations have been initiated successfully. They will
run to completion even if you kill ctpu (e.g. by pressing Ctrl-C). When the
operations have finished running, ctpu will exit. If you would like your shell
back, you can press Ctrl-C now. Note: Next time you run ctpu, you can pass the
--nowait flag to get your shell back immediately.
2018/07/30 21:15:38 Compute Engine operation still running...
2018/07/30 21:15:39 TPU operation still running...
2018/07/30 21:16:00 TPU operation still running...

In addition, there are a couple of quality-of-life problems that showed up in the process:

  • The documentation lists --tpu-only on the first line, which is not a real flag.
  • Other ctpu commands use - instead of _ as a word separator, compounding the confusion about the non-existent --tpu-only flag.
  • The -tpu_only documentation mentions "pausing" the VM, but it seems to be deleted rather than paused.
  • The prompt for ctpu delete says "About to permanently delete your resources. Ok?". It's a little unfortunate that it doesn't mention which resources are being deleted; that would have let me see beforehand that the command was also planning to bring down the VM.

Documentation typo

 def focal_loss(logits, targets, alpha, gamma, normalizer):
   """Compute the focal loss between `logits` and the golden `target` values.
   Focal loss = -(1-alpha)^gamma * log(pt)
   where pt is the probability of being classified to the true class.

should probably be

 def focal_loss(logits, targets, alpha, gamma, normalizer):
   """Compute the focal loss between `logits` and the golden `target` values.
   Focal loss = -alpha * (1 - pt) ^ gamma * log(pt)
   where pt is the probability of being classified to the true class.
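
For clarity, a minimal numeric sketch of the corrected formula, FL(pt) = -alpha * (1 - pt)^gamma * log(pt), written with plain TensorFlow ops (this is just the math, not the implementation used in the RetinaNet code):

import tensorflow as tf

def focal_loss_from_prob(pt, alpha=0.25, gamma=2.0):
  # pt is the predicted probability of the true class.
  return -alpha * tf.pow(1.0 - pt, gamma) * tf.log(pt)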

ResNet-50 error when exporting saved model

With the r1.8 branch running on a GCE Cloud TPU, the saved-model export step (https://github.com/tensorflow/tpu/blob/r1.8/models/official/resnet/resnet_main.py#L472) failed with the following error. Note that this occurs only when export_dir is provided.

Traceback (most recent call last):
  File "resnet_main.py", line 485, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "resnet_main.py", line 481, in main
    serving_input_receiver_fn=imagenet_input.image_serving_input_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 635, in export_savedmodel
    saver_for_restore.restore(session, checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1802, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [7,7,224,64] rhs shape= [7,7,3,64]
         [[Node: save/Assign_212 = Assign[T=DT_FLOAT, _class=["loc:@bfloat16/conv2d/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bfloat16/conv2d/kernel, save/RestoreV2:212)]]

Caused by op u'save/Assign_212', defined at:
  File "resnet_main.py", line 485, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "resnet_main.py", line 481, in main
    serving_input_receiver_fn=imagenet_input.image_serving_input_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 634, in export_savedmodel
    sharded=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1338, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1347, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1384, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 829, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 525, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 494, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 185, in restore
    self.op.get_shape().is_fully_defined())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/state_ops.py", line 283, in assign
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 60, in assign
    use_locking=use_locking, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [7,7,224,64] rhs shape= [7,7,3,64]
         [[Node: save/Assign_212 = Assign[T=DT_FLOAT, _class=["loc:@bfloat16/conv2d/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bfloat16/conv2d/kernel, save/RestoreV2:212)]]

Resnet tutorial with `imagenet_to_gcs.py` doesn't appear to work.

I'm attempting to follow https://cloud.google.com/tpu/docs/tutorials/resnet. When I try to use the full ImageNet dataset, prepared with the provided imagenet_to_gcs.py script, I get an error (pasted below).

The error below is with TensorFlow 1.9. I also tried 1.8 and got a different error, which seems unrelated but which I have also pasted below for posterity.

The results of the imagenet_to_gcs.py script are at gs://basilica/data/imagenet, if that's helpful (should be world-readable for now).

The command I'm running:

python resnet_main.py --tpu=basilica --data_dir=gs://basilica/data/imagenet --model_dir=gs://basilica/tst

This is the 1.9 error:

W0727 03:45:33.424566 140200254113536 __init__.py:44] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
W0727 03:45:33.622759 140200254113536 tf_logging.py:125] Estimator's model_fn (<function resnet_model_fn at 0x7f82cd0518c0>) includes params argument, but params are not passed to Estimator.
I0727 03:45:33.623796 140200254113536 tf_logging.py:115] Using config: {'_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      value: "10.240.1.2:8470"
    }
  }
}
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f82cc71dbd0>, '_model_d\
ir': 'gs://basilica/tst', '_save_checkpoints_steps': 600, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=8,\
 computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_cluster': <tensorflo\
w.contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7f82cd0536d0>, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': None, '_evaluation_master'\
: u'grpc://10.240.1.2:8470', '_global_id_in_cluster': 0, '_master': u'grpc://10.240.1.2:8470'}
I0727 03:45:33.624113 140200254113536 tf_logging.py:115] _TPUContext: eval_on_tpu True
I0727 03:45:33.624320 140200254113536 tf_logging.py:115] Precision: bfloat16
I0727 03:45:34.045315 140200254113536 tf_logging.py:115] Training for 112603 steps (90.00 epochs in total). Current step 0.
I0727 03:45:34.196645 140200254113536 tf_logging.py:115] Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
2018-07-27 03:45:34.203501: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session \
has not yet been created.
I0727 03:45:34.290231 140200254113536 tf_logging.py:115] Found TPU system:
I0727 03:45:34.290575 140200254113536 tf_logging.py:115] *** Num TPU Cores: 8
I0727 03:45:34.290924 140200254113536 tf_logging.py:115] *** Num TPU Workers: 1
I0727 03:45:34.290985 140200254113536 tf_logging.py:115] *** Num TPU Cores Per Worker: 8
I0727 03:45:34.291043 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1)
I0727 03:45:34.291238 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184)
I0727 03:45:34.291296 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184)
I0727 03:45:34.291357 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184)
I0727 03:45:34.291414 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184)
I0727 03:45:34.291465 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184)
I0727 03:45:34.291536 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184)
I0727 03:45:34.291625 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184)
I0727 03:45:34.291692 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184)
I0727 03:45:34.291764 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184)
I0727 03:45:34.291825 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184)
I0727 03:45:34.291878 140200254113536 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184)
I0727 03:45:34.306799 140200254113536 tf_logging.py:115] Calling model_fn.
I0727 03:45:45.303158 140200254113536 tf_logging.py:115] Create CheckpointSaverHook.
I0727 03:45:45.567073 140200254113536 tf_logging.py:115] Done calling model_fn.
I0727 03:45:48.642914 140200254113536 tf_logging.py:115] TPU job name worker
I0727 03:45:49.806272 140200254113536 tf_logging.py:115] Graph was finalized.
I0727 03:45:52.023736 140200254113536 tf_logging.py:115] Running local_init_op.
I0727 03:45:52.160645 140200254113536 tf_logging.py:115] Done running local_init_op.
I0727 03:45:59.174904 140200254113536 tf_logging.py:115] Saving checkpoints for 0 into gs://basilica/tst/model.ckpt.
I0727 03:46:06.400580 140200254113536 tf_logging.py:115] Installing graceful shutdown hook.
2018-07-27 03:46:06.401149: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session \
has not yet been created.
I0727 03:46:06.429027 140200254113536 tf_logging.py:115] Creating heartbeat manager for ['/job:tpu_worker/replica:0/task:0/device:CPU:0', '/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0']
W0727 03:46:06.478003 140200254113536 tf_logging.py:120] Worker heartbeats not supported by all workers.  No failure handling will be enabled.
I0727 03:46:06.478421 140200254113536 tf_logging.py:115] Init TPU system
I0727 03:46:12.805318 140199236466432 tf_logging.py:115] Starting infeed thread controller.
I0727 03:46:12.806582 140199227746048 tf_logging.py:115] Starting outfeed thread controller.
I0727 03:46:12.936597 140200254113536 tf_logging.py:115] Enqueue next (100) batch(es) of data to infeed.
I0727 03:46:12.937171 140200254113536 tf_logging.py:115] Dequeue next (100) batch(es) of data from outfeed.
W0727 03:51:35.339397 140199236466432 tf_logging.py:125]

Error occurred during infeed/outfeed.  This may be due to a compile error in the main session.  Waiting for a short time for the main session to come back.

End of sequence
         [[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/enqueue/0"], output_shapes=[[224,224,3,128], [128]], output_types=[DT_BF\
LOAT16, DT_INT32], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]

Caused by op u'input_pipeline_task0/while/IteratorGetNext', defined at:
  File "resnet_main.py", line 506, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "resnet_main.py", line 480, in main
    input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1132, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1992, in _call_model_fn
    features, labels, mode, config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1107, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2212, in _model_fn
    input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1001, in generate_infeed_enqueue_ops_and_dequeue_fn
    self._invoke_input_fn_and_record_structure())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1087, in _invoke_input_fn_and_record_structure
    wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2586, in _wrap_computation_in_while_loop
    parallel_iterations=1)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3209, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2941, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2878, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2575, in computation
    with ops.control_dependencies(op_fn()):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 794, in enqueue_ops_fn
    features, labels = inputs.features_and_labels()  # Calls get_next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2766, in features_and_labels
    return _Inputs._parse_inputs(self._iterator.get_next())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
    name=name)), self._output_types,
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1745, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

OutOfRangeError (see above for traceback): End of sequence
         [[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/enqueue/0"], output_shapes=[[224,224,3,128], [128]], output_types=[DT_BF\
LOAT16, DT_INT32], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]

E0727 03:51:40.460005 140199219025664 tf_logging.py:105] Feed error: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 434, in _run_infeed
    session.run(self._enqueue_ops)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
OutOfRangeError: End of sequence
         [[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/enqueue/0"], output_shapes=[[224,224,3,128], [128]], output_types=[DT_BF\
LOAT16, DT_INT32], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]

Caused by op u'input_pipeline_task0/while/IteratorGetNext', defined at:
  File "resnet_main.py", line 506, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "resnet_main.py", line 480, in main
    input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1132, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1992, in _call_model_fn
    features, labels, mode, config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1107, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2212, in _model_fn
    input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1001, in generate_infeed_enqueue_ops_and_dequeue_fn
    self._invoke_input_fn_and_record_structure())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1087, in _invoke_input_fn_and_record_structure
    wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2586, in _wrap_computation_in_while_loop
    parallel_iterations=1)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3209, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2941, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2878, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2575, in computation
    with ops.control_dependencies(op_fn()):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 794, in enqueue_ops_fn
    features, labels = inputs.features_and_labels()  # Calls get_next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2766, in features_and_labels
    return _Inputs._parse_inputs(self._iterator.get_next())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
    name=name)), self._output_types,
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1745, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

OutOfRangeError (see above for traceback): End of sequence
         [[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/enqueue/0"], output_shapes=[[224,224,3,128], [128]], output_types=[DT_BF\
LOAT16, DT_INT32], _device="/job:worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]


E0727 03:51:40.460639 140199219025664 tf_logging.py:105] Closing session.  A RuntimeError should follow.
W0727 03:52:01.983748 140199227746048 tf_logging.py:125]

Error occurred during infeed/outfeed.  This may be due to a compile error in the main session.  Waiting for a short time for the main session to come back.

Step was cancelled by an explicit call to `Session::Close()`.
Traceback (most recent call last):
  File "resnet_main.py", line 506, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "resnet_main.py", line 480, in main
    input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1135, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1336, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 577, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1053, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1144, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1129, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1201, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 981, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to `Session::Close()`.

This is the 1.8 error:

W0727 04:13:10.758173 140606586783552 tf_logging.py:125] Estimator's model_fn (<function resnet_model_fn at 0x7fe174596050>) includes params argument, but params are not passed to Estimator.
I0727 04:13:10.759254 140606586783552 tf_logging.py:115] Using config: {'_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      value: "10.0.16.122:8470" 
    }
  }
}
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe174597210>, '_model_di
r': 'gs://basilica/tst2', '_save_checkpoints_steps': 600, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=8, 
computation_shape=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_cluster': <tensorflow.
contrib.cluster_resolver.python.training.tpu_cluster_resolver.TPUClusterResolver object at 0x7fe174597090>, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': None, '_evaluation_master': '
grpc://10.0.16.122:8470', '_global_id_in_cluster': 0, '_master': 'grpc://10.0.16.122:8470'}
I0727 04:13:10.759718 140606586783552 tf_logging.py:115] _TPUContext: eval_on_tpu True
I0727 04:13:10.759957 140606586783552 tf_logging.py:115] Precision: bfloat16
I0727 04:13:10.900378 140606586783552 tf_logging.py:115] Training for 112603 steps (90.00 epochs in total). Current step 0.
I0727 04:13:11.008833 140606586783552 tf_logging.py:115] Querying Tensorflow master (grpc://10.0.16.122:8470) for TPU system metadata.
2018-07-27 04:13:11.010382: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session h
as not yet been created.
I0727 04:13:11.118653 140606586783552 tf_logging.py:115] Found TPU system:
I0727 04:13:11.119147 140606586783552 tf_logging.py:115] *** Num TPU Cores: 8
I0727 04:13:11.119513 140606586783552 tf_logging.py:115] *** Num TPU Workers: 1
I0727 04:13:11.119638 140606586783552 tf_logging.py:115] *** Num TPU Cores Per Worker: 8
I0727 04:13:11.119782 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1)
I0727 04:13:11.120002 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184)
I0727 04:13:11.120161 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184)
I0727 04:13:11.120342 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184)
I0727 04:13:11.120484 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184)
I0727 04:13:11.120628 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184)
I0727 04:13:11.120737 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184)
I0727 04:13:11.120873 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184)
I0727 04:13:11.120979 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184)
I0727 04:13:11.121112 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184)
I0727 04:13:11.121259 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184)
I0727 04:13:11.121380 140606586783552 tf_logging.py:115] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184)
I0727 04:13:11.136253 140606586783552 tf_logging.py:115] Calling model_fn.
I0727 04:13:20.833169 140606586783552 tf_logging.py:115] Create CheckpointSaverHook.
I0727 04:13:21.049246 140606586783552 tf_logging.py:115] Done calling model_fn.
I0727 04:13:24.118320 140606586783552 tf_logging.py:115] TPU job name worker
I0727 04:13:25.411591 140606586783552 tf_logging.py:115] Graph was finalized.
Traceback (most recent call last):
  File "/tensorflow_tpu_models/models/official/resnet/resnet_main.py", line 506, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/tensorflow_tpu_models/models/official/resnet/resnet_main.py", line 480, in main
    input_fn=imagenet_train.input_fn, max_steps=next_checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1135, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1333, in _train_with_estimator_spec
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 415, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 826, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 549, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1012, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1017, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 706, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 287, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'MapAndBatchDatasetV2' in binary running on n-a432ad87-w-0. Make sure the Op and Kernel are registered in the binary running in this process.

[retinanet] Error when training custom dataset

Hi,

I'm trying to train retinanet on my own dataset.
The TFRecords were already generated for the Object Detection API, which apparently uses the same tf_example format used here.

Any help is appreciated.

See the traceback below:

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:TPU job name tpu_worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Start infeed thread controller
INFO:tensorflow:Start outfeed thread controller
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
Traceback (most recent call last):
  File "tpu/models/official/retinanet/retinanet_main.py", line 244, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "tpu/models/official/retinanet/retinanet_main.py", line 160, in main
    FLAGS.train_batch_size))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 891, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1170, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 950, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Gradient for resnet50/batch_normalization_50/gamma:0 is NaN : Tensor had NaN values
	 [[Node: CheckNumerics_151 = CheckNumerics[T=DT_FLOAT, message="Gradient for resnet50/batch_normalization_50/gamma:0 is NaN", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](Identity_577)]]

Caused by op u'CheckNumerics_151', defined at:
  File "tpu/models/official/retinanet/retinanet_main.py", line 244, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "tpu/models/official/retinanet/retinanet_main.py", line 160, in main
    FLAGS.train_batch_size))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 793, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2091, in _model_fn
    update_ops = _sync_variables_ops()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 129, in _sync_variables_ops
    for v in variables.trainable_variables()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 498, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Gradient for resnet50/batch_normalization_50/gamma:0 is NaN : Tensor had NaN values
	 [[Node: CheckNumerics_151 = CheckNumerics[T=DT_FLOAT, message="Gradient for resnet50/batch_normalization_50/gamma:0 is NaN", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](Identity_577)]]

What is AmoebaNet-D?

I checked the paper and searched Google, but I could not find any information about AmoebaNet-D. The paper only mentions AmoebaNet-A, AmoebaNet-B, and AmoebaNet-C. What is AmoebaNet-D?

""No OpKernel was registered" issue when tried to run resnet_model.py without TPU

Using TensorFlow v1.7rc1, I tried to run tpu/models/experimental/resnet_bfloat16/resnet_model.py on a GPU (without a TPU) and got a "No OpKernel was registered" error.

command to run:
python3 resnet_main.py --use_tpu=False --data_dir=/home/ubuntu/imagenet/train-480px --model_dir=~/model_dir

error info:
InvalidArgumentError (see above for traceback): No OpKernel was registered to support Op 'Conv2D' with these attrs. Registered devices: [CPU,GPU], Registered kernels:
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_HALF]
device='GPU'; T in [DT_FLOAT]
device='GPU'; T in [DT_HALF]

 [[Node: cg/conv2d/Conv2D = Conv2D[T=DT_BFLOAT16, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true](cg/Pad, Cast)]]
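
One workaround (my assumption about the cause, not a confirmed fix from the maintainers) is to keep the bfloat16 cast only on the TPU path, since in this TF version the GPU/CPU Conv2D kernels are registered only for float32/float16:

import tensorflow as tf

def cast_for_device(images, use_tpu):
  # TPU path keeps the bfloat16 pipeline; GPU/CPU path stays in float32.
  return tf.cast(images, tf.bfloat16 if use_tpu else tf.float32)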

InvalidArgumentError: Cannot assign a device for operation

Hi guys,

I am new to TPUs and I ran into an error when trying to train a CNN (on a TPU) on my own dataset. I am able to run the MNIST example, and my code runs with --use_tpu=False, but when I set --use_tpu=True I get an InvalidArgumentError.

I. System
The tf-1-7 image, as mentioned in the tutorial.

II. Error

2018-04-17 06:35:31.602915: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1310, in _run_fn
self._extend_graph()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1358, in _extend_graph
graph_def.SerializeToString(), status)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'input_pipeline_task0/IteratorToStringHandle': Operation was explicitly assigned to /job:tpu_worker/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[Node: input_pipeline_task0/IteratorToStringHandle = IteratorToStringHandle_device="/job:tpu_worker/task:0/device:CPU:0"]]

My guess is that this is caused by the way I feed in the data. Currently I use pickle to load the data on the local machine and then from_tensor_slices to build the dataset. I tried saving the data in Google Cloud Storage, but then it reports that the file cannot be found.

Any solutions, or a better way to feed in the data?
Many thanks!
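
For what it's worth, a minimal sketch of the usual TPU input pattern (my suggestion, not from the issue; the GCS path, feature names, and image shape are placeholders): write the data out as TFRecord files in a GCS bucket and read them with tf.data inside input_fn.

import tensorflow as tf

def input_fn(params):
  batch_size = params['batch_size']  # TPUEstimator passes the per-shard batch size here

  def parse(serialized):
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.decode_raw(features['image'], tf.uint8)
    image = tf.cast(tf.reshape(image, [32, 32, 3]), tf.float32)
    label = tf.cast(features['label'], tf.int32)
    return image, label

  files = tf.data.Dataset.list_files('gs://my-bucket/data/train-*.tfrecord')
  dataset = files.flat_map(tf.data.TFRecordDataset)
  dataset = dataset.map(parse).repeat().shuffle(1024)
  # TPUs need fixed batch shapes, so drop the last partial batch.
  dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))
  return dataset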

Amoebanet crashes with custom cell

I ran the amoeba_net code following the exact steps in the TPU docs and it ran successfully. However, when I replace the cell in model_specs.py with my own cell, I get the following error:

I0720 20:39:32.427979 140083899139840 tf_logging.py:115] number of flops: 7344781699.86
I0720 20:39:32.443113 140083899139840 tf_logging.py:115] number of trainable params: 131381594
I0720 20:39:32.454080 140083899139840 tf_logging.py:115] Learning rate warmup_steps: 15012
I0720 20:39:32.466685 140083899139840 tf_logging.py:115] Using RMSProp optimizer
I0720 20:51:53.215569 140083899139840 tf_logging.py:115] Create CheckpointSaverHook.
I0720 20:51:55.015335 140083899139840 tf_logging.py:115] Done calling model_fn.
I0720 20:52:42.132077 140083899139840 tf_logging.py:115] TPU job name worker
I0720 20:53:01.290350 140083899139840 tf_logging.py:115] Graph was finalized.
I0720 20:55:54.060384 140083899139840 tf_logging.py:115] Running local_init_op.
I0720 20:55:59.781912 140083899139840 tf_logging.py:115] Done running local_init_op.
terminate called after throwing an instance of 'std::bad_alloc'                                                                                                                         
  what():  std::bad_alloc                                                                                   
Aborted
For comparison, here is the output of the same steps with AmoebaNet-D:
I0720 19:54:57.643646 140361836988160 tf_logging.py:115] number of flops: 4736649009.08
I0720 19:54:57.650213 140361836988160 tf_logging.py:115] number of trainable params: 84812042
I0720 19:54:57.660203 140361836988160 tf_logging.py:115] Learning rate warmup_steps: 15012
I0720 19:54:57.671920 140361836988160 tf_logging.py:115] Using RMSProp optimizer
I0720 19:57:06.877630 140361836988160 tf_logging.py:115] Create CheckpointSaverHook.
I0720 19:57:07.767883 140361836988160 tf_logging.py:115] Done calling model_fn.
I0720 19:57:25.219652 140361836988160 tf_logging.py:115] TPU job name worker
I0720 19:57:32.186690 140361836988160 tf_logging.py:115] Graph was finalized.
I0720 19:58:10.312223 140361836988160 tf_logging.py:115] Running local_init_op.
I0720 19:58:12.237303 140361836988160 tf_logging.py:115] Done running local_init_op.
I0720 19:59:15.582144 140361836988160 tf_logging.py:115] Saving checkpoints for 0 into gs://metalearning/amoeba_net/model.ckpt.
I0720 19:59:34.998261 140361836988160 tf_logging.py:115] gs://metalearning/amoeba_net/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
I0720 19:59:59.787643 140361836988160 tf_logging.py:115] Installing graceful shutdown hook.
2018-07-20 19:59:59.788119: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
I0720 19:59:59.806355 140361836988160 tf_logging.py:115] Creating heartbeat manager for ['/job:tpu_worker/replica:0/task:0/device:CPU:0', '/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0']
W0720 19:59:59.811011 140361836988160 tf_logging.py:120] Worker heartbeats not supported by all workers.  No failure handling will be enabled.
I0720 19:59:59.811201 140361836988160 tf_logging.py:115] Init TPU system
I0720 20:00:10.622927 140360158275328 tf_logging.py:115] Starting infeed thread controller.
I0720 20:00:10.623882 140360149882624 tf_logging.py:115] Starting outfeed thread controller.
I0720 20:00:12.328821 140361836988160 tf_logging.py:115] Enqueue next (500) batch(es) of data to infeed.
I0720 20:00:12.329307 140361836988160 tf_logging.py:115] Dequeue next (500) batch(es) of data from outfeed.
I0720 20:33:14.006546 140361836988160 tf_logging.py:115] Saving checkpoints for 500 into gs://metalearning/amoeba_net/model.ckpt.
I0720 20:33:38.237175 140361836988160 tf_logging.py:115] gs://metalearning/amoeba_net/model.ckpt-500 is not in all_model_checkpoint_paths. Manually adding it.
I0720 20:33:54.678931 140361836988160 tf_logging.py:115] loss = 7.4019566, step = 500

The cell we are using is:
Normal cell:
  elif cell_name == 'custom_net_0':
    operations = ['separable_7x7_2', '3x3', 'avg_pool_3x3', '3x3', '1x1', 'separable_5x5_2', 'avg_pool_3x3', '3x3', '3x3', '1x7_7x1']
    hiddenstate_indices= [1, 1, 1, 1, 0, 2, 0, 1, 0, 0]
    used_hiddenstates= [1, 1, 1, 0, 0, 0, 0]

Reduction cell:
  elif cell_name == 'custom_net_0':
    operations = ['1x3_3x1', '1x7_7x1', '3x3', '1x1', 'separable_7x7_2', '1x1', 'separable_3x3_2', 'separable_3x3_2', '3x3', 'separable_5x5_2']
    hiddenstate_indices= [1, 1, 1, 0, 0, 1, 2, 1, 0, 0]
    used_hiddenstates= [1, 1, 1, 0, 0, 0, 0]

[Retinanet Training] TypeError: Parameter does not match pattern

I ran this code on a TPU using Python 3.5 and TensorFlow 1.9.
I was using the RetinaNet TPU tutorial as a template, substituting in my own data.

This was my input:

RESNET_CHECKPOINT=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-02-07/model.ckpt-112603
STORAGE_BUCKET=gs://uxml_store
MODEL_DIR=${STORAGE_BUCKET}/retinanet

sudo pip3 install cython matplotlib pycocotools
sudo apt-get install python3-tk

python3 tpu/models/official/retinanet/retinanet_main.py \
 --tpu=retinanet-tpu \
 --train_batch_size=8 \
 --training_file_pattern=${STORAGE_BUCKET}/tfrecords/train.tfrecord \
 --resnet_checkpoint=${RESNET_CHECKPOINT} \
 --model_dir=${MODEL_DIR} \
 --hparams=image_size=640 \
 --num_examples_per_epoch=6400 \
 --num_epochs=1

I'm encountering the following error.

Traceback (most recent call last):
  File "tpu/models/official/retinanet/retinanet_main.py", line 296, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "tpu/models/official/retinanet/retinanet_main.py", line 107, in main
    tpu_grpc_url = tpu_cluster_resolver.get_master()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 225, in get_master
    return self.master()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 218, in master
    job_tasks = self.cluster_spec().job_tasks(self._job_name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 255, in cluster_spec
    request = self._service.projects().locations().nodes().get(name=full_name)
  File "/usr/local/lib/python3.5/dist-packages/googleapiclient/discovery.py", line 742, in method
    (name, pvalue, regex))
TypeError: Parameter "name" value "projects//locations/us-east1-b/nodes/retinanet-tpu" does not match the pattern "^projects/[^/]+/locations/[^/]+/nodes/[^/]+$"

I'm guessing this has something to do with the project section of the name parameter being blank (the "projects//locations" part), but I can't find any way to change this. Any ideas?
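A hedged sketch of one way around this (not confirmed as the thread's resolution): TPUClusterResolver builds that node name from whatever GCP project it can discover, so supplying the project and zone explicitly (or setting the default project with gcloud config set project) normally fills in the empty projects// segment. The project id below is a placeholder:

import tensorflow as tf

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    tpu='retinanet-tpu',
    zone='us-east1-b',
    project='my-gcp-project')   # hypothetical project id
tpu_grpc_url = tpu_cluster_resolver.get_master()

If the training script exposes project/zone flags, passing the same values on the command line should have the equivalent effect.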

Ways to freeze RetinaNet to a .pb file?

Is there a way to freeze a RetinaNet checkpoint to a .pb file for inference after it has been trained? From my limited knowledge, there are two ways to convert a checkpoint to a .pb file in TF, and neither works for the trained RetinaNet model.

  1. Use the freeze_graph tool from TensorFlow, as described here (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/freeze_graph.py). However, this command requires specifying the output_node_names parameter, which is hard to determine for RetinaNet by analyzing its graph or by using the summarize_graph tool (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms#inspecting-graphs); summarize_graph reports over 1,000 possible names.

  2. Use the export_inference_graph tool provided by the Object Detection API (https://github.com/tensorflow/models/blob/master/research/object_detection/export_inference_graph.py), which requires a model definition that does not yet exist for RetinaNet.

So my question is - what's the best way to freeze the trained RetinaNet model to a .pb file for further inference?
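As a hedged sketch rather than an official recipe: once the output node names are known, the generic TensorFlow pattern for freezing any checkpoint still applies — import the inference graph, restore the checkpoint, and replace variables with constants. A TPU-trained checkpoint may carry TPU-specific ops in its meta graph, so this may need a CPU-built inference graph instead; the node names below are placeholders, not RetinaNet's actual outputs:

import tensorflow as tf

def freeze(checkpoint_path, output_pb, output_node_names):
  with tf.Session(graph=tf.Graph()) as sess:
    # Rebuild the graph from the checkpoint's meta file and restore weights.
    saver = tf.train.import_meta_graph(checkpoint_path + '.meta')
    saver.restore(sess, checkpoint_path)
    # Fold variables into constants so the graph is self-contained.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names)
  with tf.gfile.GFile(output_pb, 'wb') as f:
    f.write(frozen.SerializeToString())

# Hypothetical usage -- the output node names must be read off the actual graph:
# freeze('model.ckpt-25000', 'retinanet.pb', ['class_outputs', 'box_outputs'])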

UnavailableError: OS Error on Retinanet

Hi guys,

I encounter an error when I try to train RetinaNet (on a TPU) on the COCO dataset. I am following the https://cloud.google.com/tpu/docs/tutorials/retinanet tutorial.

I System

The tf-1-7 image, as mentioned in the tutorial.

II Error

When I launch the training, I get the following error:

 python tpu/models/official/retinanet/retinanet_main.py \
 --master=${GRPC_SERVER} \
 --train_batch_size=64 \
 --training_file_pattern="${GCS_BUCKET}/coco/train-*" \
 --resnet_checkpoint=${RESNET_CHECKPOINT} \
 --model_dir=${MODEL_DIR} \
 --hparams=image_size=640 \
 --num_examples_per_epoch=6400 \
 --num_epochs=1


WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
Traceback (most recent call last):
  File "tpu/models/official/retinanet/retinanet_main.py", line 247, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "tpu/models/official/retinanet/retinanet_main.py", line 117, in main
    tf.Session.reset(tpu_grpc_url)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1584, in reset
    tf_session.TF_Reset(target, containers, config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 1518, in TF_Reset
    TF_Reset_wrapper(opts, containers, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

Any idea?
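A hedged debugging sketch (not a confirmed fix): tf.Session.reset fails with UnavailableError when the gRPC endpoint cannot be reached, so it is worth verifying that GRPC_SERVER has the form grpc://<TPU internal IP>:8470 and that a plain session can connect to it. The address below is a placeholder:

import tensorflow as tf

GRPC_SERVER = 'grpc://10.240.1.2:8470'   # placeholder; use your TPU's internal IP

# If this session cannot be created or list_devices() fails, the training
# script will fail the same way at tf.Session.reset(tpu_grpc_url).
with tf.Session(GRPC_SERVER) as sess:
  for device in sess.list_devices():
    print(device.name)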

ResNet50 - Training on custom data

The data input pipeline follows the ImageNet one.
For training, I just changed the dataset constants in resnet_main.py.

However, I came across the following problem. How can I solve it? Thank you so much.

I0530 13:10:29.993745 140350547060480 tf_logging.py:116] Init TPU system
Traceback (most recent call last):
  File "resnet_main_webface_V01.py", line 508, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "resnet_main_webface_V01.py", line 440, in main
    input_fn=ImageNetInput(True), max_steps=next_checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 888, in _train_model
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 384, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 795, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 518, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 981, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 986, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 681, in create_session
    hook.after_create_session(self.tf_sess, self.coord)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 691, in after_create_session
    options=config_pb2.RunOptions(timeout_in_ms=5 * 60 * 1000))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DeadlineExceededError: Deadline Exceeded
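A hedged note rather than a diagnosis: DeadlineExceeded during "Init TPU system" means the TPU worker did not respond within the five-minute timeout, which often traces back to a wrong master address or to input files the TPU host cannot read. When adapting the ImageNet pipeline, the dataset constants and the data location both need to match the new data; the names and values below are illustrative placeholders, not the exact constants in resnet_main.py:

# Illustrative placeholders only -- adjust to your own dataset and bucket.
NUM_TRAIN_IMAGES = 450000      # number of examples in the custom training set
NUM_EVAL_IMAGES = 45000        # number of examples in the custom eval set
NUM_LABEL_CLASSES = 10575      # number of classes in the custom dataset

# The data must live on GCS so the TPU worker host can read it;
# a local path on the VM will not be visible to the TPU worker.
DATA_DIR = 'gs://my-bucket/webface/tfrecords'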
