
xla's Introduction

PyTorch/XLA

Current CI status: GitHub Actions status

PyTorch/XLA is a Python package that uses the XLA deep learning compiler to connect the PyTorch deep learning framework and Cloud TPUs. You can try it right now, for free, on a single Cloud TPU VM with Kaggle!

Take a look at one of our Kaggle notebooks to get started:

Getting Started

PyTorch/XLA is now on PyPI!

To install PyTorch/XLA on a new TPU VM:

pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 -f https://storage.googleapis.com/libtpu-releases/index.html
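After installation, a quick sanity check (a minimal sketch, assuming the package installed correctly and the TPU runtime is reachable) is to confirm that an XLA device is visible:

import torch
import torch_xla.core.xla_model as xm

# Should print an XLA device such as xla:0 when the runtime is available.
print(xm.xla_device())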

To update your existing training loop, make the following changes:

-import torch.multiprocessing as mp
+import torch_xla as xla
+import torch_xla.core.xla_model as xm
+import torch_xla.distributed.xla_multiprocessing as xmp

 def _mp_fn(index):
   ...

+  # Move the model parameters to your XLA device
+  model.to(xla.device())

   for inputs, labels in train_loader:
+    with xla.step():
+      # Transfer data to the XLA device. This happens asynchronously.
+      inputs, labels = inputs.to(xla.device()), labels.to(xla.device())
       optimizer.zero_grad()
       outputs = model(inputs)
       loss = loss_fn(outputs, labels)
       loss.backward()
-      optimizer.step()
+      # `xm.optimizer_step` combines gradients across replicas
+      xm.optimizer_step(optimizer)

 if __name__ == '__main__':
-  mp.spawn(_mp_fn, args=(), nprocs=world_size)
+  # xmp.spawn automatically selects the correct world size
+  xmp.spawn(_mp_fn, args=())
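Putting the pieces above together, a minimal end-to-end sketch might look like the following. The linear model, SGD optimizer, and synthetic dataset are placeholders for illustration; only the torch_xla calls mirror the diff above.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

import torch_xla as xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
  device = xla.device()

  # Toy model and synthetic data, just to make the example self-contained.
  model = nn.Linear(16, 2).to(device)
  optimizer = optim.SGD(model.parameters(), lr=0.01)
  loss_fn = nn.CrossEntropyLoss()
  train_loader = DataLoader(
      TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,))),
      batch_size=32)

  for inputs, labels in train_loader:
    with xla.step():
      # Transfer data to the XLA device. This happens asynchronously.
      inputs, labels = inputs.to(device), labels.to(device)
      optimizer.zero_grad()
      outputs = model(inputs)
      loss = loss_fn(outputs, labels)
      loss.backward()
      # `xm.optimizer_step` combines gradients across replicas.
      xm.optimizer_step(optimizer)

if __name__ == '__main__':
  # xmp.spawn automatically selects the correct world size.
  xmp.spawn(_mp_fn, args=())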

If you're using DistributedDataParallel, make the following changes:

 import torch.distributed as dist
-import torch.multiprocessing as mp
+import torch_xla as xla
+import torch_xla.distributed.xla_multiprocessing as xmp
+import torch_xla.distributed.xla_backend

 def _mp_fn(rank):
   ...

-  os.environ['MASTER_ADDR'] = 'localhost'
-  os.environ['MASTER_PORT'] = '12355'
-  dist.init_process_group("gloo", rank=rank, world_size=world_size)
+  # Rank and world size are inferred from the XLA device runtime
+  dist.init_process_group("xla", init_method='xla://')
+
+  model.to(xm.xla_device())
+  # `gradient_as_bucket_view=True` required for XLA
+  ddp_model = DDP(model, gradient_as_bucket_view=True)

-  model = model.to(rank)
-  ddp_model = DDP(model, device_ids=[rank])

   for inputs, labels in train_loader:
+    with xla.step():
+      inputs, labels = inputs.to(xla.device()), labels.to(xla.device())
       optimizer.zero_grad()
       outputs = ddp_model(inputs)
       loss = loss_fn(outputs, labels)
       loss.backward()
       optimizer.step()

 if __name__ == '__main__':
-  mp.spawn(_mp_fn, args=(), nprocs=world_size)
+  xmp.spawn(_mp_fn, args=())
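As before, here is a minimal self-contained sketch of the DistributedDataParallel variant; the toy model and synthetic data are illustrative, and only the torch_xla-specific calls come from the diff above.

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

import torch_xla as xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.distributed.xla_backend  # registers the "xla" process group backend

def _mp_fn(rank):
  # Rank and world size are inferred from the XLA device runtime.
  dist.init_process_group("xla", init_method='xla://')

  device = xm.xla_device()
  model = nn.Linear(16, 2).to(device)
  # `gradient_as_bucket_view=True` is required for XLA.
  ddp_model = DDP(model, gradient_as_bucket_view=True)
  optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
  loss_fn = nn.CrossEntropyLoss()
  train_loader = DataLoader(
      TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,))),
      batch_size=32)

  for inputs, labels in train_loader:
    with xla.step():
      inputs, labels = inputs.to(device), labels.to(device)
      optimizer.zero_grad()
      outputs = ddp_model(inputs)
      loss = loss_fn(outputs, labels)
      loss.backward()
      optimizer.step()

if __name__ == '__main__':
  xmp.spawn(_mp_fn, args=())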

Additional information on PyTorch/XLA, including a description of its semantics and functions, is available at PyTorch.org. See the API Guide for best practices when writing networks that run on XLA devices (TPU, CUDA, CPU and...).

Our comprehensive user guides are available at:

Documentation for the latest release

Documentation for master branch

PyTorch/XLA tutorials

Available docker images and wheels

Python packages

PyTorch/XLA releases starting with version r2.1 will be available on PyPI. You can now install the main build with pip install torch_xla. To also install the Cloud TPU plugin, install the optional tpu dependencies after installing the main build with

pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html

GPU and nightly builds are available in our public GCS bucket.

Version Cloud TPU/GPU VMs Wheel
2.3 (Python 3.8) https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.3.0-cp38-cp38-manylinux_2_28_x86_64.whl
2.3 (Python 3.10) https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.3.0-cp310-cp310-manylinux_2_28_x86_64.whl
2.3 (CUDA 12.1 + Python 3.8) https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-2.3.0-cp38-cp38-manylinux_2_28_x86_64.whl
2.3 (CUDA 12.1 + Python 3.10) https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-2.3.0-cp310-cp310-manylinux_2_28_x86_64.whl
nightly (Python 3.8) https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp38-cp38-linux_x86_64.whl
nightly (Python 3.10) https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly-cp310-cp310-linux_x86_64.whl
nightly (CUDA 12.1 + Python 3.8) https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp38-cp38-linux_x86_64.whl

You can also add +yyyymmdd after torch_xla-nightly to get the nightly wheel of a specific date. To get the companion pytorch nightly wheel, replace torch_xla with torch in the wheel links above.
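For example, a hypothetical dated nightly install might look like the following (the date and Python version are illustrative; substitute the ones you need):

pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-nightly%2B20240401-cp310-cp310-linux_x86_64.whl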

older versions
Version Cloud TPU VMs Wheel
2.2 (Python 3.8) https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.2.0-cp38-cp38-manylinux_2_28_x86_64.whl
2.2 (Python 3.10) https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.2.0-cp310-cp310-manylinux_2_28_x86_64.whl
2.2 (CUDA 12.1 + Python 3.8) https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-2.2.0-cp38-cp38-manylinux_2_28_x86_64.whl
2.2 (CUDA 12.1 + Python 3.10) https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-2.2.0-cp310-cp310-manylinux_2_28_x86_64.whl
2.1 (XRT + Python 3.10) https://storage.googleapis.com/pytorch-xla-releases/wheels/xrt/tpuvm/torch_xla-2.1.0%2Bxrt-cp310-cp310-manylinux_2_28_x86_64.whl
2.1 (Python 3.8) https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.1.0-cp38-cp38-linux_x86_64.whl
2.0 (Python 3.8) https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-2.0-cp38-cp38-linux_x86_64.whl
1.13 https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
1.12 https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.12-cp38-cp38-linux_x86_64.whl
1.11 https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.11-cp38-cp38-linux_x86_64.whl
1.10 https://storage.googleapis.com/tpu-pytorch/wheels/tpuvm/torch_xla-1.10-cp38-cp38-linux_x86_64.whl

Note: For TPU Pod customers using XRT (our legacy runtime), we have custom wheels for torch and torch_xla at https://storage.googleapis.com/tpu-pytorch/wheels/xrt.

Package Cloud TPU VMs Wheel (XRT on Pod, Legacy Only)
torch_xla https://storage.googleapis.com/pytorch-xla-releases/wheels/xrt/tpuvm/torch_xla-2.1.0%2Bxrt-cp310-cp310-manylinux_2_28_x86_64.whl
torch https://storage.googleapis.com/pytorch-xla-releases/wheels/xrt/tpuvm/torch-2.1.0%2Bxrt-cp310-cp310-linux_x86_64.whl

Version GPU Wheel + Python 3.8
2.1+ CUDA 11.8 https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/11.8/torch_xla-2.1.0-cp38-cp38-manylinux_2_28_x86_64.whl
2.0 + CUDA 11.8 https://storage.googleapis.com/tpu-pytorch/wheels/cuda/118/torch_xla-2.0-cp38-cp38-linux_x86_64.whl
2.0 + CUDA 11.7 https://storage.googleapis.com/tpu-pytorch/wheels/cuda/117/torch_xla-2.0-cp38-cp38-linux_x86_64.whl
1.13 https://storage.googleapis.com/tpu-pytorch/wheels/cuda/112/torch_xla-1.13-cp38-cp38-linux_x86_64.whl
nightly + CUDA 12.0 >= 2023/06/27 https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.0/torch_xla-nightly-cp38-cp38-linux_x86_64.whl
nightly + CUDA 11.8 <= 2023/04/25 https://storage.googleapis.com/tpu-pytorch/wheels/cuda/118/torch_xla-nightly-cp38-cp38-linux_x86_64.whl
nightly + CUDA 11.8 >= 2023/04/25 https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/11.8/torch_xla-nightly-cp38-cp38-linux_x86_64.whl

Version GPU Wheel + Python 3.7
1.13 https://storage.googleapis.com/tpu-pytorch/wheels/cuda/112/torch_xla-1.13-cp37-cp37m-linux_x86_64.whl
1.12 https://storage.googleapis.com/tpu-pytorch/wheels/cuda/112/torch_xla-1.12-cp37-cp37m-linux_x86_64.whl
1.11 https://storage.googleapis.com/tpu-pytorch/wheels/cuda/112/torch_xla-1.11-cp37-cp37m-linux_x86_64.whl

Version Colab TPU Wheel
2.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-2.0-cp310-cp310-linux_x86_64.whl

Docker

Version Cloud TPU VMs Docker
2.3 us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.3.0_3.10_tpuvm
2.2 us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.2.0_3.10_tpuvm
2.1 us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.1.0_3.10_tpuvm
2.0 gcr.io/tpu-pytorch/xla:r2.0_3.8_tpuvm
1.13 gcr.io/tpu-pytorch/xla:r1.13_3.8_tpuvm
nightly python us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm

To use the above Docker images, pass --privileged --net host --shm-size=16G to docker run. Here is an example:

docker run --privileged --net host --shm-size=16G -it us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm /bin/bash

Version GPU CUDA 12.1 Docker
2.3 us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.3.0_3.10_cuda_12.1
2.2 us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.2.0_3.10_cuda_12.1
2.1 us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.1.0_3.10_cuda_12.1
nightly us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.8_cuda_12.1
nightly at date us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.8_cuda_12.1_YYYYMMDD

Version GPU CUDA 11.8 + Docker
2.1 us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.1.0_3.10_cuda_11.8
2.0 gcr.io/tpu-pytorch/xla:r2.0_3.8_cuda_11.8
nightly us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.8_cuda_11.8
nightly at date us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.8_cuda_11.8_YYYYMMDD

older versions
Version GPU CUDA 11.7 + Docker
2.0 gcr.io/tpu-pytorch/xla:r2.0_3.8_cuda_11.7

Version GPU CUDA 11.2 + Docker
1.13 gcr.io/tpu-pytorch/xla:r1.13_3.8_cuda_11.2

Version GPU CUDA 11.2 + Docker
1.13 gcr.io/tpu-pytorch/xla:r1.13_3.7_cuda_11.2
1.12 gcr.io/tpu-pytorch/xla:r1.12_3.7_cuda_11.2

Use these images to run on compute instances with GPUs.
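A typical invocation for the GPU images, assuming the NVIDIA Container Toolkit is installed on the host, might look like the following (the tag is one of the CUDA 12.1 images from the table above):

docker run --gpus all --shm-size=16G -it us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.3.0_3.10_cuda_12.1 /bin/bash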

Troubleshooting

If PyTorch/XLA isn't performing as expected, see the troubleshooting guide, which has suggestions for debugging and optimizing your network(s).
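A common first debugging step from the troubleshooting guide is to print the client-side metrics report, which includes compilation and data-transfer counters (a minimal sketch, assuming torch_xla is installed):

import torch_xla.debug.metrics as met

# Run your model first, then dump the accumulated metrics and counters.
print(met.metrics_report())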

Providing Feedback

The PyTorch/XLA team is always happy to hear from users and OSS contributors! The best way to reach out is by filing an issue on this GitHub repository. Questions, bug reports, feature requests, build issues, etc. are all welcome!

Contributing

See the contribution guide.

Disclaimer

This repository is jointly operated and maintained by Google, Meta, and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to [email protected]. For questions directed at Google, please send an email to [email protected]. For all other questions, please open an issue in this repository.

Additional Reads

You can find additional useful reading materials in

Related Projects

xla's People

Contributors

ailzhang, alanwaketan, asuhan, bdhirsh, cota, dlibenzi, ezyang, frgossen, golechwierowicz, jackcaog, jonb377, jysohn23, lsy323, manfeibai, mateuszlewko, miladm, mruberry, qihqi, steventk-g, stgpetrovic, taylanbil, vanbasten23, vishwakftw, will-cromar, wonjoolee95, yeounoh, ymwangg, ysiraichi, zcain117, zpcore


xla's Issues

empty ndim index

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
import torch.nn.functional as F
xla_device = xm.xla_device()
xla_device = 'cpu'
v = torch.randn(2, 3, 4, 5).to(xla_device)
a = torch.empty(2, 0, 6, 4, 5, device=xla_device)
b = v[:, torch.empty(0, 6, dtype=torch.int64, device=xla_device)]
pdb.set_trace()
print(a.shape)
print(b.shape)

Got

torch.Size([2, 0, 6, 4, 5])
torch.Size([0, 6, 2, 4, 5])

Expect:

torch.Size([2, 0, 6, 4, 5])
torch.Size([2, 0, 6, 4, 5])

PyTorch/XLA GPU support

I would like to use PyTorch/XLA on GPUs.

But PyTorch/XLA doesn't seem to work on them, since XLA_GPU_DEVICE isn't registered, even though xla_gpu_device.cpp appears to be included in the appropriate BUILD file because use_cuda==True in .bazelrc.

Thank you!

please use release with torch / tpu versions

I cannot use xla since it requires a TPU runtime for TF 1.14, which is not released at this moment.

Please use the 'release' feature on GitHub with reproducible versions for torch, tpu, etc.

It is very discouraging to use this project.

Error when building locally

Selected XLAC library
ln: failed to create symbolic link '/home/kostik/pytorch/xla/build/lib.linux-x86_64-2.7/libptxla.so' -> '': No such file or directory
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
PTXLA_LIB
linked by target "test_ptxla" in directory /home/kostik/pytorch/xla/test/cpp

-- Configuring incomplete, errors occurred!
See also "/home/kostik/pytorch/xla/test/cpp/build/CMakeFiles/CMakeOutput.log".
Failed to build tests: ['/home/kostik/pytorch/xla/test/cpp/run_tests.sh', '-B']

How can I work around this?

Build issue on GCP

nvcc fatal : The version ('7.0') of the host compiler ('clang') is not supported
Makefile:52: recipe for target '/home/jupyter/pytorch/build/nccl/obj/collectives/device/reduce_scatter.dep' failed
make[2]: *** [/home/jupyter/pytorch/build/nccl/obj/collectives/device/reduce_scatter.dep] Error 1
nvcc fatal : The version ('7.0') of the host compiler ('clang') is not supported
Makefile:52: recipe for target '/home/jupyter/pytorch/build/nccl/obj/collectives/device/functions.dep' failed
make[2]: *** [/home/jupyter/pytorch/build/nccl/obj/collectives/device/functions.dep] Error 1
nvcc fatal : The version ('7.0') of the host compiler ('clang') is not supported
Makefile:52: recipe for target '/home/jupyter/pytorch/build/nccl/obj/collectives/device/reduce.dep' failed
make[2]: *** [/home/jupyter/pytorch/build/nccl/obj/collectives/device/reduce.dep] Error 1
nvcc fatal : The version ('7.0') of the host compiler ('clang') is not supported
Makefile:52: recipe for target '/home/jupyter/pytorch/build/nccl/obj/collectives/device/all_reduce.dep' failed
make[2]: *** [/home/jupyter/pytorch/build/nccl/obj/collectives/device/all_reduce.dep] Error 1
nvcc fatal : The version ('7.0') of the host compiler ('clang') is not supported
Makefile:52: recipe for target '/home/jupyter/pytorch/build/nccl/obj/collectives/device/broadcast.dep' failed
make[2]: *** [/home/jupyter/pytorch/build/nccl/obj/collectives/device/broadcast.dep] Error 1
make[2]: Leaving directory '/home/jupyter/pytorch/third_party/nccl/nccl/src/collectives/device'
Makefile:44: recipe for target '/home/jupyter/pytorch/build/nccl/obj/collectives/device/colldevice.a' failed
make[1]: *** [/home/jupyter/pytorch/build/nccl/obj/collectives/device/colldevice.a] Error 2
make[1]: Leaving directory '/home/jupyter/pytorch/third_party/nccl/nccl/src'
Makefile:25: recipe for target 'src.build' failed
make: *** [src.build] Error 2
[12/3115] Building CXX object third_party/protobuf/cmake/CMakeFiles/libprotobuf.dir/__/src/google/protobuf/extension_set.cc.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "setup.py", line 728, in
build_deps()
File "setup.py", line 294, in build_deps
build_dir='build')
File "/home/jupyter/pytorch/tools/build_pytorch_libs.py", line 293, in build_caffe2
check_call(ninja_cmd, cwd=build_dir, env=my_env)
File "/opt/anaconda3/lib/python3.7/subprocess.py", line 341, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ninja', 'install']' returned non-zero exit status 1

wrap numbers as 0dim tensor

Minimal repro script:

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm

a = torch.randn(3, 800, 1066)
xla_device = xm.xla_device()

b = a.to(xla_device)
c = b + 1
print(c)

new torch.bool

Recently torch.bool became a thing. See https://github.com/pytorch/pytorch/commits?author=izdeby Is it mapping to the xla::PRED type?

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
import torch.nn.functional as F
xla_device = xm.xla_device()
# xla_device = 'cpu'
pdb.set_trace()
b = torch.tensor([True, False, True, True, False], dtype=torch.bool).to(xla_device)

Assignment with bool index

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
import torch.nn.functional as F
xla_device = xm.xla_device()
# xla_device = 'cpu'
x = torch.rand(2, 3).to(xla_device)
neg_ones = torch.ones_like(x) * -1
neg_ones_expanded = neg_ones.unsqueeze(0).unsqueeze(0)
x[True] = neg_ones_expanded
print(x)

Expected all -1s, got unchanged x

AtenXlaTensorTest.TestWhere broken on TPU

Running the C++ tests we get the following stack trace:

unknown file: Failure
C++ exception with description "tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:218 : Check failed: mwait.Wait() == ::tensorflow::Status::OK() (Aborted: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:198 : Check failed:
 session->session()->Run( session_work.feed_inputs, session_work.outputs_handles, &outputs) == ::tensorflow::Status::OK() (Unimplemented: While rewriting computation to not contain X64 element types, XLA encountered an HLO for which this rewritin
g is not implemented: %convert.4 = pred[3,3]{1,0} convert(s64[3,3]{1,0} %constant.3)
         [[{{node XRTCompile}}]]
         [[XRTCompile_G20]] vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)
        std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
        clone
*** End stack trace ***
 vs. OK)" thrown in the test body.

Op type not registered Error

After successfully compiling xla on a Google Cloud server, I am trying to use a TPU instance to run the tests. However, I get the following error message after setting the TPU worker IP address and running python test/test_operations.py:

======================================================================
ERROR: test_view (__main__.TestXLATensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_operations.py", line 1346, in test_view
    out = xt_x.view(-1, 320).to_tensor()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:87 : Check failed: session_work.first->session()->Run( session_work.second.feed_inputs, session_work.second.outputs_handles, &outputs) == ::tensorflow::Status::OK() (Not found: Op type not registered 'XRTAllocateFromTensor' in binary running on n-04d145c9-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. vs. OK)
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::XrtComputationClient::TransferToServer(absl::Span<xla::ComputationClient::TensorSource const>)
	torch_xla::TensorToXlaData(at::Tensor const&, torch_xla::Device const&)
	torch_xla::XLATensor::GetIrValue(at::Tensor const&) const
	torch_xla::XLATensor::GetIrValue() const
	torch_xla::XLATensor::CreateView(absl::Span<long long const>) const
	torch_xla::XLATensor::view(torch_xla::XLATensor const&, absl::Span<long long const>)


	_PyMethodDef_RawFastCallKeywords
	_PyCFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyObject_Call_Prepend
	PyObject_Call
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyObject_Call_Prepend

	_PyObject_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyObject_Call_Prepend
	PyObject_Call
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyObject_Call_Prepend

	_PyObject_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyObject_Call_Prepend
	PyObject_Call
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyObject_Call_Prepend

	_PyObject_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyFunction_FastCallKeywords
	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	_PyFunction_FastCallDict
	_PyObject_Call_Prepend

	_PyObject_FastCallKeywords
	_PyEval_EvalFrameDefault

	_PyEval_EvalFrameDefault
	_PyEval_EvalCodeWithName
	PyEval_EvalCodeEx
	PyEval_EvalCode

	PyRun_FileExFlags
	PyRun_SimpleFileExFlags

	_Py_UnixMain
	__libc_start_main

*** End stack trace ***


----------------------------------------------------------------------

The TF version on the TPU node is 1.12, and I am using the master branch of both this repo and PyTorch.

unsupported type: pred

It was not straightforward to copy the computation to form a minimal Python repro, as I'm not sure how an xla::PRED scalar could be created from the Python interface. But I guess this backtrace is helpful.
Basically it's complaining about alpha in sub_(self, other, alpha) having element type xla::PRED.

Thread 1 "python" hit Breakpoint 1, torch_xla::TensorTypeFromXlaType (xla_type=xla::PRED)
    at torch_xla/csrc/tensor_util.cpp:547
547           XLA_ERROR() << "XLA type not supported: " << xla_type;
(gdb) bt
#0  torch_xla::TensorTypeFromXlaType (xla_type=xla::PRED) at torch_xla/csrc/tensor_util.cpp:547
#1  0x00007fffdc506035 in torch_xla::XLATensor::GetIrValueForScalar (value=..., type=xla::PRED, device=...)
    at torch_xla/csrc/tensor.cpp:522
#2  0x00007fffdc506135 in torch_xla::XLATensor::GetIrValueForScalar (value=..., shape=..., device=...)
    at torch_xla/csrc/tensor.cpp:529
#3  0x00007fffdc4af1ee in torch_xla::XLATensor::sub_ (input=..., other=..., alpha=...)
    at torch_xla/csrc/tensor_methods.cpp:1855
#4  0x00007fffdc4867b3 in torch_xla::AtenXlaType::sub_ (this=0x555557046ad0, self=..., other=..., alpha=...)
    at torch_xla/csrc/aten_xla_type.cpp:2582
#5  0x00007fffe14f36e3 in torch::autograd::VariableType::sub_ (this=0x555557054900, self=..., other=...,
    alpha=...) at ../torch/csrc/autograd/generated/VariableType_0.cpp:15652
#6  0x00007fffe8c5d50f in at::Tensor::sub_ (this=0x7fffffffbec8, other=..., alpha=...)
    at ../aten/src/ATen/core/TensorMethods.h:704
#7  0x00007fffe8bc020c in at::operator- (x=..., y=...) at ../aten/src/ATen/TensorOperators.h:96
#8  0x00007fffe8bc4ddd in torch::autograd::dispatch_invert (self=...)
    at ../torch/csrc/autograd/generated/python_variable_methods.cpp:271

Error when running simple mnist demo

I made this demo from one of the tests, and simplified it into a standalone mnist demo. However, this error happens when it runs:

(tpu) waf2107@waf2107:~/pytorch/xla/demo$ python mnist.py 
2019-05-08 00:48:51.601926: I tensorflow/compiler/xla/xla_client/computation_client.cc:125] 
Loading XRT configuration from /home/waf2107/.pytorch_tpu.conf
2019-05-08 00:48:51.602163: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-05-08 00:48:51.602194: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-05-08 00:48:51.602206: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-05-08 00:48:51.602216: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-05-08 00:48:51.602223: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-05-08 00:48:51.602229: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-05-08 00:48:51.602251: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-05-08 00:48:51.602258: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-05-08 00:48:51.602264: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-05-08 00:48:51.602275: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
2] XRT default device: TPU:0
2019-05-08 00:48:51.602323: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:7
78] Initializing TPU system for worker tpu_worker:0 at grpc://10.240.1.2:8470
2019-05-08 00:48:55.209839: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:8
21] TPU topology: mesh_shape: 2
mesh_shape: 2
mesh_shape: 2
num_tasks: 1
num_tpu_devices_per_task: 8
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1
2019-05-08 00:48:55.742759: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check fai
led: session_work.first->session()->Run( session_work.second.feed_inputs, session_work.secon
d.outputs_handles, &outputs) == ::tensorflow::Status::OK() (Not found: Op type not registere
d 'XRTAllocateFromTensor' in binary running on n-897fcadd-w-0. Make sure the Op and Kernel a
re registered in the binary running in this process. Note that if you are loading a saved gr
aph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done b
efore importing the graph, as contrib ops are lazily registered when the module is first acc
essed. vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::XrtComputationClient::TransferToServer(absl::Span<xla::ComputationClient::Tenso
rSource const>)
        torch_xla::TensorToXlaData(at::Tensor const&, torch_xla::Device const&)
        torch_xla::XLATensor::GetIrValueForScalar(c10::Scalar, xla::PrimitiveType, torch_xla
::Device const&)
        torch_xla::XLATensor::GetIrValueForScalar(c10::Scalar, xla::Shape const&, torch_xla:
:Device const&)
        torch_xla::XLATensor::full(absl::Span<long long const>, c10::Scalar, torch_xla::Devi
ce const&, c10::ScalarType)
        torch_xla::AtenXlaType::full(c10::ArrayRef<long>, c10::Scalar, c10::TensorOptions co
nst&) const
        torch_xla::AtenXlaType::empty(c10::ArrayRef<long>, c10::TensorOptions const&) const
        torch::autograd::VariableType::empty(c10::ArrayRef<long>, c10::TensorOptions const&)
 const
        at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool)
        at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const
        torch::autograd::VariableType::to(at::Tensor const&, c10::TensorOptions const&, bool
, bool) const

        _PyMethodDef_RawFastCallKeywords
        _PyMethodDescr_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallDict
        _PyObject_Call_Prepend
        _PyObject_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode
        PyRun_FileExFlags
        PyRun_SimpleFileExFlags
        _Py_UnixMain
        __libc_start_main
*** End stack trace ***



Traceback (most recent call last):
  File "mnist.py", line 116, in <module>
    train_mnist()
  File "mnist.py", line 75, in train_mnist
    model_parallel = dp.DataParallel(MNIST, device_ids=devices)
  File "/home/waf2107/miniconda3/envs/tpu/lib/python3.7/site-packages/torch_xla-0.1-py3.7-li
nux-x86_64.egg/torch_xla_py/data_parallel.py", line 164, in __init__
    module = network().to(device=torch.device(device))
  File "/home/waf2107/miniconda3/envs/tpu/lib/python3.7/site-packages/torch/nn/modules/modul
e.py", line 386, in to
    return self._apply(convert)
  File "/home/waf2107/miniconda3/envs/tpu/lib/python3.7/site-packages/torch/nn/modules/modul
e.py", line 193, in _apply
    module._apply(fn)
  File "/home/waf2107/miniconda3/envs/tpu/lib/python3.7/site-packages/torch/nn/modules/modul
e.py", line 199, in _apply
    param.data = fn(param.data)
  File "/home/waf2107/miniconda3/envs/tpu/lib/python3.7/site-packages/torch/nn/modules/modul
e.py", line 384, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:100 : Check failed: session_work.first->session()->Run( session_work.second.feed_inputs, session_work.second.
outputs_handles, &outputs) == ::tensorflow::Status::OK() (Not found: Op type not registered 'XRTAllocateFromTensor' in binary running on n-897fcadd-w-0. Make sure the Op and Kernel are
 registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done bef
ore importing the graph, as contrib ops are lazily registered when the module is first accessed. vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::XrtComputationClient::TransferToServer(absl::Span<xla::ComputationClient::TensorSource const>)
        torch_xla::TensorToXlaData(at::Tensor const&, torch_xla::Device const&)
        torch_xla::XLATensor::GetIrValueForScalar(c10::Scalar, xla::PrimitiveType, torch_xla::Device const&)
        torch_xla::XLATensor::GetIrValueForScalar(c10::Scalar, xla::Shape const&, torch_xla::Device const&)
        torch_xla::XLATensor::full(absl::Span<long long const>, c10::Scalar, torch_xla::Device const&, c10::ScalarType)
        torch_xla::AtenXlaType::full(c10::ArrayRef<long>, c10::Scalar, c10::TensorOptions const&) const
        torch_xla::AtenXlaType::empty(c10::ArrayRef<long>, c10::TensorOptions const&) const
        torch::autograd::VariableType::empty(c10::ArrayRef<long>, c10::TensorOptions const&) const
        at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool)
        at::TypeDefault::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const
        torch::autograd::VariableType::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) const


        _PyMethodDef_RawFastCallKeywords
        _PyMethodDescr_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallDict
        _PyObject_Call_Prepend
        _PyObject_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        _PyFunction_FastCallKeywords
        _PyEval_EvalFrameDefault
        _PyEval_EvalCodeWithName
        PyEval_EvalCodeEx
        PyEval_EvalCode
        PyRun_FileExFlags
        PyRun_SimpleFileExFlags
        _Py_UnixMain
        __libc_start_main
*** End stack trace ***

type promotion

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
import torch.nn.functional as F
xla_device = xm.xla_device()
# xla_device = 'cpu'

pdb.set_trace()

a = torch.rand(10).to(xla_device).to(dtype=torch.uint8)
b = torch.tensor([0, 1, 2, 3]).to(xla_device)
print(a.dtype)
print(a[b].dtype) # shows float32, should be uint8

copy_ is no-op

This is the first culprit for the wrong maskrcnn results: basically it fails to create the padded image, so all the following results don't make sense.

Minimal repro:

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
a = torch.rand(3, 3, 3)
xla_device = xm.xla_device()
a = a.to(xla_device)
b_shape = (4, 4, 4)
b = a.new(*b_shape).zero_()
b[:a.shape[0], :a.shape[1], :a.shape[2]].copy_(a)

print(b)

xla hangs after build

When running tests, the following output is generated but the tests hang...

(tpu) waf2107@waf2107:~/pytorch/xla$ python test/test_operations.py
2019-05-07 23:58:22.590748: I tensorflow/compiler/xla/xla_client/computation_client.cc:125] 
Loading XRT configuration from /home/waf2107/.pytorch_tpu.conf
2019-05-07 23:58:22.591183: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-05-07 23:58:22.591214: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-05-07 23:58:22.591234: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-05-07 23:58:22.591249: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-05-07 23:58:22.591263: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-05-07 23:58:22.591279: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-05-07 23:58:22.591293: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-05-07 23:58:22.591308: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-05-07 23:58:22.591322: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:3
9] XRT device TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-05-07 23:58:22.591336: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
2] XRT default device: TPU:0
2019-05-07 23:58:22.591458: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:7
78] Initializing TPU system for worker tpu_worker:0 at grpc://10.128.0.5:8470

import torch_xla_py.data_parallel as dp issues

Hi, after following the install instructions, I get the following error when running tests. It seems to be caused by import torch_xla_py.data_parallel as dp:
(tpu) waf2107@waf2107:~/pytorch/xla$ python app.py 
2019-05-07 01:40:15.424331: I tensorflow/compiler/xla/xla_client/computation_client.cc:125] 
Loading XRT configuration from /home/waf2107/.pytorch_tpu.conf
2019-05-07 01:40:15.424560: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-05-07 01:40:15.424578: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-05-07 01:40:15.424602: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-05-07 01:40:15.424611: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-05-07 01:40:15.424621: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-05-07 01:40:15.424630: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-05-07 01:40:15.424639: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-05-07 01:40:15.424647: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-05-07 01:40:15.424656: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
4] XRT device TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-05-07 01:40:15.424665: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:4
7] XRT default device: TPU:0
2019-05-07 01:40:15.424701: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:8
77] Initializing TPU system for worker tpu_worker:0 at grpc://10.240.1.2:8470
2019-05-07 01:40:15.494162: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check fai
led: session->session()->Run({tensorflow::Output(result, 0)}, &outputs) == ::tensorflow::Sta
tus::OK() (Not found: Op type not registered 'XRTExecuteChained' in binary running on n-d9dc
d0ea-w-0. Make sure the Op and Kernel are registered in the binary running in this process. 

Type mismatch when doing index_put_ after to(dtype)

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
xla_device = xm.xla_device()
# xla_device = 'cpu'
a = torch.tensor([1, 2, 3, 4]).to(xla_device).to(dtype=torch.float32)
b = torch.rand(4) == 0.5
pdb.set_trace()
a[b] = 0
print(a)

core dump in slicing

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
import torch.nn.functional as F
xla_device = xm.xla_device()
# xla_device = 'cpu'
v = torch.randn(2, 3, 4, 5).to(xla_device)
pdb.set_trace()
y = v[:, :, :, 1]
z = y[:, 1:1, :]

Failed to run all test files in xla/test

After installing pytorch and xla and setting the IP of the TPU node, I tried to run the test code provided in xla/test, but none of it runs successfully. The error messages from two sample scripts are as follows:

~/pytorch/xla/test$ python test_operations.py
2019-04-14 14:33:57.371620: I tensorflow/compiler/xla/xla_client/computation_client.cc:125] Loading XRT configuration from /home/hkbuautoml/.pytorch_tpu.conf
2019-04-14 14:33:57.371773: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-04-14 14:33:57.371791: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-04-14 14:33:57.371813: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-04-14 14:33:57.371822: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-04-14 14:33:57.371832: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-04-14 14:33:57.371842: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-04-14 14:33:57.371851: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-04-14 14:33:57.371859: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-04-14 14:33:57.371867: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-04-14 14:33:57.371876: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:51] XRT default device: TPU:0
2019-04-14 14:33:57.371917: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:731] Initializing TPU system for worker tpu_worker:0 at grpc://10.240.1.10:8470
2019-04-14 14:34:03.636433: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:774] TPU topology: mesh_shape: 2
mesh_shape: 2
mesh_shape: 2
num_tasks: 1
num_tpu_devices_per_task: 8
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1

2019-04-14 14:34:03.753129: W tensorflow/compiler/xla/shape.cc:41] Malformed shape proto: is_dynamic_dimension is empty
2019-04-14 14:34:03.753212: W tensorflow/compiler/xla/shape.cc:41] Malformed shape proto: is_dynamic_dimension is empty
2019-04-14 14:34:03.759877: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check failed: session_work.first->session()->Run( session_work.second.feed_inputs, session_work.second.outputs_handles, &outputs) == ::tensorflow::Status::OK() (Not found: Op type not registered 'XRTAllocateFromTensor' in binary running on n-8d59f3db-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::XrtComputationClient::TransferToServer(absl::Span<xla::ComputationClient::TensorSource const>)
        torch_xla::TensorToXlaData(at::Tensor const&, torch_xla::Device const&)

        torch_xla::XLATensor::GetIrValueForScalar(c10::Scalar, xla::PrimitiveType, torch_xla::Device const&)
        torch_xla::XLATensor::GetIrValueForScalar(c10::Scalar, xla::Shape const&, torch_xla::Device const&)
        torch_xla::XLATensor::add(torch_xla::XLATensor const&, torch_xla::XLATensor const&, c10::Scalar)
        torch_xla::AtenXlaType::add(at::Tensor const&, at::Tensor const&, c10::Scalar) const
        torch::autograd::VariableType::add(at::Tensor const&, at::Tensor const&, c10::Scalar) const

        PyCFunction_Call
        PyObject_Call


        PyNumber_Add
        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault

        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        PyObject_Call
        _PyEval_EvalFrameDefault

        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        PyObject_Call

        _PyObject_FastCallDict

        _PyEval_EvalFrameDefault

        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        PyObject_Call
        _PyEval_EvalFrameDefault
        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        PyObject_Call

        _PyObject_FastCallDict

        _PyEval_EvalFrameDefault

        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        PyObject_Call
        _PyEval_EvalFrameDefault

        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        PyObject_Call

        _PyObject_FastCallDict

        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault

        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        PyObject_Call


        _PyObject_FastCallDict
        _PyObject_FastCallKeywords

        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_FileExFlags
        PyRun_SimpleFileExFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***

E2019-04-14 14:34:03.780553: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (Internal: handle input should be an int64 scalar
         [[{{node XRTReleaseAllocationHandle}}]] vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
        xla::XrtComputationClient::HandleReleaser()
        xla::xla_util::TriggeredTask::Runner()


        clone
*** End stack trace ***

terminate called after throwing an instance of 'std::runtime_error'
  what():  tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:629 : Check failed: session->session()->Run( feed_inputs, {}, {cached_node.operations[0]}, &outputs) == ::tensorflow::Status::OK() (Internal: handle input should be an int64 scalar
         [[{{node XRTReleaseAllocationHandle}}]] vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::XrtComputationClient::ReleaseHandles(std::vector<xla::XrtComputationClient::DeviceHandle, std::allocator<xla::XrtComputationClient::DeviceHandle> >*, std::function<xla::XrtSession::CachedNode const& (xla::XrtSession*, tensorflow::Scope const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&, xla::metrics::Metric*, xla::metrics::Counter*)
        xla::XrtComputationClient::HandleReleaser()
        xla::xla_util::TriggeredTask::Runner()


        clone
*** End stack trace ***

Aborted
(py36) hkbuautoml@tputest:~/pytorch/xla/test$ python test_train_cifar.py 
2019-04-14 14:38:35.400546: I tensorflow/compiler/xla/xla_client/computation_client.cc:125] Loading XRT configuration from /home/hkbuautoml/.pytorch_tpu.conf
2019-04-14 14:38:35.400697: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device CPU:0 -> /job:tpu_worker/replica:0/task:0/device:XLA_CPU:0
2019-04-14 14:38:35.400714: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:0 -> /job:tpu_worker/replica:0/task:0/device:TPU:0
2019-04-14 14:38:35.400725: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:1 -> /job:tpu_worker/replica:0/task:0/device:TPU:1
2019-04-14 14:38:35.400736: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:2 -> /job:tpu_worker/replica:0/task:0/device:TPU:2
2019-04-14 14:38:35.400746: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:3 -> /job:tpu_worker/replica:0/task:0/device:TPU:3
2019-04-14 14:38:35.400755: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:4 -> /job:tpu_worker/replica:0/task:0/device:TPU:4
2019-04-14 14:38:35.400765: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:5 -> /job:tpu_worker/replica:0/task:0/device:TPU:5
2019-04-14 14:38:35.400776: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:6 -> /job:tpu_worker/replica:0/task:0/device:TPU:6
2019-04-14 14:38:35.400785: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:48] XRT device TPU:7 -> /job:tpu_worker/replica:0/task:0/device:TPU:7
2019-04-14 14:38:35.400795: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:51] XRT default device: TPU:0
2019-04-14 14:38:35.400825: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:731] Initializing TPU system for worker tpu_worker:0 at grpc://10.240.1.10:8470
2019-04-14 14:38:42.846154: I tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:774] TPU topology: mesh_shape: 2
mesh_shape: 2
mesh_shape: 2
num_tasks: 1
num_tpu_devices_per_task: 8
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1
device_coordinates: 0
device_coordinates: 0
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1
device_coordinates: 0
device_coordinates: 1
device_coordinates: 1
device_coordinates: 1

==> Preparing data..
Files already downloaded and verified
Files already downloaded and verified
==> Building model..
[':0']
torch.Size([128, 3, 32, 32])
2019-04-14 14:39:10.685305: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check failed: session_work.first->session()->Run( session_work.second.feed_inputs, session_work.second.outputs_handles, &outputs) == ::tensorflow::Status::OK() (Not found: Op type not registered 'XRTAllocateFromTensor' in binary running on n-8d59f3db-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::XrtComputationClient::TransferToServer(absl::Span<xla::ComputationClient::TensorSource const>)
        torch_xla::CreateTensorsData(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)
        torch_xla::XLATensor::CreateTensors(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)


        _PyCFunction_FastCallDict

        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        _PyObject_Call


        _PyObject_FastCallDict
        _PyObject_FastCallKeywords

        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_FileExFlags
        PyRun_SimpleFileExFlags
        Py_Main
        main
        __libc_start_main

*** End stack trace ***

Traceback (most recent call last):
  File "test_train_cifar.py", line 184, in <module>
    acc =  train_cifar()
  File "test_train_cifar.py", line 152, in train_cifar
    devices=devices)
  File "/home/hkbuautoml/pytorch/xla/torch_xla_py/xla_model.py", line 595, in __init__
    input_gradients=loss_output_grads)
  File "/home/hkbuautoml/pytorch/xla/torch_xla_py/xla_model.py", line 288, in create_xla_model
    inputs_xla = convert_to_xla_tensors(replica_inputs, devices=devices)
  File "/home/hkbuautoml/pytorch/xla/torch_xla_py/xla_model.py", line 259, in convert_to_xla_tensors
    arena.convert()
  File "/home/hkbuautoml/pytorch/xla/torch_xla_py/xla_model.py", line 176, in convert
    self._converted_tensors = self.convert_fn(self._tensors, self._devices)
  File "/home/hkbuautoml/pytorch/xla/torch_xla_py/xla_model.py", line 255, in convert
    return torch_xla._XLAC._xla_create_tensors(tensors, devices)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:109 : Check failed: session_work.first->session()->Run( session_work.second.feed_inputs, session_work.second.outputs_handles, &outputs) == ::tensorflow::Status::OK() (Not found: Op type not registered 'XRTAllocateFromTensor' in binary running on n-8d59f3db-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. vs. OK)
*** Begin stack trace ***
        tensorflow::CurrentStackTrace[abi:cxx11]()
        xla::XrtComputationClient::TransferToServer(absl::Span<xla::ComputationClient::TensorSource const>)
        torch_xla::CreateTensorsData(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)
        torch_xla::XLATensor::CreateTensors(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)


        _PyCFunction_FastCallDict

        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        _PyCFunction_FastCallDict

        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault



        _PyEval_EvalFrameDefault

        _PyFunction_FastCallDict
        _PyObject_FastCallDict
        _PyObject_Call_Prepend
        PyObject_Call


        _PyObject_FastCallDict
        _PyObject_FastCallKeywords

        _PyEval_EvalFrameDefault


        _PyEval_EvalFrameDefault
        PyEval_EvalCodeEx
        PyEval_EvalCode

        PyRun_FileExFlags
        PyRun_SimpleFileExFlags
        Py_Main
        main
        __libc_start_main

Our JIT passes bark with PT JIT changes

After #569 is in, we build, but now we have issues when running tests.

======================================================================
ERROR: test (__main__.TestMNIST)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_operations.py", line 607, in test
    self.compareModel(model, x)
  File "test/test_operations.py", line 209, in compareModel
    xla_model = xm.XlaModel(model, [input], full_conv_precision=True)
  File "/usr/local/google/home/dlibenzi/google-git/pytorch/xla/torch_xla_py/xla_model.py", line 599, in __init__
    self._model_fn, inputs, num_cores=self._num_cores, devices=devices)
  File "/usr/local/google/home/dlibenzi/google-git/pytorch/xla/torch_xla_py/xla_model.py", line 291, in create_xla_model
    xla_model(*inputs_xla)
RuntimeError: torch_xla/csrc/passes/remove_unused_forward_outputs.cpp:121 : Check failed: df_input_vjps_it != gradient->df_input_vjps.end() 
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	
	torch::jit::RemoveUnusedForwardOutputs(torch::jit::Gradient*)
	torch_xla::XlaModule::ComputeGradient(std::shared_ptr<torch::jit::Graph> const&)
	torch_xla::XlaModule::Initialize(std::vector<std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> >, std::allocator<std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> > > > const&)
	torch_xla::XlaModule::forward(std::vector<std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> >, std::allocator<std::vector<torch_xla::XLATensor, std::allocator<torch_xla::XLATensor> > > > const&)

After exporting TF_CPP_VMODULE we get:

Before RemoveUnusedForwardOutputs:
Forward:
graph(%input.2 : Float(32, 1, 28, 28),
      %1 : Float(10, 1, 5, 5),
      %2 : Float(10),
      %3 : Float(20, 10, 5, 5),
      %4 : Float(20),
      %weight.1 : Float(50, 320),
      %bias.1 : Float(50),
      %weight.2 : Float(10, 50),
      %bias : Float(10)):
  %9 : int[] = prim::Constant[value=[5, 5]]()
  %10 : int[] = prim::Constant[value=[1, 1]]()
  %11 : int[] = prim::Constant[value=[0, 0]]()
  %12 : Float(32, 10, 24, 24), %13 : Tensor, %14 : Tensor = aten::thnn_conv2d_forward(%input.2, %1, %9, %2, %10, %11)
  %15 : int[] = prim::Constant[value=[2, 2]]()
  %16 : int[] = prim::Constant[value=[]]()
  %19 : bool = prim::Constant[value=0](), scope: XlaMNIST
  %input.3 : Float(32, 10, 12, 12), %21 : Long(32, 10, 12, 12) = aten::max_pool2d_with_indices(%12, %15, %16, %11, %10, %19), scope: XlaMNIST
  %input.5 : Float(32, 10, 12, 12) = aten::relu(%input.3), scope: XlaMNIST
  %26 : Float(32, 20, 8, 8), %27 : Tensor, %28 : Tensor = aten::thnn_conv2d_forward(%input.5, %3, %9, %4, %10, %11)
  %input.6 : Float(32, 20, 4, 4), %35 : Long(32, 20, 4, 4) = aten::max_pool2d_with_indices(%26, %15, %16, %11, %10, %19), scope: XlaMNIST
  %x.1 : Float(32, 20, 4, 4) = aten::relu(%input.6), scope: XlaMNIST
  %37 : int[] = prim::Constant[value=[-1, 320]]()
  %input.8 : Float(32, 320) = aten::view(%x.1, %37), scope: XlaMNIST
  %self_size.2 : int[] = aten::size(%x.1)
  %39 : Float(320!, 50!) = aten::t(%weight.1), scope: XlaMNIST/Linear[fc1]
  %40 : Float(32, 50) = aten::mm(%input.8, %39), scope: XlaMNIST/Linear[fc1]
  %41 : int = prim::Constant[value=1]()
  %103 : int[] = aten::size(%bias.1)
  %106 : int[] = aten::size(%40)
  %42 : Float(32, 50) = aten::add(%bias.1, %40, %41), scope: XlaMNIST/Linear[fc1]
  %input.10 : Float(32, 50) = aten::relu(%42), scope: XlaMNIST
  %44 : Float(50!, 10!) = aten::t(%weight.2), scope: XlaMNIST/Linear[fc2]
  %45 : Float(32, 10) = aten::mm(%input.10, %44), scope: XlaMNIST/Linear[fc2]
  %63 : int[] = aten::size(%bias)
  %66 : int[] = aten::size(%45)
  %47 : Float(32, 10) = aten::add(%bias, %45, %41), scope: XlaMNIST/Linear[fc2]
  %49 : Float(32, 10) = aten::log_softmax(%47, %41), scope: XlaMNIST
  return (%49, %12, %13, %14, %21, %input.5, %26, %27, %28, %35, %x.1, %input.8, %self_size.2, %39, %103, %106, %input.10, %44, %63, %66, %47)

Backward:
graph(%0 : Float(32, 10),
      %1 : Float(32, 10, 24, 24),
      %2 : Float(32, 10, 12, 12),
      %3 : Float(32, 20, 8, 8),
      %4 : Float(32, 20, 4, 4),
      %5 : Float(32, 320),
      %6 : Float(320!, 50!),
      %7 : Float(32, 50),
      %8 : Float(50!, 10!),
      %9 : Float(32, 10),
      %input.1 : Float(32, 1, 28, 28),
      %11 : Float(10, 1, 5, 5),
      %12 : Float(20, 10, 5, 5),
      %13 : Float(32, 10, 24, 24),
      %14 : Tensor,
      %15 : Tensor,
      %16 : Long(32, 10, 12, 12),
      %input.4 : Float(32, 10, 12, 12),
      %18 : Float(32, 20, 8, 8),
      %19 : Tensor,
      %20 : Tensor,
      %21 : Long(32, 20, 4, 4),
      %x : Float(32, 20, 4, 4),
      %input.7 : Float(32, 320),
      %self_size.1 : int[],
      %25 : Float(320!, 50!),
      %26 : int[],
      %27 : int[],
      %input.9 : Float(32, 50),
      %29 : Float(50!, 10!),
      %30 : int[],
      %31 : int[],
      %32 : Float(32, 10),
      %33 : Float(32, 10)):
  %52 : int = prim::Constant[value=1]()
  %42 : int = prim::Constant[value=0]()
  %41 : int[] = prim::Constant[value=[2, 2]]()
  %40 : int[] = prim::Constant[value=[]]()
  %37 : bool = prim::Constant[value=0](), scope: XlaMNIST
  %36 : int[] = prim::Constant[value=[5, 5]]()
  %35 : int[] = prim::Constant[value=[1, 1]]()
  %34 : int[] = prim::Constant[value=[0, 0]]()
  %grad_self.1 : Tensor = aten::_log_softmax_backward_data(%0, %33, %52, %32)
  %126 : int = prim::Constant[value=1]()
  %127 : Float(32, 10) = aten::add(%9, %grad_self.1, %126)
  %61 : Tensor = aten::_grad_sum_to_size(%127, %30)
  %62 : Tensor = aten::mul(%127, %52)
  %63 : Tensor = aten::_grad_sum_to_size(%62, %31)
  %66 : Tensor = aten::t(%29)
  %grad_self.3 : Tensor = aten::mm(%63, %66)
  %68 : Tensor = aten::t(%input.9)
  %grad_mat2.1 : Tensor = aten::mm(%68, %63)
  %128 : int = prim::Constant[value=1]()
  %129 : Float(50, 10) = aten::add(%8, %grad_mat2.1, %128)
  %130 : int = prim::Constant[value=1]()
  %131 : Float(32, 50) = aten::add(%7, %grad_self.3, %130)
  %73 : Tensor = aten::t(%129)
  %146 : Tensor = aten::threshold_backward(%131, %input.9, %42)
  %80 : Tensor = aten::_grad_sum_to_size(%146, %26)
  %81 : Tensor = aten::mul(%146, %52)
  %82 : Tensor = aten::_grad_sum_to_size(%81, %27)
  %85 : Tensor = aten::t(%25)
  %grad_self.5 : Tensor = aten::mm(%82, %85)
  %87 : Tensor = aten::t(%input.7)
  %grad_mat2.3 : Tensor = aten::mm(%87, %82)
  %132 : int = prim::Constant[value=1]()
  %133 : Float(320, 50) = aten::add(%6, %grad_mat2.3, %132)
  %134 : int = prim::Constant[value=1]()
  %135 : Float(32, 320) = aten::add(%5, %grad_self.5, %134)
  %92 : Tensor = aten::t(%133)
  %94 : Tensor = aten::reshape(%135, %self_size.1)
  %136 : int = prim::Constant[value=1]()
  %137 : Float(32, 20, 4, 4) = aten::add(%4, %94, %136)
  %147 : Tensor = aten::threshold_backward(%137, %x, %42)
  %grad_self.7 : Tensor = aten::max_pool2d_with_indices_backward(%147, %18, %41, %40, %34, %35, %37, %21)
  %138 : int = prim::Constant[value=1]()
  %139 : Float(32, 20, 8, 8) = aten::add(%3, %grad_self.7, %138)
  %144 : bool[] = prim::Constant[value=[1, 1, 1]]()
  %grad_self.9 : Tensor, %grad_weight.1 : Tensor, %grad_bias.1 : Tensor = aten::thnn_conv2d_backward(%139, %input.4, %12, %36, %35, %34, %19, %20, %144)
  %140 : int = prim::Constant[value=1]()
  %141 : Float(32, 10, 12, 12) = aten::add(%2, %grad_self.9, %140)
  %148 : Tensor = aten::threshold_backward(%141, %input.4, %42)
  %grad_self.11 : Tensor = aten::max_pool2d_with_indices_backward(%148, %13, %41, %40, %34, %35, %37, %16)
  %142 : int = prim::Constant[value=1]()
  %143 : Float(32, 10, 24, 24) = aten::add(%1, %grad_self.11, %142)
  %145 : bool[] = prim::Constant[value=[1, 1, 1]]()
  %grad_self.13 : Tensor, %grad_weight.3 : Tensor, %grad_bias.3 : Tensor = aten::thnn_conv2d_backward(%143, %input.1, %11, %36, %35, %34, %14, %15, %145)
  return (%grad_weight.3, %grad_bias.3, %grad_weight.1, %grad_bias.1, %92, %80, %73, %61)

Removing output at index 3 from the forward graph
Removing output at index 2 from %12 : Float(32, 10, 24, 24), %13 : Tensor, %14 : Tensor = aten::thnn_conv2d_forward(%input.2, %1, %9, %2, %10, %11)

2019-04-06 02:50:37.040656: E tensorflow/compiler/xla/xla_client/tf_logging.cc:11] Check failed: df_input_vjps_it != gradient->df_input_vjps.end() 

assignment issue

Minimal repro:

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
xla_device = xm.xla_device()
d = torch.ones((10, 4)).to(xla_device)
d[:, 0::2] = 2
pdb.set_trace()
print(d)

expected:

tensor([[2., 1., 2., 1.],
        [2., 1., 2., 1.],
        [2., 1., 2., 1.],
        [2., 1., 2., 1.],
        [2., 1., 2., 1.],
        [2., 1., 2., 1.],
        [2., 1., 2., 1.],
        [2., 1., 2., 1.],
        [2., 1., 2., 1.],
        [2., 1., 2., 1.]])

but got all ones.

pytorch/xla build fails at 'name 'ProtoInfo' is not defined'

After building and installing PyTorch, running python setup.py install in ~/pytorch/xla fails with the following error, under the system environment listed below:

ERROR: /home/user/.cache/bazel/_bazel_user/d8447ee09863213d5ccbcb3eeef5e515/external/io_bazel_rules_closure/closure/protobuf/closure_proto_library.bzl:66:21: name 'ProtoInfo' is not defined
ERROR: error loading package '': Extension 'closure/protobuf/closure_proto_library.bzl' has errors
ERROR: error loading package '': Extension 'closure/protobuf/closure_proto_library.bzl' has errors
INFO: Elapsed time: 3.034s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)

System Environment:

  • PyTorch : v1.1.0
  • Bazel : 0.12
  • conda : 4.6.14
  • CUDA : 10.1

A quick search suggests this could be caused by using an older version of Bazel. However, with Bazel 0.13, 0.19.1 and 0.20.0 the build fails for the same reason.

What am I missing here? How can I work around this issue?

Thanks for any help!
Sanjay

Pred type causes issues

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
import torch.nn.functional as F
xla_device = xm.xla_device()
# xla_device = 'cpu'
a = (torch.rand(4).to(xla_device) >= 0.5)
b = (torch.rand(4).to(xla_device) >= 0.5)
c = torch.cat([a, b], dim=0)
d = c.sum()
pdb.set_trace()
print(d)

Build Instructions

I am trying to build pytorch-xla using the instructions provided in the README. However, they don't seem to be up to date as there is no .torch_commit_id file.

When I simply use the current PyTorch master branch, everything seems to compile fine. However, the setup.py script hangs (after building everything) at the following message:

building '_XLAC' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/torch_xla
creating build/temp.linux-x86_64-3.7/torch_xla/csrc
creating build/temp.linux-x86_64-3.7/torch_xla/csrc/ops
creating build/temp.linux-x86_64-3.7/torch_xla/csrc/passes

Work with the latest PyTorch core

We're currently pinned to pytorch/pytorch@1ca0ec7. An attempt to use the latest PyTorch breaks a significant number of tests: https://github.com/pytorch/xla/tree/asuhan/upgrade_pytorch.

Several issues need to be addressed to track the latest PyTorch:

Last but not least, we need to live at head and know when breaking changes (for us) to PyTorch happen. Setting up CI will at least let us know.

regarding summary writer + google colab + TPU

Hello,

The SummaryWriter does not work when I do this in Google Colab:

!pip install
http://storage.googleapis.com/pytorch-tpu-releases/tf-1.13/torch-1.0.0a0+1d94a2b-cp36-cp36m-linux_x86_64.whl
http://storage.googleapis.com/pytorch-tpu-releases/tf-1.13/torch_xla-0.1+5622d42-cp36-cp36m-linux_x86_64.whl

and when I upgrade torch,

import torch_xla gives an error.

Is it possible to make both of them work, so I can use a TPU on Google Colab and use SummaryWriter from torch.utils.tensorboard? Also, is it possible to have a GAN example with TPU?
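
For reference, the standard SummaryWriter usage being asked about is roughly the following. This is a minimal sketch, assuming a torch build recent enough to ship torch.utils.tensorboard (and the tensorboard package installed); the log directory name is arbitrary:

import torch
from torch.utils.tensorboard import SummaryWriter

# Arbitrary log directory; point TensorBoard at it afterwards.
writer = SummaryWriter(log_dir='runs/tpu_experiment')

for step in range(10):
    # Log one scalar per step; any training metric works here.
    loss = torch.rand(1).item()
    writer.add_scalar('train/loss', loss, step)

writer.close()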

E: Unable to locate package clang-7 E: Unable to locate package clang++-7 E: Couldn't find any package by regex 'clang++-7'

My VM runs Debian, and when I try to install clang-7 and clang++-7, it fails as follows:

test@tputest:~$ sudo apt-get -y install clang-7 clang++-7
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package clang-7
E: Unable to locate package clang++-7
E: Couldn't find any package by regex 'clang++-7'

How can I solve this problem? Thanks!

torch.clamp produces wrong results

Minimal repro:

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
xla_device = xm.xla_device()


a = torch.randn(3,3)

c = 3.4
a = a.to(xla_device)

b = torch.clamp(a, max=c)
print(a)
print(b)

Example output:

tensor([[ 0.5138,  1.0475,  0.2356],
        [ 0.5381, -2.0426, -0.2799],
        [ 0.6123, -0.9173, -0.7700]])
tensor([[5.1376e-01, 1.0475e+00, 2.3559e-01],
        [5.3809e-01, 1.1755e-38, 1.1755e-38],
        [6.1232e-01, 1.1755e-38, 1.1755e-38]]) ### all e-38 are wrong 

metric report from mask rcnn inference run

Counter: aten::_softmax
  Value: 1
Counter: aten::_th_eq
  Value: 8
Counter: aten::_th_ge
  Value: 20
Counter: aten::_th_gt
  Value: 1
Counter: aten::_th_lt
  Value: 10
Counter: aten::_th_nonzero
  Value: 93
Counter: aten::s__th_and
  Value: 20
Counter: aten::thnn_conv_transpose2d_forward
  Value: 1
Counter: aten::upsample_nearest2d
  Value: 3
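
For context, counters prefixed with aten:: in such a report typically indicate ops that were not lowered to XLA during the run. Below is a minimal sketch of how a report like this can be dumped, assuming the current torch_xla.debug.metrics module; the original report may have been produced with the older torch_xla_py tooling:

import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
x = torch.rand(4, 4, device=device)
y = (x > 0.5).sum()
print(y)  # Forces execution of the pending graph.

# Dumps counters (including aten::* fallbacks) and timing metrics.
print(met.metrics_report())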

int indexing

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
import torch.nn.functional as F
xla_device = xm.xla_device()
# xla_device = 'cpu'
v = torch.randn(5, 7, 3).to(xla_device)
pdb.set_trace()
print(v[:, [0, 4, 2]].shape) # Got (3, 5, 3), expect (5, 3, 3)

wrong shape for sum

import torch
import torch_xla
import torch.nn.functional as F
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
xla_device = xm.xla_device()
# xla_device = 'cpu'
a = torch.rand(3, 4).to(xla_device)

pdb.set_trace()
print(a.sum().shape)

Expected a scalar-shaped tensor, but got shape [4].

PyTorch->XLA interface refinement

@dlibenzi started a discussion with:

I'd like to get a general feeling about an interface we currently use in our tests to hide the setup complexities (tracing, and other things typical to XLA) of a traced model.

Below is how the MNIST training test code looks (resnet18/cifar10 is identical, modulo network and shapes).

We have an XlaModel class, which takes the typical PyTorch nn.Module the users are accustomed to, input/target shapes and types (they default to float32), the loss function, and the devices (num_cores and devices are optional for one core setups).

We then have two main functions, train() and test() (code should be clear enough).

One advantage of having a wrapper class like this is that, as mentioned before, it hides the tracing and setup code, so users won't have to repeat it all over.

With its functionality, it can be plugged under the PyTorch data parallel API AFAICT.

More importantly, it hides the details of the optimizations we do now (and the ones we will do in the future), opening the way to reach top TPU performance with on-device loops and infeeds.
Any thoughts?
Thanks!

def train_mnist():
    lr = 0.01 * FLAGS.num_cores
    momentum = 0.5
    log_interval = max(1, int(10 / FLAGS.num_cores))

    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST(FLAGS.datadir, train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=FLAGS.batch_size, shuffle=True, num_workers=FLAGS.num_workers)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST(FLAGS.datadir, train=False, transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=FLAGS.batch_size, shuffle=True, num_workers=FLAGS.num_workers)

    model = MNIST()

    # Trace the model.
    devices = [':{}'.format(n) for n in range(0, FLAGS.num_cores)]
    xla_model = xmru.XlaModel(model, [FLAGS.batch_size, 1, 28, 28],
                              [FLAGS.batch_size], F.nll_loss,
                              num_cores=FLAGS.num_cores, devices=devices,
                              target_dtype=torch.int64)
    optimizer = optim.SGD(xla_model.parameters_list(), lr=lr, momentum=momentum)

    for epoch in range(0, FLAGS.num_epochs):
        xla_model.train(train_loader, optimizer, FLAGS.batch_size,
                        log_interval=log_interval)
        accuracy = xla_model.test(test_loader,
                                  xmru.category_eval_fn(F.nll_loss),
                                  FLAGS.batch_size)
    return accuracy

Implement remaining unary ops

Hi,

I found a few unimplemented unary operations, and was wondering if I could help out with implementing them if you have plans for supporting them in the future.

I've listed them below (taken using an intersection of ops in XLA and PyTorch):

Thank you.

Proper way to install now on Colab/Linux, also "squeeze" gradient not implemented?

I'm aware this project is still under active development and not all things are ready yet, but I'd like to get some quick feedback on where things are. I've dug through the code and tried a couple of different things.

I first saw the Colab code (https://github.com/pytorch/xla/blob/master/contrib/colab/PyTorch_TPU_XRT_1_13.ipynb) and tried it on Colab. I saw some discrepancy between the code there and code elsewhere. I'm guessing the "train" method is no longer wrapped under XlaModel?

More importantly, the code there (MNIST) works, but once I try using it on my own model (a transformer, BERT), it throws an error:

RuntimeError: differentiation of aten::squeeze is not supported, or it is missing necessary type information

This is weird, since I then looked through this repo and found "squeeze.cpp" implemented under xla/torch_xla/csrc/ops/.

So I thought maybe the Colab wheels (http://storage.googleapis.com/pytorch-tpu-releases/tf-1.13/torch-1.0.0a0+1d94a2b-cp36-cp36m-linux_x86_64.whl) are outdated, and I then looked under http://storage.googleapis.com/pytorch-tpu-releases/ and found a lot of newer releases, but they are built for Python 3.5 and I couldn't install them on Colab.

So, I then tried installing them on a Linux machine (Google's Cloud TPU). Interestingly, I also had to pin MKL to an older version, as in kokoro/ubuntu/common.sh, otherwise importing torch would throw an error about MKL. This time, however, I can't even get the Colab code to work. It throws this message:

RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:109 : Check failed: session_work.first->session()->Run( session_work.second.feed_inputs, session_work.second.outputs_handles, &outputs) == ::tensorflow::Status::OK() (Not found: Op type not registered 'XRTAllocateFromTensor' in binary running on n-80018309-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed. vs. OK)

I'm guessing this is saying I should install from source?

So finally, I tried installing from source following the directions in README.md. I got PyTorch to compile, but got the following message when trying to compile xla:

ERROR: /home/cmw025/pytorch/xla/third_party/tensorflow/tensorflow/core/kernels/BUILD:3371:1: C++ compilation of rule '//tensorflow/core/kernels:reduction_ops' failed (Exit 1)
In file included from external/eigen_archive/unsupported/Eigen/CXX11/Tensor:124:0,
                 from ./third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1,
                 from ./tensorflow/core/kernels/reduction_ops_common.h:27,
                 from tensorflow/core/kernels/reduction_ops_sum.cc:16:
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h: In static member function 'static void std::_Function_handler<void(_ArgTypes ...), _Functor>::_M_invoke(const std::_Any_data&, _ArgTypes&& ...) [with _Functor = Eigen::internal::TensorExecutor<Expression, Eigen::ThreadPoolDevice, Vectorizable, Tileable>::run(const Expression&, const Eigen::ThreadPoolDevice&) [with Expression = const Eigen::TensorAssignOp<...>; bool Vectorizable = true; bool Tileable = false]::<lambda(...)>; _ArgTypes = {long int, long int}]':
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h:801:9: internal compiler error: in emit_move_insn, at expr.c:3547
   values[i] = internal::InnerMostDimReducer<Self, Op>::reduce(*this, firstIndex + i * num_values_to_reduce,
   ^~~~~~
Please submit a full bug report,
with preprocessed source if appropriate.
See file:///usr/share/doc/gcc-6/README.Bugs for instructions.
Target //tensorflow/compiler/xla/xla_client:libxla_computation_client.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 1306.383s, Critical Path: 52.54s
INFO: 736 processes: 736 local.
FAILED: Build did NOT complete successfully

I also get this bazel warning, not sure if that's what's causing the problem:

WARNING: build_bazel_rules_apple depends on bazel_skylib loaded from https://github.com/bazelbuild/bazel-skylib.git (tag 0.6.0), but we have detected it already loaded into your workspace from None (tag None). You may run into compatibility issues. To silence this warning, pass ignore_version_differences = True to apple_rules_dependencies().

I'm also a little confused, since the python script (setup.py) doesn't seem to make use of the kokoro/ubuntu/common.sh script. How do you actually build now? Any advice is appreciated. Thank you so much.

segfaults in randperm

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
xla_device = xm.xla_device()
# xla_device = 'cpu'
pdb.set_trace()
p = torch.randperm(3, device=xla_device)
print(p)

updating the torch and torch_xla wheels in the colab notebook

The pip libraries listed here seem to be outdated (also discussed with @asuhan on Slack and in #528). I am using the nightly builds of TF (1.14.1).

I get the following error when importing torch:
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory

Seems like these are compiled using CUDA libraries. Do you have the corresponding CPU versions?

Also, is there an official page where you list the nightly builds of torch_xla?

I also tried building from source, but it didn't help either.

zeros_like with dtype

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
# xla_device = xm.xla_device()
xla_device = 'cpu'
d = torch.rand(3, 3).to(xla_device)
a = torch.zeros_like(d, dtype=torch.int8)
pdb.set_trace()
print(a)

Even with xla_device = 'cpu', the script above still fails with TypeError: _torch_xla_zeros_like() got an unexpected keyword argument 'dtype'.
torch.zeros_like with dtype should be supported.

reduction on preds

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
xla_device = xm.xla_device()
# xla_device = 'cpu'
a = torch.rand(61440).to(xla_device) - 0.5
pdb.set_trace()
print((a < 0).sum())
print((a == 0).sum())
print((a > 0).sum())

Negative index of tensor

Minimal repro:

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm

a = torch.randn(3, 800, 1066)
xla_device = xm.xla_device()

b = a.to(xla_device)
print(b[:, :, -1])

[XLATensor] Support for the indices output of max_pool2d_with_indices

Currently, trying to access the returned indices tensor throws a "Node not supported" exception. Our implementation of max_pool2d_with_indices_backward doesn't use it, but a user could trigger the problem by running max_pool2d with return_indices=True. We want a fallback path instead of throwing when indices are requested -- this can be an XLA CustomCall or an actual lowering.
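
A minimal repro sketch in the style of the other snippets here (not taken from the original report; the shapes are arbitrary):

import torch
import torch.nn.functional as F
import torch_xla
import torch_xla_py.xla_model as xm

xla_device = xm.xla_device()
x = torch.randn(1, 1, 8, 8).to(xla_device)
# Requesting the indices output is what currently hits the
# "Node not supported" path described above.
out, indices = F.max_pool2d(x, kernel_size=2, return_indices=True)
print(indices)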

slice issue

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
# xla_device = xm.xla_device()
xla_device = 'cpu'
d = torch.ones((1000, 324)).to(xla_device)
dw = d[:, 2::4]
print(dw.shape)
dw = torch.clamp(dw, max=3.1)
pdb.set_trace()
print(dw.shape)

torch.min/max broadcasting

import torch
import torch_xla
import torch_xla_py.utils as xu
import torch_xla_py.xla_model as xm
import pdb
xla_device = xm.xla_device()
# xla_device = 'cpu'
d = torch.rand(3, 1, 2).to(xla_device)
a = torch.rand(4, 2).to(xla_device)
b = torch.max(a, d)
pdb.set_trace()
print(b)

Expected: b should be of shape [3, 4, 2].
