Comments (6)

JackCaoG commented on July 30, 2024

@will-cromar FYI, and @wonjoolee95 too, since you are fixing a similar issue for our GPU wheels.

from xla.

wonjoolee95 commented on July 30, 2024

This is helpful, thanks for the info! I'm able to reproduce:

# Fails
wonjoo@t1v-n-b72eb559-w-0:~$ pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp311-cp311-linux_x86_64.whl
Defaulting to user installation because normal site-packages is not writeable
ERROR: Invalid requirement: 'torch-xla==nightly': Expected end or semicolon (after name and no valid version specifier)
    torch-xla==nightly
             ^
# Works
wonjoo@t1v-n-b72eb559-w-0:~$ pip install "pip<24"
Defaulting to user installation because normal site-packages is not writeable
Collecting pip<24
  Downloading pip-23.3.2-py3-none-any.whl.metadata (3.5 kB)
Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 34.0 MB/s eta 0:00:00
WARNING: Error parsing dependencies of distro-info: Invalid version: '1.1build1'
WARNING: Error parsing dependencies of python-debian: Invalid version: '0.1.43ubuntu1'
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
  WARNING: The scripts pip, pip3 and pip3.10 are installed in '/home/wonjoo/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-23.3.2

I think it's better if we run pip install "pip<24" to fix our GPU wheels ASAP, and then come up with a longer-term solution. @will-cromar, do you know where the correct place would be to put this pip install "pip<24" command in our /infra files?
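As a rough illustration of why the first command fails: the second dash-separated field of a wheel filename is the version, and "nightly" is not a valid PEP 440 version. The sketch below is only an approximation. The regular expression is a simplified adaptation of the pattern in the PEP 440 appendix, not pip's actual implementation, and the helper names wheel_version and is_valid_version are made up for this example.

```python
import re

# Simplified adaptation of the version pattern from the PEP 440 appendix.
# This is an illustrative approximation, not the code pip runs.
_PEP440 = re.compile(
    r"""
    ^v?
    (?:[0-9]+!)?                                                  # epoch
    [0-9]+(?:\.[0-9]+)*                                           # release segment
    (?:[-_.]?(?:a|b|c|rc|alpha|beta|pre|preview)[-_.]?[0-9]*)?    # pre-release
    (?:-[0-9]+|[-_.]?(?:post|rev|r)[-_.]?[0-9]*)?                 # post-release
    (?:[-_.]?dev[-_.]?[0-9]*)?                                    # dev release
    (?:\+[a-z0-9]+(?:[-_.][a-z0-9]+)*)?                           # local version
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)


def wheel_version(filename: str) -> str:
    """Return the version field (second '-'-separated field) of a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    return filename[: -len(".whl")].split("-")[1]


def is_valid_version(version: str) -> bool:
    """Check a version string against the simplified PEP 440 pattern above."""
    return _PEP440.match(version.strip()) is not None


if __name__ == "__main__":
    name = "torch_xla-nightly-cp311-cp311-linux_x86_64.whl"
    version = wheel_version(name)
    print(version, is_valid_version(version))
```

With this check, "nightly" and "nightly+20240730" are rejected because a version must start with a numeric release segment, while strings such as "2.5.0+git41d998d" or "2.5.0.dev20240730" are accepted.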

will-cromar commented on July 30, 2024

Is this issue actually what's causing our build breakage? Why are the TPU builds passing but not the GPU builds? The most recent failure I see there is this:

Step #2 - "build_xla_docker_image":     ERROR: An error occurred during the fetch of repository 'go_sdk':
Step #2 - "build_xla_docker_image":        Traceback (most recent call last):
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 101, column 16, in _go_download_sdk_impl
Step #2 - "build_xla_docker_image":                     _remote_sdk(ctx, [url.format(filename) for url in ctx.attr.urls], ctx.attr.strip_prefix, sha256)
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 209, column 21, in _remote_sdk
Step #2 - "build_xla_docker_image":                     ctx.download(
Step #2 - "build_xla_docker_image":     Error in download: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725
Step #2 - "build_xla_docker_image":     ERROR: /src/pytorch/xla/WORKSPACE:136:15: fetching _go_download_sdk rule //external:go_sdk: Traceback (most recent call last):
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 101, column 16, in _go_download_sdk_impl
Step #2 - "build_xla_docker_image":                     _remote_sdk(ctx, [url.format(filename) for url in ctx.attr.urls], ctx.attr.strip_prefix, sha256)
Step #2 - "build_xla_docker_image":             File "/root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/io_bazel_rules_go/go/private/sdk.bzl", line 209, column 21, in _remote_sdk
Step #2 - "build_xla_docker_image":                     ctx.download(
Step #2 - "build_xla_docker_image":     Error in download: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725
Step #2 - "build_xla_docker_image":     ERROR: Analysis of target '//:_XLAC.so' failed; build aborted: java.io.IOException: Error downloading [https://dl.google.com/go/go1.18.4.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/2ba57cc32d8c1f12152416615363d16d/external/go_sdk/go_sdk.tar.gz: Bytes read 127925296 but wanted 141812725

Even if we can hack our build, this is a client issue. Nobody who updated their pip recently would be able to install our wheels, because the rename we're doing is no longer actually valid.

The build version we set is defined by some combination of these environment variables: https://github.com/pytorch/xla/blob/master/infra/ansible/config/env.yaml

I think TORCH_XLA_VERSION and GIT_VERSIONED_XLA_BUILD are the important ones, but you'll have to review setup.py to see exactly how we set the version. That version name is probably still valid, e.g. torch_xla-2.5.0+git41d998d. The problem is that we rename the wheels with the nightly date here:

- name: Rename and append +YYYYMMDD suffix to nightly wheels
  ansible.builtin.shell: |
    pushd /tmp/staging-wheels
    cp {{ item.dir }}/*.whl .
    rename -v "s/^{{ item.prefix }}-(.*?)-cp/{{ item.prefix }}-nightly-cp/" *.whl
    mv /tmp/staging-wheels/* /dist/
    popd
    rename -v "s/^{{ item.prefix }}-(.*?)-cp/{{ item.prefix }}-nightly+$(date -u +%Y%m%d)-cp/" *.whl
  args:
    executable: /bin/bash
    chdir: "{{ item.dir }}"
  loop:
    - { dir: "{{ (src_root, 'pytorch/dist') | path_join }}", prefix: "torch" }
    - { dir: "{{ (src_root, 'pytorch/xla/dist') | path_join }}", prefix: "torch_xla" }
  when: nightly_release

We need to at least change that rename to one of the valid patterns, like @fellhorn suggested, or copy the pattern used by torch (e.g. torch-X.Y.Z.devYYYYMMDD).
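To make the torch-style pattern concrete, here is a minimal sketch of how a wheel filename could be rewritten to X.Y.Z.devYYYYMMDD. It is not the fix that was applied to the playbook. The function name nightly_wheel_name is hypothetical, the Python and platform tags in the example filename are illustrative, and dropping the local "+git..." segment is an assumption made to keep the example short. It also only computes a new filename; the version recorded in the wheel's own metadata is not touched.

```python
from datetime import date


def nightly_wheel_name(filename: str, build_date: date) -> str:
    """Rewrite the version field of a wheel filename to X.Y.Z.devYYYYMMDD.

    Assumes the usual name-version-pythontag-abitag-platform.whl layout.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    name, version, *tags = filename[: -len(".whl")].split("-")
    # Keep only the public release part, e.g. "2.5.0" from "2.5.0+git41d998d".
    release = version.split("+", 1)[0]
    new_version = f"{release}.dev{build_date:%Y%m%d}"
    return "-".join([name, new_version, *tags]) + ".whl"


if __name__ == "__main__":
    original = "torch_xla-2.5.0+git41d998d-cp310-cp310-linux_x86_64.whl"
    print(nightly_wheel_name(original, date(2024, 7, 30)))
```

For the example input this yields torch_xla-2.5.0.dev20240730-cp310-cp310-linux_x86_64.whl, which has the same shape as the torch nightly names.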

You can dry run the ansible workflow with a command like this one:

ansible-playbook playbook.yaml -vvv -e "stage=build arch=amd64 accelerator=tpu src_root=${GITHUB_WORKSPACE} bundle_libtpu=0 build_cpp_tests=1 git_versioned_xla_build=1 cache_suffix=-ci" --skip-tags=fetch_srcs,install_deps

Anything that gets written to /dist is what we will upload to GCS.

JackCaoG commented on July 30, 2024

@zpcore can you make the rename logic change that @will-cromar mentioned above, since you are off call this week? It should just be a one-line change, but then we need to update the README to reflect the new format.

mfatih7 commented on July 30, 2024

Hello all

As a general comment:

When users find errors in pytorch-xla, developers fix them in nightly releases and ask the users to test them.
But setting up an environment with compatible torch-xla, torch, and torchvision is not straightforward, as described here.

This issue is one example of it.
I hope you can provide an easier way for users to test the nightly updates.

zpcore commented on July 30, 2024

Thanks for the feedback. I think we are missing example commands for installing compatible torch, torchvision, torchaudio, and torch_xla builds for CUDA. We will update the documentation. For now, you can use, e.g.:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.1/torch_xla-nightly-cp310-cp310-linux_x86_64.whl

In general, this should be compatible.
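After running the two commands, one quick way to see which builds ended up in the environment is to list the installed versions. This is a small sketch using only the standard library. The helper name installed_versions is hypothetical, and it only reports what is installed; it does not verify that the builds are actually compatible with each other.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["torch", "torchvision", "torchaudio", "torch_xla"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing the date portion of the reported nightly versions is a reasonable first sanity check when something fails to import.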
