
Introduction

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in PyTorch. Some of the code here will eventually be included in upstream PyTorch. The intent of Apex is to make up-to-date utilities available to users as quickly as possible.

Full API Documentation: https://nvidia.github.io/apex

Contents

1. Amp: Automatic Mixed Precision

apex.amp is a tool to enable mixed precision training by changing only 3 lines of your script. Users can easily experiment with different pure and mixed precision training modes by supplying different flags to amp.initialize.
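
A minimal sketch of those three lines (illustrative only; it assumes a trivial model and optimizer and a CUDA-capable device):

import torch
from apex import amp

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Line 1: let Amp patch the model/optimizer for the chosen opt_level
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(8, 16, device="cuda")).sum()

# Lines 2-3: scale the loss so fp16 gradients do not underflow, then backprop
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()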

Webinar introducing Amp (The flag cast_batchnorm has been renamed to keep_batchnorm_fp32).

API Documentation

Comprehensive Imagenet example

DCGAN example coming soon...

Moving to the new Amp API (for users of the deprecated "Amp" and "FP16_Optimizer" APIs)

2. Distributed Training

apex.parallel.DistributedDataParallel is a module wrapper, similar to torch.nn.parallel.DistributedDataParallel. It enables convenient multiprocess distributed training, optimized for NVIDIA's NCCL communication library.

API Documentation

Python Source

Example/Walkthrough

The Imagenet example shows use of apex.parallel.DistributedDataParallel along with apex.amp.
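
A minimal sketch of the wrapper in use (hedged: it assumes a torchrun-style launcher that sets LOCAL_RANK and launches one process per GPU; the Imagenet example covers the full workflow):

import os
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# One process per GPU; LOCAL_RANK is set by the launcher (assumption)
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 4).cuda()
model = DDP(model)  # gradients are allreduced across processes during backward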

Synchronized Batch Normalization

apex.parallel.SyncBatchNorm extends torch.nn.modules.batchnorm._BatchNorm to support synchronized BN. It allreduces stats across processes during multiprocess (DistributedDataParallel) training. Synchronous BN has been used in cases where only a small local minibatch can fit on each GPU. Allreduced stats increase the effective batch size for the BN layer to the global batch size across all processes (which, technically, is the correct formulation). Synchronous BN has been observed to improve converged accuracy in some of our research models.
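
A minimal sketch of enabling synchronized BN via the provided conversion utility (hedged: a toy model is assumed, and the conversion should happen before wrapping the model in DistributedDataParallel):

import torch
from apex.parallel import convert_syncbn_model

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
).cuda()

# Replaces the torch.nn BatchNorm layers with apex.parallel.SyncBatchNorm
model = convert_syncbn_model(model)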

Checkpointing

To properly save and load your Amp training state, we introduce amp.state_dict(), which contains all loss_scalers and their corresponding unskipped steps, and amp.load_state_dict() to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow:

# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

# Train your model
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...

# Restore
model = ...
optimizer = ...
checkpoint = torch.load('amp_checkpoint.pt')

model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])

# Continue training
...

Note that we recommend restoring the model using the same opt_level. Also note that we recommend calling the load_state_dict methods after amp.initialize.

Installation

Containers

NVIDIA PyTorch Containers are available on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch. The containers come with all the custom extensions available at the moment.

See the NGC documentation for details such as:

  • how to pull a container
  • how to run a pulled container
  • release notes

From Source

To install Apex from source, we recommend using the nightly PyTorch obtainable from https://github.com/pytorch/pytorch.

The latest stable release obtainable from https://pytorch.org should also work.

ROCm

Apex on ROCm supports both a Python-only build and an extension build. Note: PyTorch >= 1.5 is recommended for the extension build.

To install using the Python-only build, run the following command in the apex folder:

python setup.py install


Supported Versions

APEX Version   APEX Branch      Torch Version
1.3.0          master           2.3
1.2.0          release/1.2.0    2.2
1.1.0          release/1.1.0    2.1
1.0.0          release/1.0.0    2.0 and older

The relation between APEX and ROCm PyTorch versions is maintained in the related_commits file in ROCm PyTorch release branches, in the following format:

ubuntu|pytorch|apex|release/1.0.0|06c33eee43f7a22f3ed7d9c3e5be0ddd757dc345|https://github.com/ROCmSoftwarePlatform/apex
centos|pytorch|apex|release/1.0.0|06c33eee43f7a22f3ed7d9c3e5be0ddd757dc345|https://github.com/ROCmSoftwarePlatform/apex
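
A minimal sketch of reading that pipe-delimited format (the field names below are assumptions inferred from the example lines, not an official schema):

# Hypothetical helper for the related_commits format shown above
def parse_related_commit(line: str) -> dict:
    os_name, base, project, branch, commit, url = line.strip().split("|")
    return {
        "os": os_name,       # e.g. "ubuntu" or "centos"
        "base": base,        # e.g. "pytorch" (assumed meaning)
        "project": project,  # e.g. "apex"
        "branch": branch,    # e.g. "release/1.0.0"
        "commit": commit,    # pinned apex commit hash
        "url": url,          # apex repository URL
    }

print(parse_related_commit(
    "ubuntu|pytorch|apex|release/1.0.0|06c33eee43f7a22f3ed7d9c3e5be0ddd757dc345|https://github.com/ROCmSoftwarePlatform/apex"
))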

To install with extensions enabled, use the following command in the apex folder:

# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
python setup.py install --cpp_ext --cuda_ext

Note that installing Apex with the --cuda_ext flag also enables all the extensions supported on ROCm, including "--distributed_adam", "--distributed_lamb", "--bnp", "--xentropy", "--deprecated_fused_adam", "--deprecated_fused_lamb", and "--fast_multihead_attn".
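
After an extension build, a quick sanity check (hedged: multi_tensor_applier only reports whether the core fused multi-tensor kernels loaded; the contrib extensions import separately):

# True means the fused multi-tensor CUDA/HIP extension was built and loads correctly
from apex.multi_tensor_apply import multi_tensor_applier

print(multi_tensor_applier.available)
if not multi_tensor_applier.available:
    print(multi_tensor_applier.import_err)  # reason the import failed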

Linux

For performance and full functionality, we recommend installing Apex with CUDA and C++ extensions via

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Apex also supports a Python-only build via

pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./

A Python-only build omits:

  • Fused kernels required to use apex.optimizers.FusedAdam.
  • Fused kernels required to use apex.normalization.FusedLayerNorm and apex.normalization.FusedRMSNorm.
  • Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
  • Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp. DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
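
For example, apex.optimizers.FusedAdam from the list above is only usable when the fused kernels have been built; a minimal sketch, assuming an install with --cpp_ext --cuda_ext:

import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(16, 4).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3)  # drop-in for torch.optim.Adam/AdamW

loss = model(torch.randn(8, 16, device="cuda")).sum()
loss.backward()
optimizer.step()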

[Experimental] Windows

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . may work if you were able to build PyTorch from source on your system. A Python-only build via pip install -v --no-cache-dir . is more likely to work.
If you installed PyTorch in a Conda environment, make sure to install Apex in that same environment.


Issues

Failing tests in --peer_memory

Please see the comment in the PR where we enabled the --peer_memory and --nccl_p2p extensions: #87 (comment)

Some tests fail sporadically on ROCm when running the following test script:

cd apex/contrib/peer_memory
torchrun --nproc_per_node 2 peer_halo_exchange_module_tests.py

Problems building apex with ROCm-5.4, 5.5, and 5.6

Describe the Bug
The latest master branch fails to build with several ROCm versions, including 5.4, 5.5, and 5.6.

Rolling back to the commit made on June 20 (git checkout 10c7482) allows ROCm-5.4 to build. The build still fails for 5.5 and 5.6 but with a different error.

Minimal Steps/Code to Reproduce the Bug

For ROCm-5.4.3, I use the following to build:

virtualenv --system-site-packages env
source env/bin/activate

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2

git clone --recursive https://github.com/ROCmSoftwarePlatform/apex.git
cd apex

export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.4.3
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

The build fails when compiling csrc/mlp_hip.hip with errors like the following:

  csrc/mlp_hip.hip:65:53: error: unknown type name 'hipblasOperation_t'; did you mean 'hipsparseOperation_t'?
  static rocblas_operation hipOperationToRocOperation(hipblasOperation_t op)
                                                      ^~~~~~~~~~~~~~~~~~
                                                      hipsparseOperation_t
  /opt/rocm-5.4.3/include/hipsparse/hipsparse.h:317:3: note: 'hipsparseOperation_t' declared here
  } hipsparseOperation_t;
    ^
  csrc/mlp_hip.hip:69:10: error: use of undeclared identifier 'HIPBLAS_OP_N'
      case HIPBLAS_OP_N:
           ^
  csrc/mlp_hip.hip:71:10: error: use of undeclared identifier 'HIPBLAS_OP_T'
      case HIPBLAS_OP_T:
           ^
  csrc/mlp_hip.hip:73:10: error: use of undeclared identifier 'HIPBLAS_OP_C'
      case HIPBLAS_OP_C:
           ^
  csrc/mlp_hip.hip:79:8: error: unknown type name 'hipblasStatus_t'; did you mean 'hipsparseStatus_t'?
  static hipblasStatus_t rocBLASStatusToHIPStatus(rocblas_status error)
         ^~~~~~~~~~~~~~~
         hipsparseStatus_t
  /opt/rocm-5.4.3/include/hipsparse/hipsparse.h:188:3: note: 'hipsparseStatus_t' declared here
  } hipsparseStatus_t;
    ^

Rolling back to the commit from June 20 allows the build to complete:

cd apex

git checkout 10c7482
git submodule init
git submodule update

export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.4.3
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

Building apex from master with ROCm-5.5 and ROCm-5.6 fails with errors that are similar to each other but distinct from those seen with ROCm-5.4. Here are the steps I used to build with ROCm-5.6:

virtualenv --system-site-packages env
source env/bin/activate

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6

git clone --recursive https://github.com/ROCmSoftwarePlatform/apex.git
cd apex

export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.6.0
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

That fails with the following error:

  csrc/mlp_hip.hip:91:10: error: use of undeclared identifier 'rocblas_status_excluded_from_build'
      case rocblas_status_excluded_from_build:
           ^
  csrc/mlp_hip.hip:104:10: error: use of undeclared identifier 'rocblas_status_arch_mismatch'; did you mean 'rocblas_status_size_query_mismatch'?
      case rocblas_status_arch_mismatch:
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
           rocblas_status_size_query_mismatch
  /opt/rocm-5.6.0/include/rocblas/internal/rocblas-types.h:212:5: note: 'rocblas_status_size_query_mismatch' declared here
      rocblas_status_size_query_mismatch = 8, /**< Unmatched start/stop size query */
      ^
  csrc/mlp_hip.hip:104:10: error: duplicate case value 'rocblas_status_size_query_mismatch'
      case rocblas_status_arch_mismatch:
           ^
  csrc/mlp_hip.hip:96:10: note: previous case defined here
      case rocblas_status_size_query_mismatch:
           ^

In this case, rolling back to the June 20 commit fails with a different error:

  csrc/mlp_hip.hip:89:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:92:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:96:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:99:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:101:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
        rocblas_datatype_f64_r,
        ^
  csrc/mlp_hip.hip:102:7: error: use of undeclared identifier 'rocblas_gemm_algo_standard'
        rocblas_gemm_algo_standard,
        ^

Building with the June 20 commit, I see that the csrc/mlp_hip.hip file contains the following for ROCm-5.5 and ROCm-5.6 (which fails):

/* Includes, cuda */
#include <hipblas/hipblas.h>
#include <hip/hip_runtime.h>

but it has the following for ROCm-5.4 (which builds):

/* Includes, cuda */
#include <rocblas/rocblas.h>
#include <hip/hip_runtime.h>


Errors when building on an MI250x server with ROCm 5.7 and PyTorch 2.2.1

Describe the Bug

When installing on an MI250x server with ROCm 5.7 and PyTorch 2.2.1, I obtained the following errors:

 /home/user/code/apex/csrc/fused_dense_hip.hip:63:7: error: use of undeclared identifier 'CUBLAS_COMPUTE_64F'
        CUBLAS_COMPUTE_64F,
        ^
  /home/user/code/apex/csrc/fused_dense_hip.hip:101:7: error: use of undeclared identifier 'CUBLAS_COMPUTE_32F'
        CUBLAS_COMPUTE_32F,
        ^
  /home/user/code/apex/csrc/fused_dense_hip.hip:139:7: error: use of undeclared identifier 'CUBLAS_COMPUTE_16F'
        CUBLAS_COMPUTE_16F,
        ^
  3 errors generated when compiling for gfx90a.

Minimal Steps/Code to Reproduce the Bug

git clone https://github.com/ROCm/apex.git
cd apex
git checkout release/1.2.0
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export GPU_ARCHS=gfx90a
export MAX_JOBS=8 
pip install -v --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --user

Environment

Collecting environment information...
PyTorch version: 2.2.1+rocm5.7
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.7.31921-d1770ee1b

OS: Red Hat Enterprise Linux release 8.8 (Ootpa) (x86_64)
GCC version: (GCC) 12.2.1 20221121 (Red Hat 12.2.1-7)
Clang version: 15.0.7 (Red Hat 15.0.7-1.module+el8.8.0+17939+b58878af)
CMake version: version 3.28.3
Libc version: glibc-2.28

Python version: 3.10.10 (main, Apr 14 2023, 19:33:04) [GCC 10.3.1 20210422 (Red Hat 10.3.1-1)] (64-bit runtime)
Python platform: Linux-4.18.0-477.10.1.el8_8.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI250X (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.7.31921
MIOpen runtime version: 2.20.0
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               48
Model name:          AMD EPYC 7A53 64-Core Processor
Stepping:            1
CPU MHz:             2000.000
CPU max MHz:         3541.0149
CPU min MHz:         1500.0000
BogoMIPS:            3992.70
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-15,64-79
NUMA node1 CPU(s):   16-31,80-95
NUMA node2 CPU(s):   32-47,96-111
NUMA node3 CPU(s):   48-63,112-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-triton-rocm==2.2.0
[pip3] torch==2.2.1+rocm5.7
[pip3] torchaudio==2.2.1+rocm5.7
[pip3] torching==0.0.1
[pip3] torchvision==0.17.1+rocm5.7
[conda] Could not collect

Is ROCm apex.amp deprecated & behavior mismatch vs NVIDIA APEX

Hi, I am wondering whether ROCm apex.amp is deprecated. NVIDIA APEX has some deprecation warnings that are not present in this repo: https://github.com/NVIDIA/apex/pull/1506/files

Moreover, I notice that this code

import torch
import torch.nn as nn
from apex import amp

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(3, 4)

    def forward(self, attn_probs, value_states):
        attn_output = torch.bmm(attn_probs, value_states)
        return attn_output

from torch.optim import AdamW

model = MyModule().to("cuda")
optimizer = AdamW(model.parameters())

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

attn_probs = torch.rand(4, 16, 16).to("cuda")
value_states = torch.rand(4, 16, 2).to(torch.float16).to("cuda")

attn_output = model(attn_probs, value_states)

runs fine with NVIDIA APEX but fails on ROCm APEX with the following log:

Traceback (most recent call last):
  File "run_bmm.py", line 26, in <module>
    attn_output = model(attn_probs, value_states)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "run_bmm.py", line 11, in forward
    attn_output = torch.bmm(attn_probs, value_states)
RuntimeError: expected scalar type Half but found Float

However, using torch.cuda.amp.autocast instead works fine on both ROCm and CUDA devices (with torch 2.0.1).
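
For reference, a minimal sketch of that torch.cuda.amp.autocast path (assuming the same model and inputs as the snippet above):

# Native autocast handles the float32/float16 mix without amp.initialize
with torch.cuda.amp.autocast():
    attn_output = model(attn_probs, value_states)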

Thank you!

Updated pip no longer supports `--install-option` for building without cloning

Describe the Bug
The latest changes to pip no longer support --install-option, which the directions here used as a way to install without cloning the source.

Minimal Steps/Code to Reproduce the Bug
Run the following commands:

pip install -U pip
pip install ninja
pip install -v --install-option="--cpp_ext" --install-option="--cuda_ext" 'git+https://github.com/ROCmSoftwarePlatform/apex.git'

Expected Behavior
Installation of apex.

Environment
Any with the latest pip.

Unable to compile ROCm APEX on ROCm 3.5 PyTorch Docker Image

Hi there,
I am trying to compile apex with cuda_ext and cpp_ext enabled. I am using the latest PyTorch Docker image from ROCm and compiling apex there. I get many errors, such as the following.

/opt/rocm/include/hip/hip_common.h:30:9: warning: '__HIP_PLATFORM_HCC__' macro redefined [-Wmacro-redefined]
#define __HIP_PLATFORM_HCC__
        ^
<command line>:12:9: note: previous definition is here
#define __HIP_PLATFORM_HCC__ 1
        ^
error: expected '>'
csrc/hip/welford.hip:935:40: note: to match this '<'
     hipLaunchKernelGGL( welford_kernel<scalar_t_0, accscalar_t, accscalar_t>, dim3(grid), dim3(block), 0, stream,
                                       ^
error: expected '>'
csrc/hip/welford.hip:935:40: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:974:50: note: to match this '<'
     hipLaunchKernelGGL( batchnorm_forward_kernel<scalar_t_0, accscalar_t, accscalar_t>, dim3(grid), dim3(block), 0, stream,
                                                 ^
error: expected '>'
csrc/hip/welford.hip:974:50: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:992:50: note: to match this '<'
     hipLaunchKernelGGL( batchnorm_forward_kernel<scalar_t_0, accscalar_t, scalar_t_0>, dim3(grid), dim3(block), 0, stream,
                                                 ^
error: expected '>'
csrc/hip/welford.hip:992:50: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:1045:42: note: to match this '<'
     hipLaunchKernelGGL( reduce_bn_kernel<scalar_t_0, accscalar_t, accscalar_t>, dim3(grid), dim3(block), 0, stream,
                                         ^
error: expected '>'
csrc/hip/welford.hip:1045:42: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:1066:42: note: to match this '<'
     hipLaunchKernelGGL( reduce_bn_kernel<scalar_t_0, accscalar_t, scalar_t_0>, dim3(grid), dim3(block), 0, stream,
                                         ^
error: expected '>'
csrc/hip/welford.hip:1066:42: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:1115:51: note: to match this '<'
     hipLaunchKernelGGL( batchnorm_backward_kernel<scalar_t_0, accscalar_t, accscalar_t>, dim3(grid), dim3(block), 0, stream,
                                                  ^
error: expected '>'
csrc/hip/welford.hip:1115:51: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:1137:51: note: to match this '<'
     hipLaunchKernelGGL( batchnorm_backward_kernel<scalar_t_0, accscalar_t, scalar_t_0>, dim3(grid), dim3(block), 0, stream,
                                                  ^
error: expected '>'
csrc/hip/welford.hip:1137:51: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:1221:47: note: to match this '<'
     hipLaunchKernelGGL( welford_kernel_c_last<scalar_t_0, accscalar_t, accscalar_t, ELEMENTS_PER_ITER>
                                              ^
error: expected '>'
csrc/hip/welford.hip:1221:47: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:1260:57: note: to match this '<'
     hipLaunchKernelGGL( batchnorm_forward_c_last_kernel<scalar_t_0, accscalar_t, accscalar_t, ELEMENTS_PER_ITER>
                                                        ^
error: expected '>'
csrc/hip/welford.hip:1260:57: note: to match this '<'
error: expected '>'
csrc/hip/welford.hip:1281:57: note: to match this '<'
     hipLaunchKernelGGL( batchnorm_forward_c_last_kernel<scalar_t_0, accscalar_t, scalar_t_0, ELEMENTS_PER_ITER>
                                                        ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
1 warning and 20 errors generated when compiling for gfx803.
error: command '/opt/rocm/bin/hipcc' failed with exit status 1

Do you have any suggestions on how to solve these errors?

Thanks!

Failing DDP race condition test on MI50

The following error messages were captured on an MI50 server with the PyTorch build from IFU-master-2021-12-13 when running

HIP_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 DDP/ddp_race_condition_test.py

running DDP tests
/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
i = 0
i = 0
model.a: grad.data_ptr() = 140255244058624, expected sum 16777216.0, got 16777216.0
model.a: grad.data_ptr() = 140549851971584, expected sum 16777216.0, got 16777216.0
model.b: grad.data_ptr() = 140281886277632, expected sum 8388608.0, got 8388608.0
model.b: grad.data_ptr() = 140571930787840, expected sum 8388608.0, got 8388608.0
rank 1 created group 0 with backend nccl
rank 1 created group 1 with backend nccl
rank 1 created group 2 with backend nccl
rank 0 created group 0 with backend nccl
rank 0 created group 1 with backend nccl
rank 0 created group 2 with backend nccl
Traceback (most recent call last):
  File "DDP/ddp_race_condition_test.py", line 52, in <module>
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/_tensor.py", line 352, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 401, in allreduce_hook
    self.comm_ready_buckets(param)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 535, in comm_ready_buckets
    self.allreduce_maybe_retain(self.buckets[bucket_idx], bucket_idx)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 480, in allreduce_maybe_retain
    allreduced = self.allreduce_bucket(bucket, bucket_idx, force_default_stream)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 450, in allreduce_bucket
    dist.all_reduce(tensor_to_allreduce, group=self.bucket_pgs[bucket_idx%self.num_allreduce_streams])
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1314, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1164, unhandled system error, NCCL version 2.8.4
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
  File "DDP/ddp_race_condition_test.py", line 52, in <module>
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/_tensor.py", line 352, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 401, in allreduce_hook
    self.comm_ready_buckets(param)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 535, in comm_ready_buckets
    self.allreduce_maybe_retain(self.buckets[bucket_idx], bucket_idx)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 480, in allreduce_maybe_retain
    allreduced = self.allreduce_bucket(bucket, bucket_idx, force_default_stream)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 450, in allreduce_bucket
    dist.all_reduce(tensor_to_allreduce, group=self.bucket_pgs[bucket_idx%self.num_allreduce_streams])
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1314, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1164, unhandled system error, NCCL version 2.8.4
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 56622) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
DDP/ddp_race_condition_test.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2021-12-15_00:24:52
  host      : d740d63d6dfd
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 56623)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-12-15_00:24:52
  host      : d740d63d6dfd
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 56622)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Update version from 0.1

Hi,

setup.py indicates that the current version is 0.1, which does not seem up to date with the GitHub releases: https://github.com/ROCmSoftwarePlatform/apex/blob/3ba7192d1ebad65bb34615a5ee03d9e00d1b00a6/setup.py#L665. As ROCm APEX is installed in e.g. ROCm PyTorch containers, it can be confusing to have a single 0.1 version for all of the ROCm APEX releases.

Thank you!

Edit: it appears the ambiguity exists as well in NVIDIA APEX. Not sure why no proper versioning exists.

Fatal error: 'cuda_runtime_api.h' file not found

The following error message occurs when I install apex from source on my ROCm server (CentOS 7.6).

python setup.py install --cpp_ext --cuda_ext
In file included from /public/home/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_adam_kernel.hip:12:
In file included from /public/home/apex-master/csrc/multi_tensor_apply.cuh:3:
/public/home/.conda/envs/my_env/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:10: fatal error: 'cuda_runtime_api.h' file not found
#include <cuda_runtime_api.h>
         ^~~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx803.
error: command '/public/software/compiler/rocm/rocm-4.0.1/bin/hipcc' failed with exit code 1

It seems that the failure occurs while building the 'distributed_adam_cuda' extension.

My environment:

Currently Loaded Modulefiles:
  1) compiler/devtoolset/7.3.1   2) compiler/rocm/4.0.1         3) mpi/hpcx/2.4.1/gcc-7.3.1
python                    3.8.13 
torch                     1.10.1+rocm4.0.1

The failing unit tests in test_transducer_joint.py

  • test_transducer_joint_pack_relu_dropout (test_transducer_joint.TransducerJointTest)
  • test_transducer_joint_relu_dropout (test_transducer_joint.TransducerJointTest)
  • test_transducer_joint_vec_pack_relu_dropout (test_transducer_joint.TransducerJointTest)
  • test_transducer_joint_vec_relu_dropout (test_transducer_joint.TransducerJointTest)

The above four unit tests with "dropout" failed with the following error messages:

Traceback (most recent call last):
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 149, in test_transducer_joint_pack_relu_dropout
    self.run_transducer_joint(for_vector_kernel=False, pack_output=True, relu=True, dropout=True)
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 109, in run_transducer_joint
    mask=mask if dropout else None)
  File "/apex/apex/contrib/test/transducer/transducer_ref.py", line 94, in transducer_joint_reference
    h.backward(h_grad)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 402, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 193, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [4, 101, 25, 509]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
  • test_transducer_joint_pack (test_transducer_joint.TransducerJointTest)
  • test_transducer_joint_pack_relu (test_transducer_joint.TransducerJointTest)

The above unit tests failed with the following error messages:

Traceback (most recent call last):
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 137, in test_transducer_joint_pack_relu
    self.run_transducer_joint(for_vector_kernel=False, pack_output=True, relu=True, dropout=False)
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 115, in run_transducer_joint
    self.assertTrue(torch.allclose(f_grad_ref, f_grad_tst, atol=1e-5, rtol=1e-5))
AssertionError: False is not true

These failures are not reproducible locally with the Docker image (rocm/pytorch:latest == rocm5.2_ubuntu20.04_py3.7_pytorch_staging). We may need to mark them as flaky tests in the future or adjust the tolerances for ROCm.

Unit test failures: "if cached_x.grad_fn.next_functions[1][0].variable is not x: IndexError: tuple index out of range"

The failing unit tests introduced by PyTorch commits, related to "if cached_x.grad_fn.next_functions[1][0].variable is not x: IndexError: tuple index out of range", are also observed on CUDA (upstream).

test_conv2d_is_half (test_basic_casts.TestBasicCastsHalf)
test_linear_is_half (test_basic_casts.TestBasicCastsHalf)
test_whitelist_module_fp32_weight (test_cache.TestCache)
test_restoring (test_checkpointing.TestCheckpointing)
test_state_dict (test_checkpointing.TestCheckpointing)
test_gru_cell_is_half (test_rnn.TestRnnCells)
test_lstm_cell_is_half (test_rnn.TestRnnCells)
test_rnn_cell_is_half (test_rnn.TestRnnCells)

test_mlp benchmark got accuracy assert error

hi, team,

I tried to benchmark the MLP implementation as follows:

env setup

GPU: MI308
rocm: 6.1.2.60102-119~20.04
pytorch: 2.4.0.dev20240501+rocm6.1

How to reproduce

cd apex/
pip install -r requirements.txt 
python3 setup.py install 
cd tests/run_mlp
python3 test_mlp.py

accuracy errors

FAIL: test_no_bias (__main__.TestMLP)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/apex-1.3.0-py3.9-linux-x86_64.egg/apex/testing/common_utils.py", line 32, in wrapper
    fn(*args, **kwargs)
  File "/workspace/apex/tests/L0/run_mlp/test_mlp.py", line 77, in test_no_bias
    np.testing.assert_allclose(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-07

Mismatched elements: 2 / 1024 (0.195%)
Max absolute difference: 1.8835999e-07
Max relative difference: 0.00286722
 x: array([[ 0.027259],
       [-0.054091],
       [-0.003985],...
 y: array([[ 0.027259],
       [-0.054091],
       [-0.003985],...

FAIL: test_no_grad (__main__.TestMLP)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/apex-1.3.0-py3.9-linux-x86_64.egg/apex/testing/common_utils.py", line 32, in wrapper
    fn(*args, **kwargs)
  File "/workspace/apex/tests/L0/run_mlp/test_mlp.py", line 163, in test_no_grad
    np.testing.assert_allclose(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-07

Mismatched elements: 140028 / 491520 (28.5%)
Max absolute difference: 7.2151306e-07
Max relative difference: 951.6179
 x: array([[-2.597046e-05,  4.191594e-06, -6.009603e-05, ...,  2.606537e-04,
         6.171300e-05, -6.382344e-05],
       [ 1.673573e-05, -5.885254e-05, -8.349993e-05, ..., -7.531334e-05,...
 y: array([[-2.654276e-05,  3.729273e-06, -6.010690e-05, ...,  2.608544e-04,
         6.124897e-05, -6.427429e-05],
       [ 1.673585e-05, -5.885251e-05, -8.349984e-05, ..., -7.531326e-05,...

FAIL: test_with_bias (__main__.TestMLP)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/apex-1.3.0-py3.9-linux-x86_64.egg/apex/testing/common_utils.py", line 32, in wrapper
    fn(*args, **kwargs)
  File "/workspace/apex/tests/L0/run_mlp/test_mlp.py", line 116, in test_with_bias
    np.testing.assert_allclose(
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 1530, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-07

Mismatched elements: 3 / 1024 (0.293%)
Max absolute difference: 1.899898e-07
Max relative difference: 0.00063155
 x: array([[-0.128916],
       [-0.052111],
       [ 0.001069],...
 y: array([[-0.128916],
       [-0.052111],
       [ 0.001069],...

----------------------------------------------------------------------
Ran 6 tests in 16.497s

FAILED (failures=3)

Is a specific torch/ROCm version required for this benchmark?

Many thanks,
David

NaNs in test_half (test_fused_optimizer.TestFusedAdam)

The failure in test_half (test_fused_optimizer.TestFusedAdam) is only observed on ROCm. NaNs show up sporadically in the outputs after apex.optimizers.FusedAdam updates its parameters (about 99% of the values are correct compared to the outputs of torch.optim.Adam).
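
A minimal sketch of the kind of comparison described above (hedged: the actual test_half unit test uses half-precision parameters and different hyperparameters; this fp32 version only illustrates checking FusedAdam for NaNs against torch.optim.Adam):

import torch
from apex.optimizers import FusedAdam

param_ref = torch.randn(64, 64, device="cuda", requires_grad=True)
param_tst = param_ref.detach().clone().requires_grad_(True)

opt_ref = torch.optim.Adam([param_ref], lr=1e-3)
opt_tst = FusedAdam([param_tst], lr=1e-3)

grad = torch.randn(64, 64, device="cuda")
param_ref.grad = grad.clone()
param_tst.grad = grad.clone()
opt_ref.step()
opt_tst.step()

# Any NaNs or large mismatches indicate the issue described above
print(torch.isnan(param_tst).any().item(), torch.allclose(param_ref, param_tst, atol=1e-6))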

python setup.py clean does not fully clean up build folder

Building with

python setup.py install --cpp_ext --cuda_ext

leaves the following artifacts in the build directory:

bdist.linux-x86_64  lib.linux-x86_64-3.10  temp.linux-x86_64-3.10

Cleaning the build with

python setup.py clean

results in the following output:

...
running clean
removing 'build/temp.linux-x86_64-3.10' (and everything under it)

this leaves the following artifacts in the build directory:

bdist.linux-x86_64  lib.linux-x86_64-3.10

I discovered this problem by first building apex with ROCm 5.0.2, cleaning apex using python setup.py clean, then re-building with ROCm 4.5.2, and discovering an import error when running the following lines:

>>> from apex.multi_tensor_apply import multi_tensor_applier
>>> multi_tensor_applier.available
False
>>> multi_tensor_applier.import_err
ImportError('libamdhip64.so.5: cannot open shared object file: No such file or directory')

ROCm 4.5.2 has libamdhip64.so.4 and ROCm 5.0.2 has libamdhip64.so.5.
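
A hedged workaround until python setup.py clean removes everything: delete the leftover build artifacts manually before rebuilding against a different ROCm version.

# Remove the remaining build artifacts (bdist.*, lib.*) that setup.py clean leaves behind
import shutil
shutil.rmtree("build", ignore_errors=True)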

no module torch._six

PyTorch 2.0 work using the upstream tip-of-tree showed the error in the title, which ultimately comes from this PR: pytorch/pytorch#94709.

File apex/amp/_initialize.py
❱  from torch._six import string_classes

The solution is to cherry-pick the fix from upstream apex: NVIDIA@6943fd2
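
A sketch of the kind of substitution that cherry-pick makes (an illustration only; the exact change is in the referenced upstream commit):

# Before (breaks on PyTorch >= 2.0, where torch._six was removed):
#     from torch._six import string_classes
# After: rely on the built-in str type instead
string_classes = str

print(isinstance("opt_level", string_classes))  # usage is unchanged for callers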
