ret's Introduction

Welcome to RET (ROCm Enablement Tool)

RET is a checking, setup, installation, testing, and benchmarking tool that installs the ROCm suite, from dependencies, drivers, and toolchain to frameworks and benchmarks. RET automates the ROCm installation, which makes it simpler, faster, and more user-friendly. The workflow is:

  • Install Linux OS
  • Run ret
  • Run your TensorFlow benchmark OR Train your own model with TensorFlow

Hardware Support and supported GPU

Please refer to the ROCm main repository at ROCmInstall.

Getting started

Supported OS

  • Ubuntu:
    • 16.04
    • 18.04
  • CentOS 7.6 (TensorFlow runs in Docker)

Prerequisites

Note: RET must start from a clean system.

Formatting the hard drive and installing a fresh OS is the best option. After the OS installation, you will need git to download the RET source:

  sudo apt -y install git
  git clone https://github.com/rocmsys/RET.git

Usage

sudo ./ret <command> [<option>]

For example:

sudo ./ret install rocm
sudo ./ret install tensorflow

Command options

Command:
              [install]   <Package>              : Install ROCm or ML Framework TF/PT
              [remove]    <Package>              : Remove ROCm or ML Framework TF/PT
              [benchmark] <Packages> <Model>     : Run benchmark for specific ML Framework
              [build] <Container> <ImageName>    : Build ROCm Container either with Docker or Singularity

   Packages:
              [rocm]                             : ROCm-dkms packages
              [tensorflow]                       : TensorFlow framework

   Model:
              [resnet56]                         : ResNet-56 model. Default Model
   Container:
              [docker]                           : Build Docker Container
              [singularity]                      : Build Singularity Container
              [ImageName]                        : Choosing an OS Base Image. Default is [ubuntu:18.04]
    
Options:
              [-py2|-py3]                        : Python version. Default is Python3
              [-h|--help]                        : Show this help message
              [-v|--version]                     : Show version of this package
              [-V|--verbose]                     : Be verbose
              [-d|--debug]                       : Enable Debug Mode
              [-y|--yes]                         : Skip confirmation message
              [-ns|--nsc]                        : Skip system check steps
              [-nv|--nov]                        : Skip verification steps
              [-ic|--incontainer]                : Run RET on top of Container

RUN RET

   cd RET
   sudo ./ret install rocm         # install ROCm stack
   sudo reboot
   sudo ./ret install tensorflow   # install TensorFlow

TensorFlow's benchmarks

Details on the benchmarks can be found at this Link.

Here are the basic instructions to run the ResNet-56 benchmark:

sudo ./ret benchmark tensorflow resnet56

You can also use the TensorFlow benchmarks:

Download the TensorFlow benchmark:

git clone https://github.com/reger-men/tensorflow_benchmark.git

Run the training benchmark (e.g. ResNet-56)

python3 train.py

Note: You may need to specify the number of GPUs with --num_gpus=YOUR_GPU_NUMBER

ToDo Checklist

  • Support Ubuntu 16.04
  • Support Ubuntu 18.04
  • Support CentOS 7.6
  • Support RHEL 7.6
  • tensorflow on Ubuntu
  • tensorflow on CentOS
  • tensorflow on RHEL
  • pytorch on Ubuntu
  • pytorch on CentOS
  • pytorch on RHEL
  • Check System Compatibility
  • Check HW Compatibility
  • Adapt RET on top of Docker Container
  • Cloud Support

ret's People

Contributors

chinchoretejas, domcharrier, japarada, msabony1966, paklui, reger-men

ret's Issues

Support ROCm-enabled MPI library and OSU benchmark

Is your feature request related to a problem? Please describe.
The Open MPI shipped with the OS distribution does not support GPU peer-to-peer communication. That requires a ROCm-enabled UCX, which needs to be built separately. See #11

Describe the solution you'd like
Build Open MPI with UCX support for ROCm as described here: https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI. Also include the OSU Micro-Benchmarks for sanity testing.

Describe alternatives you've considered
The ROCm software repository will ultimately provide this ROCm-enabled MPI support as binary packages of Open MPI and UCX for each distribution, because building them each time can take half an hour.

Additional context
ret.log

TensorFlow benchmarks don't work

System information

  • Ubuntu 18.04 LTS
  • 4x Vega Frontier Edition
  • AMD Ryzen Threadripper 2950X and 32 GB of RAM

Describe the bug
Unable to run step 8/12 of the installation script

To Reproduce:

  1. Do a clean Ubuntu 18.04 installation
  2. Run sudo ./ret install rocm
  3. Reboot
  4. Run sudo ./ret install tensorflow
  5. Accept (y) to run benchmarks

Expected behavior
Instead of completing, the benchmark fails with the following error:

=================================================================================================================================================================================
[+] [STEP:8/12] Verifing Tensorflow installation...
=================================================================================================================================================================================
Run Tensorflow benchmark? [Y/n] y

  [+] ⡇ Run Tensorflow Verification...
Traceback (most recent call last):
  File "/home/emoon/RET/benchmarks/tf/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 29, in <module>
    import benchmark_cnn
  File "/home/emoon/RET/benchmarks/tf/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 44, in <module>
    import datasets
  File "/home/emoon/RET/benchmarks/tf/scripts/tf_cnn_benchmarks/datasets.py", line 32, in <module>
    import preprocessing
  File "/home/emoon/RET/benchmarks/tf/scripts/tf_cnn_benchmarks/preprocessing.py", line 28, in <module>
    from tensorflow.contrib.data.python.ops import threadpool
ModuleNotFoundError: No module named 'tensorflow.contrib'
  [+] ⡏ Run resnet50 benchmark...
✘ [ERROR] resnet50 benchmark Failed...[]

Additional context / logs
I'm pretty sure this is not caused by the installation script but by the benchmark not being updated. I wanted to report it anyway.
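
A quick version guard illustrates the suspected cause. The tensorflow.contrib module was removed in TensorFlow 2.0, so a benchmark that imports it can only run on 1.x. The sketch below is not part of RET: the function name tf_has_contrib is hypothetical, and it only inspects a version string such as the one printed by python3 -c 'import tensorflow as tf; print(tf.__version__)'.

```shell
#!/usr/bin/env bash
# tf_has_contrib VERSION
# Hypothetical helper (not from RET): succeeds only for TensorFlow 1.x
# version strings, which are releases that ship tensorflow.contrib.
tf_has_contrib() {
    case "$1" in
        1.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example: decide whether the contrib-based benchmark can run.
version="2.1.0"   # illustrative value; query the real one from Python
if tf_has_contrib "$version"; then
    echo "tensorflow.contrib available, benchmark can run"
else
    echo "TensorFlow $version has no tensorflow.contrib, skipping benchmark"
fi
```

A guard like this would only turn the traceback into a readable message; the benchmark scripts would still need updating to run on TensorFlow 2.x.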

Consider supporting EasyBuild

Please take a look at https://easybuild.readthedocs.io/en/latest/. This tool is in use by over half of all the major HPC installations around the world. It is well structured and provides most of the infrastructure you need to install software. It would be great if you could provide your stack through EasyBuild, making it instantly available to HPC users around the world. Thanks.

Path error in MPI benchmark

System information

  • Ubuntu 18.04 LTS
  • MI50
  • EPYC 7742

Describe the bug

Running the Open MPI benchmarks results in a crash caused by the command
mkdir -p /RET_MPI/omb.

A folder RET_MPI exists in the RET root directory. I suspect
the above path should point to this directory.

To Reproduce:

# cd <RET-root>
sudo ./ret install rocm
sudo ./ret install mpi

Expected behavior

Benchmarks complete without errors.

Additional context / logs

log/ret.log

[2020.01.20  @ 12:00:53:780283254] [DON] Build Open MPI completed successfully. []
[2020.01.20  @ 12:00:53:798066022] [DON] Install Completed Successfully. []
[2020.01.20  @ 12:00:53:817972288] [SUB] \e[0K⡇ Run mpi Post-Installation: Setting PATH and LD_LIBRARY_PATH in /etc/profile.d/mpi.sh  []
[2020.01.20  @ 12:00:53:817237087] [CMD] echo 'export PATH=/home/hpcuser/work/RET/RET_MPI/RET_MPI/ompi/bin:/home/hpcuser/work/RET/RET_MPI/RET_MPI/ucx/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin' | tee -a /etc/profile.d/mpi.sh [runCmd]
[2020.01.20  @ 12:00:53:836970005] [OUT] export PATH=/home/hpcuser/work/RET/RET_MPI/RET_MPI/ompi/bin:/home/hpcuser/work/RET/RET_MPI/RET_MPI/ucx/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
[2020.01.20  @ 12:00:53:854840308] [CMD] echo 'export LD_LIBRARY_PATH=/home/hpcuser/work/RET/RET_MPI/RET_MPI/ompi/lib:/home/hpcuser/work/RET/RET_MPI/RET_MPI/ucx/lib:/home/hpcuser/work/RET/RET_MPI/RET_MPI/gdrcopy/lib64:' | tee -a /etc/profile.d/mpi.sh [runCmd]
[2020.01.20  @ 12:00:53:874062610] [OUT] export LD_LIBRARY_PATH=/home/hpcuser/work/RET/RET_MPI/RET_MPI/ompi/lib:/home/hpcuser/work/RET/RET_MPI/RET_MPI/ucx/lib:/home/hpcuser/work/RET/RET_MPI/RET_MPI/gdrcopy/lib64:
[2020.01.20  @ 12:00:53:891704371] [CMD] source /etc/profile.d/mpi.sh [runCmd]
[2020.01.20  @ 12:00:53:913592327] [INF] Run OSU MPI benchmark? [Y/n] [Y]
[2020.01.20  @ 12:00:53:937328920] [SUB] \e[0K⡇ Setting up OSU MPI Benchmark []
[2020.01.20  @ 12:00:53:936121907] [CMD] su -p hpcuser -c 'rm -fr /RET_MPI/omb' [runCmd]
[2020.01.20  @ 12:00:53:964838774] [CMD] su -p hpcuser -c 'mkdir -p /RET_MPI/omb' [runCmd]
[2020.01.20  @ 12:00:53:992012210] [OUT] mkdir: cannot create directory '/RET_MPI': Permission denied
[2020.01.20  @ 12:00:54:009803224] [ERR] An error occurred while executing the command! [su -p hpcuser -c 'mkdir -p /RET_MPI/omb']  
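
One plausible explanation for the /RET_MPI/omb path in the log is a directory prefix that was empty when the path was assembled. This is an assumption, since the RET source is not shown here. The sketch below reproduces that failure mode and shows a guard using bash's ${var:?message} expansion. The function name omb_dir and the variable unset_root are hypothetical and not taken from RET.

```shell
#!/usr/bin/env bash
# omb_dir ROOT
# Hypothetical helper (not from RET): build the OSU benchmark directory
# under ROOT, and abort instead of silently falling back to "/".
omb_dir() {
    local root="${1:?RET root directory must not be empty}"
    printf '%s/RET_MPI/omb\n' "${root%/}"   # strip one trailing slash
}

# An empty prefix collapses to the filesystem root, as seen in the log:
unset_root=""
echo "${unset_root}/RET_MPI/omb"        # prints /RET_MPI/omb

# With the guard, the path stays inside the checkout:
omb_dir "/home/hpcuser/work/RET"        # prints /home/hpcuser/work/RET/RET_MPI/omb
```

Calling omb_dir with an empty argument stops the script with the given message, so a misconfigured prefix fails early rather than at mkdir.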

RET Install Failing with rocm-profiler not found

I am unable to complete the ROCm install using RET on fresh images of Ubuntu 18.04. Here is the error log:

[2020.03.11 @ 12:35:58:470768242] [CMD] apt-get -qq -y --allow-unauthenticated install 'rocm-profiler' [runCmd]
[2020.03.11 @ 12:35:58:471675244] [SUB] \e[0K⡇ Installing Package: rocm-profiler []
[2020.03.11 @ 12:35:58:890484545] [OUT] E: Unable to locate package rocm-profiler
[2020.03.11 @ 12:35:58:904542343] [ERR] An error occurred while executing the command! [apt-get -qq -y --allow-unauthenticated install 'rocm-profiler']
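
The log shows the install aborting on the first package that the configured repositories do not provide. As a sketch of one way to detect this before installing, the helper below filters a package list through a probe command. The names filter_available and apt_has are hypothetical, not RET functions. The example probe assumes that apt-cache show exits non-zero for a package it cannot find. Whether skipping a missing package is acceptable depends on whether the rest of the stack still needs it.

```shell
#!/usr/bin/env bash
# filter_available PROBE PKG...
# Hypothetical helper (not from RET): print each PKG for which
# "PROBE PKG" succeeds; report the others on stderr.
filter_available() {
    local probe="$1"
    shift
    local pkg
    for pkg in "$@"; do
        if "$probe" "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
        else
            echo "package not found in configured repositories: $pkg" >&2
        fi
    done
}

# Example probe for apt-based systems (assumption: non-zero exit when
# the package is unknown).
apt_has() {
    apt-cache show "$1"
}

# Usage on an apt-based system:
#   filter_available apt_has rocm-profiler
```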

Support building HOOMD-Blue in RET

Is your feature request related to a problem? Please describe.
Enable RET to build HOOMD-Blue and its dependencies in Miniconda3.

Describe the solution you'd like
Implement the installation steps in src/cmd and src/cmd_utils.
The steps to set up conda and the packages required for HOOMD-Blue are described below.

Describe alternatives you've considered
Consider installing directly via the bash scripts below.

Additional context
ret.log for install: ret-hoomd-install.log

Installation on HOOMD-Blue web site:
https://hoomd-blue.readthedocs.io/en/stable/installation.html#compiling-from-source

Current installation instruction for HIP port:

Installation instructions for HOOMD-blue -- HIP branch on AMD servers
=====================================================================

12/02/2019 12/10/2019 12/13/2019 12/19/2019

Build Open MPI with ROCm-enabled UCX support

while HOOMD-blue (HIP branch) is under development, the documentation will be kept up to date here
https://hoomd-blue.readthedocs.io/en/hip/installation.html

(see there for dependencies, unless superseded by this doc)

see here for latest progress
https://github.com/glotzerlab/hoomd-blue/pull/541

# for multi-node MPI https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
# export LD_LIBRARY_PATH=/home/rocmhoomds/ompi/ompiinstall/lib:$LD_LIBRARY_PATH

# Checkout ROCm dependencies
sudo apt-get install hipsparse
sudo apt-get install rocfft
sudo apt-get install hipcub # removed -y
sudo apt-get install rocrand # removed -y
sudo apt-get install rocthrust # removed -y
sudo apt-get install roctracer-dev # new

# there may be a conflict between NVIDIA thrust and rocThrust if the NVIDIA CUDA toolkit is simultaneously installed in /usr.
# (in that case uninstall the NVIDIA toolkit)

# host compilers I tested the HIP branch with: gcc-7.4 and clang++-10

# Install gcc 7.4.0 from source and apt-get
# try running ./contrib/download_prerequisites.sh from the gcc source dir. It worked for me (for the current version of gcc though (gcc-4.7)

# checkout HOOMD-blue (hip branch)
git clone https://github.com/glotzerlab/hoomd-blue
cd hoomd-blue
git checkout next #hip
git submodule update --init

# set up python environment
cd $HOME
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
# reply "yes" to "Do you wish the installer to initialize Miniconda3 by running conda init?"

# close and reopen shell (login logout)
conda install -y -c anaconda python=3.8
conda update -y --all
conda create -y -n myenv python=3.8 # or rename myenv to something like hoomd-single-mpi, to differentiate and switch between builds
echo "conda activate myenv" >> ~/.bashrc
source ~/.bashrc
conda activate myenv

# newest version >= 3.14.0
conda install -y cmake 
conda install -y numpy
conda install -y pybind11
conda install -y eigen
conda install -y -c conda-forge cereal 
conda install -y -c conda-forge signac-flow

# build hoomd in single precision (md component only)
mkdir build; cd build
echo "export CMAKE_PREFIX_PATH=$CONDA_PREFIX" >> ~/.bashrc

# several build alternatives hexagon and depletion need -D SINGLE_PRECISION=OFF 
cmake -D PYTHON_EXECUTABLE=`which python3` -D ENABLE_GPU=ON -D SINGLE_PRECISION=ON -D ENABLE_MPI=OFF ../hoomd-blue/
cmake -D PYTHON_EXECUTABLE=`which python3` -D ENABLE_GPU=ON -D SINGLE_PRECISION=OFF -D ENABLE_MPI=OFF ../hoomd-blue/
cmake -D PYTHON_EXECUTABLE=`which python3` -D ENABLE_GPU=ON -D SINGLE_PRECISION=OFF -D ENABLE_MPI=ON ../hoomd-blue/
cmake -D PYTHON_EXECUTABLE=`which python3` -D ENABLE_GPU=ON -D SINGLE_PRECISION=ON -D ENABLE_MPI=ON ../hoomd-blue/ 

# Note: For ROCm-enabled MPI, use the ENABLE_MPI_CUDA=ON flag
cmake -D PYTHON_EXECUTABLE=`which python3` -D ENABLE_GPU=ON -D SINGLE_PRECISION=ON -D ENABLE_MPI=ON -D ENABLE_MPI_CUDA=ON -D CMAKE_INSTALL_PREFIX=${HOME}/miniconda3/envs/myenv -D PYTHON_SITE_INSTALL_DIR=${HOME}/miniconda3/lib/python3.8/site-packages/hoomd ../hoomd-blue/

make -j 64 install

# clone hoomd-benchmarks and dependencies (next branch)
cd $HOME
git clone https://[email protected]/glotzerlab/hoomd-benchmarks
cd hoomd-benchmarks
git checkout next

# run LJ liquid benchmark

# default size 1M particles
mpirun -np 1 python project.py run lj_liquid-benchmark-gpu_np1 # 1 GPU
mpirun -np 4 python project.py run lj_liquid-benchmark-gpu_np1 # 4 GPUs
mpirun -np 8 python project.py run lj_liquid-benchmark-cpu_np8 # 8 CPU cores

# custom size N=50^3 particles
cd lj_liquid
python init.py 50
cd ..
python project.py run -o lj_liquid-equilibrate # equilibrate
python project.py run -o lj_liquid-benchmark-gpu_np1 -f n 50

# or, run **all** benchmarks (lj_liquid, patchy_protein, hexagon, microsphere, quasicrystal, depletion, spce)
python project.py run

# retrieve performance (TPS or MPS=N*TPS) from data base
signac document -f benchmark lj_liquid

# - due to a bug in HIP (likely related to https://github.com/ROCm-Developer-Tools/HIP/pull/1698) this command may abort after
#   some time with a message saying it could not find a kernel. This is likely because a shared library is lazily loaded
#   in between GPU kernel calls. In that case, execute benchmarks separately

# - if a benchmark fails with an unknown HSA error, or one like the following, 
#   ### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_INVALID_ISA (0x100f) at file:mcwamp_hsa.cpp line:1191
#   this is probably due to the issue addressed in
#   https://github.com/ROCm-Developer-Tools/HIP/pull/1676


Issue with PyThread_tss_alloc, because OMPI is not built with thread support?

ROCm 3.5 supports Ubuntu 18.04 with kernel 5.3

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04

Describe the bug
When using a current version of Ubuntu 18.04, you will have kernel 5.3 installed. But ret complains about the installations:

===========================================================================================================================================================
[+] [STEP:2/12] Checking OS...
===========================================================================================================================================================
        OS                           ..........  ✔ Ubuntu
        RELEASE                      ..........  ✔ 18.04
        KERNEL                       ..........  ✘ 5.3.0-59-generic
        ARCH                         ..........  ✔ amd64

[WARNING] OS kernel Not Supported!...[5.3.0-59-generic]

[INFO] A kernel update to version 5.0 is required to complete this setup.

[INFO] The Kernel 5.0 is Already Installed! Please Reboot The System With This Kernel

However, the ROCm release notes state that Ubuntu 18.04 with kernel 5.3 is supported:
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html#amd-rocm-release-notes-v3-5-0

Expected behaviour

  1. Support kernel 5.3 with Ubuntu 18.04
  2. Do not issue a warning about "kernel update" if it is actually a "kernel downgrade"
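
The second point can be sketched as a version comparison that distinguishes an upgrade from a downgrade. This is an illustration, not RET's actual check. The function name kernel_action is hypothetical, only the major.minor prefix is compared, and sort -V (version sort) is assumed to be available, as it is in GNU coreutils.

```shell
#!/usr/bin/env bash
# kernel_action RUNNING REQUIRED
# Hypothetical helper (not from RET): print "upgrade", "downgrade", or
# "none", comparing only the major.minor part of both versions.
kernel_action() {
    local running required lowest
    running="$(printf '%s' "$1" | cut -d. -f1,2)"
    required="$(printf '%s' "$2" | cut -d. -f1,2)"
    if [ "$running" = "$required" ]; then
        echo "none"
        return
    fi
    # sort -V orders version strings numerically, so 4.15 sorts before 5.0.
    lowest="$(printf '%s\n%s\n' "$running" "$required" | sort -V | head -n1)"
    if [ "$lowest" = "$running" ]; then
        echo "upgrade"
    else
        echo "downgrade"
    fi
}

kernel_action "5.3.0-59-generic" "5.0"    # prints downgrade
kernel_action "4.15.0-20-generic" "5.0"   # prints upgrade
```

With a check like this, the message for kernel 5.3.0-59-generic could say that a downgrade to 5.0 would be needed, or the check could accept newer kernels that the ROCm release notes list as supported.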
