Comments (44)

denisbertini commented on July 30, 2024

The other script I sent before is just another variant. In this case, WarpX is already installed within the container, and all the warpx_1d/2d/3d executables can then be executed directly.

denisbertini commented on July 30, 2024

@ax3l BTW, I created different Singularity definition files which could eventually allow any user to run WarpX on any HPC system that provides Singularity/Apptainer as its container technology. Definition files for CPUs and for GPUs (AMD + ROCm) are provided.

Might that be interesting for WarpX users?
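For reference, a minimal Apptainer/Singularity definition file for a ROCm-capable WarpX container might look like the sketch below. This is only an illustration, not the actual definition files mentioned above; the base image, package list, and build flags are assumptions.

Bootstrap: docker
From: rocm/dev-ubuntu-22.04:5.4.3

%post
    # Assumed build dependencies; adjust to the actual software stack
    apt-get update && apt-get install -y cmake git g++ libopenmpi-dev openmpi-bin
    git clone https://github.com/ECP-WarpX/WarpX.git /opt/warpx
    cmake -S /opt/warpx -B /opt/warpx/build -DWarpX_COMPUTE=HIP -DWarpX_DIMS=2
    cmake --build /opt/warpx/build -j 16 --target install

%runscript
    # Dispatch to the requested executable, e.g. "warpx_2d <input deck>"
    exec "$@"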

denisbertini commented on July 30, 2024

Yes, this is actually exactly the way I do it.

denisbertini commented on July 30, 2024

This is now set; waiting for the next release to solve that eventually.

WeiqunZhang commented on July 30, 2024

Yes, I have some thoughts on how to handle this.


RemiLehe commented on July 30, 2024

Thanks for your interest in the code.
Could you share the content of your run-file.sh script? In particular, are you calling an MPI runner, such as mpirun, srun or jsrun from inside run-file.sh?


denisbertini commented on July 30, 2024

For example, submitting with the above Slurm command gives at initialization:


MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 1 device.
AMReX (23.05) initialized
PICSAR (1903ecfff51a)
WarpX (23.05)

    __        __             __  __
    \ \      / /_ _ _ __ _ __\ \/ /
     \ \ /\ / / _` | '__| '_ \\  /
      \ V  V / (_| | |  | |_) /  \
       \_/\_/ \__,_|_|  | .__/_/\_\
                        |_|

Level 0: dt = 1.530214125e-17 ; dx = 5.580357143e-09 ; dz = 8.081896552e-09

Grids Summary:
  Level 0   29 grids  9977856 cells  100 % of domain
            smallest grid: 2688 x 128  biggest grid: 2688 x 128


Shouldn't it be "HIP initialized with 4 devices", i.e. 1 GPU device per MPI rank?

denisbertini commented on July 30, 2024

run-file.sh:

#!/bin/bash

# Singularity/Apptainer image with the full WarpX software stack
#export CONT=/lustre/rz/dbertini/containers/prod/gpu/rlx8_rocm-5.4.3_warpx.sif
export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx

# Use the ROMIO MPI-IO component and bind Lustre and CVMFS into the container
export OMPI_MCA_io=romio321
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs

# Execute the image directly; its runscript dispatches to the requested binary
srun --export=ALL -- $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck

RemiLehe commented on July 30, 2024

OK, thanks.
Based on the output that you sent, I think that WarpX is actually using 4 GPUs. I think that the message "HIP initialized with 1 device." is to be understood per MPI rank.
@WeiqunZhang Could you confirm that this is the case?

denisbertini commented on July 30, 2024

Unfortunately, on the node where the job is running, I see only one GPU being used, as shown by the rocm-smi utility.

denisbertini commented on July 30, 2024

This, for example, is a snapshot of the rocm-smi output while the job is running:

GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    71.0c           198.0W  1502Mhz  1200Mhz  0%   auto  290.0W   84%   99%
1    38.0c           34.0W   300Mhz   1200Mhz  0%   auto  290.0W    2%   0%
2    38.0c           34.0W   300Mhz   1200Mhz  0%   auto  290.0W    2%   0%
3    39.0c           34.0W   300Mhz   1200Mhz  0%   auto  290.0W    2%   0%
4    59.0c           40.0W   300Mhz   1200Mhz  0%   auto  290.0W    0%   0%
5    38.0c           34.0W   300Mhz   1200Mhz  0%   auto  290.0W    0%   0%
6    40.0c           34.0W   300Mhz   1200Mhz  0%   auto  290.0W    0%   0%
7    40.0c           35.0W   300Mhz   1200Mhz  0%   auto  290.0W    0%   0%
================================================================================
============================= End of ROCm SMI Log ==============================

One can see that only the GPU at index 0 is used, while the other requested GPUs are idle... any idea?

WeiqunZhang commented on July 30, 2024

"HIP initialized with 1 device." means only one device is being used by all the processes. This is most likely a job script issue.

denisbertini commented on July 30, 2024

Yes, but it is still not clear to me what could possibly be wrong in my job script ...

WeiqunZhang commented on July 30, 2024

It depends on how Slurm is configured on that system. Maybe try changing --cpus-per-task 1: figure out how many CPUs you have on a node and divide that by 4. The issue might be that all your processes were using CPUs close to one GPU, and that GPU was mapped to all 4 processes. There is nothing we can do in the C++ code if the GPUs are not visible to us.

denisbertini commented on July 30, 2024

I have 96 processors on one machine, so I changed to --cpus-per-task 24, but it still uses only one GPU; that does not help.

WeiqunZhang commented on July 30, 2024

Maybe instead of --gres=gpu:4, you can try --gpus-per-task=1 and --gpu-bind=verbose,single:1.
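For illustration, a batch script along those lines might look like the sketch below (option names per Slurm's documentation; the values are assumptions based on the 96-CPU, 8-GPU node described in this thread):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=24            # 96 CPUs / 4 tasks
#SBATCH --gpus-per-task=1             # one GPU per MPI rank
#SBATCH --gpu-bind=verbose,single:1   # bind each task to a single GPU and log the mapping

srun --export=ALL -- $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck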

denisbertini commented on July 30, 2024

With your change I got the following error:

gpu-bind: usable_gres=0x8; bit_alloc=0xF; local_inx=4; global_list=3; local_list=3
gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
gpu-bind: usable_gres=0x4; bit_alloc=0xF; local_inx=4; global_list=2; local_list=2
gpu-bind: usable_gres=0x2; bit_alloc=0xF; local_inx=4; global_list=1; local_list=1
amrex::Abort::1::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
SIGABRT
amrex::Abort::2::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
SIGABRT
amrex::Abort::3::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!


denisbertini commented on July 30, 2024

So it seems that, from the WarpX/AMReX perspective, there is only one GPU device on the node?

denisbertini commented on July 30, 2024

Resubmitting with --ntasks-per-node 1 works fine; only one GPU is visible to WarpX.

WeiqunZhang commented on July 30, 2024

The error message means that processes 1, 2 and 3 see zero GPUs, as reported by hipGetDeviceCount. Only process 0 sees a GPU.

You can also run rocm-smi instead of warpx. I suspect you will see the same behavior, that is, that only one GPU in total is available.
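One quick way to check what each rank actually sees (a sketch; assumes bash on the compute node and the ROCR_VISIBLE_DEVICES variable that ROCm uses to restrict device visibility):

srun --export=ALL bash -c \
  'echo "rank $SLURM_PROCID: ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES:-unset}"; rocm-smi'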

denisbertini commented on July 30, 2024

That is a good idea!

denisbertini commented on July 30, 2024

So rocm-smi sees all 8 devices:

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    45.0c           35.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
1    43.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
2    43.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
3    44.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
4    41.0c           39.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
5    40.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
6    40.0c           34.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
7    39.0c           38.0W   300Mhz  1200Mhz  0%   auto  290.0W    0%   0%    
================================================================================
============================= End of ROCm SMI Log ==============================


denisbertini commented on July 30, 2024

And it was launched using the same Slurm command as for WarpX.

WeiqunZhang commented on July 30, 2024

Did you run it under srun?


denisbertini commented on July 30, 2024

Yes, the same command.

denisbertini commented on July 30, 2024

Additionally, I ran another GPU-based PIC code, PIConGPU, and it sees all devices and uses all GPUs on the machine without problems.

WeiqunZhang commented on July 30, 2024

srun --export=ALL -- $CONT rocm-smi $WDIR/scripts/inputs/warpx_opmd_deck?


denisbertini commented on July 30, 2024

No, without the input deck file, just:

srun --export=ALL -- $CONT rocm-smi

WeiqunZhang commented on July 30, 2024

What is /cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif?


denisbertini commented on July 30, 2024

This is a Singularity container which contains the full software stack needed to run WarpX,

denisbertini commented on July 30, 2024

ROCm included.

WeiqunZhang commented on July 30, 2024

Could you add the following lines after line 256 (device_id = my_rank % gpu_device_count;) of build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp and recompile it?

        amrex::AllPrint() << "Proc. " << ParallelDescriptor::MyProc()
                          << ": nprocspernode = " << ParallelDescriptor::NProcsPerNode()
                          << ", my_rank = " << my_rank << ", device count = "
                          << gpu_device_count << "\n";

Hopefully this can give us more information.


denisbertini commented on July 30, 2024

Reinstalling with v23.06 and the modifications to the AMReX code you asked for gives:

MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
Proc. 0: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 1: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 2: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 3: nprocspernode = 1, my_rank = 0, device count = 4
HIP initialized with 1 device.
AMReX (23.06) initialized
PICSAR (1903ecfff51a)
WarpX (Unknown)

Strange... my_rank is always 0?

WeiqunZhang commented on July 30, 2024

That my_rank is the rank in a subcommunicator of type MPI_COMM_TYPE_SHARED. The issue is that there is only one process per "node", probably because of the container. That is, in this configuration the CPUs and their memory are not shared, whereas the GPUs are shared.

You can try to map only one GPU to each MPI task in the Slurm job script, maybe with more explicit GPU mapping. Or you can modify your AMReX source code. You can make the following change, which should work for your specific case.

diff --git a/Src/Base/AMReX_GpuDevice.cpp b/Src/Base/AMReX_GpuDevice.cpp
index d709531440..cfd7a39e5c 100644
--- a/Src/Base/AMReX_GpuDevice.cpp
+++ b/Src/Base/AMReX_GpuDevice.cpp
@@ -253,6 +253,7 @@ Device::Initialize ()
         // ranks to GPUs, assuming that socket awareness has already
         // been handled.
 
+        my_rank = ParallelDescriptor::MyProc();
         device_id = my_rank % gpu_device_count;
 
         // If we detect more ranks than visible GPUs, warn the user

We will try to fix this in the next release.

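To try this without waiting for a release, one could patch the fetched AMReX sources in place and rebuild, along these lines (a sketch; the path follows the superbuild layout mentioned above, and the patch file name is hypothetical):

# Save the diff above as gpu-device.patch, then apply it to the fetched AMReX sources
cd build/_deps/fetchedamrex-src
patch -p1 < /path/to/gpu-device.patch
# Rebuild WarpX against the patched sources
cd ../../..
cmake --build build -j 16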

denisbertini commented on July 30, 2024

But gpu_device_count is not correct either; it should be 8 and not 4 in my case...

WeiqunZhang commented on July 30, 2024

I think that's because of --ntasks-per-node 4 --cpus-per-task 1 --gres=gpu:4.


denisbertini commented on July 30, 2024

Ah, that is correct! Thx!

denisbertini commented on July 30, 2024

Your patch seems to work:

MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
Proc. 1: nprocspernode = 1, my_rank = 1, device count = 4
Proc. 3: nprocspernode = 1, my_rank = 3, device count = 4
Proc. 0: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 2: nprocspernode = 1, my_rank = 2, device count = 4
HIP initialized with 4 devices.
AMReX (23.06) initialized
PICSAR (1903ecfff51a)
WarpX (Unknown)

But it is quite a change in the AMReX logic!

denisbertini commented on July 30, 2024

And checking with rocm-smi indeed shows proper GPU usage:

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    63.0c           100.0W  1502Mhz  1200Mhz  0%   auto  290.0W   79%   92%   
1    61.0c           326.0W  1502Mhz  1200Mhz  0%   auto  290.0W   79%   99%   
2    72.0c           249.0W  1502Mhz  1200Mhz  0%   auto  290.0W   79%   99%   
3    62.0c           252.0W  1502Mhz  1200Mhz  0%   auto  290.0W   79%   99%   
4    43.0c           39.0W   300Mhz   1200Mhz  0%   auto  290.0W    0%   0%    
5    42.0c           34.0W   300Mhz   1200Mhz  0%   auto  290.0W    0%   0%    
6    42.0c           34.0W   300Mhz   1200Mhz  0%   auto  290.0W    0%   0%    
7    40.0c           37.0W   300Mhz   1200Mhz  0%   auto  290.0W    0%   0%    
================================================================================
============================= End of ROCm SMI Log ==============================

But I see that the GPU usage varies between 0 and 99%. Is this correct?
Is there a way with WarpX to measure the usage efficiency when running on GPUs?

WeiqunZhang commented on July 30, 2024

ROCm has a profiling tool. You can try that.

https://docs.amd.com/bundle/ROCm-Profiling-Tools-User-Guide-v5.3/page/Introduction_to_ROCm_Profiling_Tools_User_Guide.html

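For example, one could wrap the WarpX binary in rocprof to collect kernel statistics (a hedged sketch; rocprof ships with ROCm, but flags can differ between versions, and whether this works through the container runscript shown above is an assumption):

srun --export=ALL -- $CONT rocprof --stats warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck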

ax3l commented on July 30, 2024

@denisbertini just curious about the Singularity container, which I used before.

Without the patch above that @WeiqunZhang suggests, doesn't one usually start them as:

$ srun -n <NUMBER_OF_RANKS> singularity exec <PATH/TO/MY/IMAGE.sif> </PATH/TO/BINARY/WITHIN/CONTAINER>

https://docs.sylabs.io/guides/3.3/user-guide/mpi.html

So in your case

export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx

export OMPI_MCA_io=romio321
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs

srun --export=ALL singularity exec $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck

?


ax3l commented on July 30, 2024

Awesome. One hint for parallel sims: you can also try out our dynamic load-balancing capabilities.

With the AMReX block size, you can aim to create 4-12 blocks per GPU so the algorithm can move them around based on the cost function you pick. (Of course, your problem needs to be large enough to not underutilize the GPUs with too little work.) Generally, the Knapsack distribution works well, and you can use CPU and GPU timers or heuristics for cost estimates; see the hedged input-deck sketch after this comment.

Is your original issue addressed? We could continue in new issues if this is all set now :)

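As a starting point, the corresponding input-deck parameters might look like the following (a hedged sketch; parameter names and values should be checked against the WarpX documentation for your version):

# Keep grids small enough that each GPU owns several blocks the balancer can move
amr.max_grid_size = 128

# Rebalance periodically, using the knapsack strategy with timer-based costs
algo.load_balance_intervals = 100
algo.load_balance_costs_update = Timers
algo.load_balance_with_sfc = 0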

ax3l commented on July 30, 2024

@WeiqunZhang just checking, are you working on a related AMReX PR that we should link & track? :)

@denisbertini awesome 🤩 moved to #3994 to keep things organized :)


WeiqunZhang commented on July 30, 2024

@denisbertini Could you please give AMReX-Codes/amrex#3382 a try and let us know if it works for you?

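To test that PR before it lands in a release, the WarpX superbuild can be pointed at a specific AMReX repository and branch (a sketch; the WarpX_amrex_repo / WarpX_amrex_branch CMake options are assumed from the WarpX build system, and the branch name is a placeholder):

cmake -S . -B build_2d \
  -DWarpX_DIMS=2 -DWarpX_COMPUTE=HIP \
  -DWarpX_amrex_repo=https://github.com/AMReX-Codes/amrex.git \
  -DWarpX_amrex_branch=<pr-3382-branch>
cmake --build build_2d -j 16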
