Comments (5)
Thanks for reporting this.
I think that the fact that the ES solver does not scale as well as the EM solver is indeed expected. The ES solver does require more MPI communications than the EM solver, and your observations are in line with what other WarpX users have seen when trying to scale the ES solver with multiple GPUs.
Nevertheless, it might still be possible to find ways to improve the scaling. One thing you could try is to set amrex.use_gpu_aware_mpi=1, as this could potentially speed up the GPU-to-GPU MPI communications. Ah, I just saw that you are already using this.
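For anyone else reading this thread: amrex.use_gpu_aware_mpi is a runtime AMReX parameter, so it can be toggled without rebuilding. A minimal sketch (the input-file, executable, and launcher names are placeholders, not taken from this issue):
# either set it in the inputs file:
#   amrex.use_gpu_aware_mpi = 1
# or override it on the command line; AMReX parses trailing key=value pairs
mpiexec -n 2 ./warpx.3d warpx_inputs amrex.use_gpu_aware_mpi=1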
Additionally, it could be helpful if you could post the TPROF output (printed at the end of the WarpX simulation) for e.g. the two-GPU simulation, just to confirm that most of the time is being spent in the Poisson solver.
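Capturing that table should just be a matter of keeping stdout, since TinyProfiler prints its summary when the run finishes. A sketch, with placeholder script and file names:
# keep the full log; the TinyProfiler tables appear at the end of stdout
mpiexec -n 2 python3 picmi_script.py > warpx_2gpu.out 2>&1
grep -A 30 "TinyProfiler" warpx_2gpu.out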
If you have the time, it could also be interesting to use the NVIDIA profiler to check where the code is spending most of its time.
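For the NVIDIA profiler, something along these lines should work (a sketch, not a prescribed recipe: the trace options and output naming are illustrative, and %q{OMPI_COMM_WORLD_RANK} assumes Open MPI):
# one Nsight Systems report per MPI rank; open the .nsys-rep files in the GUI
mpiexec -n 2 nsys profile -t cuda,nvtx \
    -o warpx_es.%q{OMPI_COMM_WORLD_RANK} python3 picmi_script.py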
I also know that @pmessmer is interested in speeding up the ES solver in WarpX; maybe he'd have some suggestions.
Btw, @archermarx when attempting to run the Python script that you posted (but with numprocs = [1,1,1]), I get:
MLMG: Iteration 197 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 198 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 199 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 200 Fine resid/bnorm = 0.9891505021
MLMG: Failed to converge after 200 iterations. resid, resid/bnorm = 287945.6678, 0.9891505021
amrex::Abort::0::MLMG failed. !!!
at the first iteration.
Do you see the same thing? Or am I missing something (e.g. are you compiling a modified/older version of WarpX, or are you using non-default compiler flags)?
Hi Remi,
No, running on one proc, this runs to completion on my end. My compiler options are listed below. The only non-default thing I'm using (I think) is single-precision particles. I'm running WarpX v24.07.
# Build warpx
cmake -S . -B build \
-DWarpX_LIB=ON \
-DWarpX_APP=OFF \
-DWarpX_MPI=ON \
-DWarpX_COMPUTE=CUDA \
-DWarpX_DIMS="1;2;3" \
-DWarpX_PYTHON=ON \
-DWarpX_PRECISION=DOUBLE \
-DWarpX_PARTICLE_PRECISION=SINGLE
cmake --build build --target pip_install -j 20
EDIT: issue resolved
After resolving some issues, I have more realistic scaling results. Not nearly as bad as before, but still suboptimal. First, I show the speedup over 1 GPU for different workloads on 1, 2, 4, and 8 GPUs:
Next, I show how the speedup grows as a function of workload.
TinyProf insights
I've attached the TinyProfiler output for 1 GPU and 8 GPUs below. Here are some of the main insights:
- With 1 GPU, one Evolve step spends about 30% of its time on the field solve, 45% on gather and push, and 22% on deposition.
- With 8 GPUs, we spend 60% of the time on the field solve, 20% on gather and push, and 20% on deposition.
- With 8 GPUs, we spend nearly 45% of the total runtime on the following three functions, which barely register at all in the 1-GPU case:
Function                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
FillBoundary_nowait()            392555      8.867      10.11      14.23  17.20%
FabArray::ParallelCopy_finish()   31000      1.413      10.11      11.51  13.92%
FillBoundary_finish()            392555      9.539      10.58      11.33  13.69%
This is a huge fraction. Any idea how to speed this up?
tinyprof_1gpu.txt
tinyprof_8gpu.txt
picmi.txt
warpx_inputs.txt