RemiLehe commented on July 30, 2024

Thanks for reporting this.

I think it is indeed expected that the ES solver does not scale as well as the EM solver. The ES solver does require more MPI communications than the EM solver, and your observations are in line with what other WarpX users have seen when trying to scale the ES solver with multiple GPUs.
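To make the communication argument concrete, here is a toy sketch in plain Python. It is an illustration only, not how WarpX or the AMReX MLMG solver is implemented: a 1D Jacobi solve of -phi'' = rho with phi = 0 at both ends, where the grid is split into contiguous chunks that stand in for MPI ranks. Every iteration needs a ghost-cell (halo) exchange between neighbouring chunks before the sweep, so the message count grows with the number of solver iterations, whereas an explicit EM field update needs only a fixed, small number of exchanges per time step. The helpers `split_sizes` and `jacobi_poisson_1d` are hypothetical names.

```python
"""Toy illustration: an iterative Poisson solve needs a halo exchange per iteration.

Hypothetical sketch, not WarpX/AMReX code. Solves -phi'' = rho on a 1D grid
with phi = 0 at both ends using Jacobi iteration, with the interior points
split into contiguous chunks that stand in for MPI ranks.
"""


def split_sizes(n, nranks):
    """Contiguous chunk sizes that differ by at most one cell."""
    base, extra = divmod(n, nranks)
    return [base + (1 if r < extra else 0) for r in range(nranks)]


def jacobi_poisson_1d(rho, h, nranks, tol=1e-10, max_iter=200_000):
    """Return (phi, iterations, halo_messages) for the decomposed Jacobi solve."""
    sizes = split_sizes(len(rho), nranks)
    # Each chunk stores [left ghost] + owned cells + [right ghost]. Ghosts at the
    # physical boundaries are never overwritten, so they enforce phi = 0.
    local_phi = [[0.0] * (s + 2) for s in sizes]
    local_rho, start = [], 0
    for s in sizes:
        local_rho.append(rho[start:start + s])
        start += s

    messages = 0
    for it in range(1, max_iter + 1):
        # Halo exchange: every interior interface swaps one value in each direction.
        for r in range(nranks - 1):
            left, right = local_phi[r], local_phi[r + 1]
            left[-1] = right[1]
            right[0] = left[-2]
            messages += 2
        # Jacobi sweep: each chunk updates its owned cells from local data + ghosts.
        max_change = 0.0
        updated = []
        for phi, q in zip(local_phi, local_rho):
            new = phi[:]
            for i in range(1, len(phi) - 1):
                new[i] = 0.5 * (phi[i - 1] + phi[i + 1] + h * h * q[i - 1])
                max_change = max(max_change, abs(new[i] - phi[i]))
            updated.append(new)
        local_phi = updated
        if max_change < tol:
            break

    phi = [v for chunk in local_phi for v in chunk[1:-1]]
    return phi, it, messages


if __name__ == "__main__":
    n = 31                      # interior grid points
    h = 1.0 / (n + 1)
    rho = [1.0] * n             # uniform source; exact solution is x(1 - x) / 2
    for nranks in (1, 2, 4, 8):
        _, iters, msgs = jacobi_poisson_1d(rho, h, nranks)
        print(f"{nranks} chunks: {iters} iterations, {msgs} halo messages")
```

The iteration count is the same for every decomposition, but the number of halo messages grows with both the chunk count and the iteration count. MLMG converges in far fewer iterations than plain Jacobi, but its smoothers typically still exchange ghost cells on every multigrid level of every V-cycle, which is consistent with the large FillBoundary call counts reported later in this thread.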

Nevertheless, it might still be possible to find ways to improve the scaling. One thing you could try is to set amrex.use_gpu_aware_mpi=1, as this could potentially speed up the GPU-to-GPU MPI communications. Ah, I just saw that you are already using this.

Additionally, it would be helpful if you could post the TinyProfiler (TPROF) output printed at the end of the WarpX simulation, e.g. for the two-GPU run, to confirm that most of the time is being spent in the Poisson solver.
If you have the time, it could also be interesting to use an NVIDIA profiler (e.g. Nsight Systems) to check where the code is spending most of its time.

I also know that @pmessmer is interested in speeding up the ES solver in WarpX; maybe he'd have some suggestions.

from warpx.

RemiLehe commented on July 30, 2024

Btw, @archermarx, when attempting to run the Python script that you posted (but with numprocs = [1,1,1]), I get:

MLMG: Iteration 197 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 198 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 199 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 200 Fine resid/bnorm = 0.9891505021
MLMG: Failed to converge after 200 iterations. resid, resid/bnorm = 287945.6678, 0.9891505021
amrex::Abort::0::MLMG failed. !!!

at the first iteration.

Is that the case for you too? Or am I missing something (e.g., are you compiling a modified or older version of WarpX, or using non-default compiler flags)?

archermarx commented on July 30, 2024

Hi Remi,

No, on one process this runs to completion on my end. My build options are listed below. The only non-default setting I'm using (I think) is single-precision particles. I'm running WarpX v24.07.

# Build warpx
cmake -S . -B build \
        -DWarpX_LIB=ON \
        -DWarpX_APP=OFF \
        -DWarpX_MPI=ON \
        -DWarpX_COMPUTE=CUDA \
        -DWarpX_DIMS="1;2;3" \
        -DWarpX_PYTHON=ON \
        -DWarpX_PRECISION=DOUBLE \
        -DWarpX_PARTICLE_PRECISION=SINGLE

cmake --build build --target pip_install -j 20

archermarx commented on July 30, 2024

EDIT: issue resolved

archermarx commented on July 30, 2024

After resolving some issues, I have more realistic scaling results. They are not nearly as bad as before, but still suboptimal. First, I show the speedup over 1 GPU for different workloads on 1, 2, 4, and 8 GPUs:

[Figure: speedup over 1 GPU on 1, 2, 4, and 8 GPUs for different workloads]

Next, I show how the speedup grows as a function of workload:

[Figure: speedup as a function of workload]
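For reference, the quantities in scaling plots like these are usually computed as below. This is a minimal Python sketch, assuming strong scaling (the same workload on every GPU count) and one measured wall-clock time per run; `speedup_table` is a hypothetical helper, and the timings in it are placeholders, not the data behind the figures.

```python
"""Strong-scaling metrics from measured wall-clock times (illustrative sketch)."""


def speedup_table(times):
    """Return {N: (speedup, efficiency)} for a fixed workload.

    `times` maps GPU count N -> wall-clock time T(N) for the same problem and
    must include the 1-GPU baseline T(1). Speedup is S(N) = T(1) / T(N) and
    parallel efficiency is E(N) = S(N) / N (1.0 means ideal scaling).
    """
    if 1 not in times:
        raise ValueError("need the 1-GPU baseline time")
    t1 = times[1]
    return {n: (t1 / tn, t1 / tn / n) for n, tn in sorted(times.items())}


if __name__ == "__main__":
    # Placeholder timings for illustration only, NOT the measurements in the plots;
    # replace them with the run times of your own 1-, 2-, 4- and 8-GPU runs.
    placeholder_times = {1: 100.0, 2: 60.0, 4: 40.0, 8: 25.0}
    for n, (s, e) in speedup_table(placeholder_times).items():
        print(f"{n} GPU(s): speedup {s:.2f}, efficiency {e:.0%}")
```

Plotting efficiency alongside raw speedup makes the gap from ideal scaling easier to read. A speedup that rises with workload is consistent with fixed communication costs being amortized over more computation per GPU.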

TinyProf insights

I've attached the TinyProfiler output for 1 GPU and 8 GPUs to this comment. Here are some of the main insights:

  • With 1 GPU, one Evolve step spends about 30% of its time on the field solve, 45% on gather and push, and 22% on deposition.
  • With 8 GPUs, we spend 60% of the time on the field solve, 20% on gather and push, and 20% on deposition.
  • With 8 GPUs, we spend nearly 45% of the total runtime in the following three functions, which barely register at all in the 1 GPU case:
Name                                                  NCalls        Min        Avg        Max   Max %
FillBoundary_nowait()                                 392555      8.867      10.11      14.23  17.20%
FabArray::ParallelCopy_finish()                        31000      1.413      10.11      11.51  13.92%
FillBoundary_finish()                                 392555      9.539      10.58      11.33  13.69%

This is a huge fraction. Any idea how to speed this up?
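As a quick check of the "nearly 45%" figure, the three TinyProfiler rows above can be parsed and summed. This is a minimal Python sketch, assuming the whitespace-separated layout shown (routine name, call count, min/avg/max time across ranks in seconds, max %) and routine names without spaces; `parse_tprof_rows` is a hypothetical helper, not part of AMReX or WarpX.

```python
"""Parse TinyProfiler-style table rows and summarize them (illustrative sketch)."""

# The three rows quoted above, copied verbatim.
TPROF_ROWS = """\
FillBoundary_nowait()                                 392555      8.867      10.11      14.23  17.20%
FabArray::ParallelCopy_finish()                        31000      1.413      10.11      11.51  13.92%
FillBoundary_finish()                                 392555      9.539      10.58      11.33  13.69%
"""


def parse_tprof_rows(text):
    """Parse rows of the form: name ncalls min avg max pct%."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6 or not parts[-1].endswith("%"):
            continue  # skip headers, separator lines and blank lines
        name, ncalls, tmin, tavg, tmax, pct = parts
        rows.append({
            "name": name,
            "ncalls": int(ncalls),
            "min": float(tmin),
            "avg": float(tavg),
            "max": float(tmax),
            "max_pct": float(pct.rstrip("%")),
        })
    return rows


if __name__ == "__main__":
    rows = parse_tprof_rows(TPROF_ROWS)
    total = sum(r["max_pct"] for r in rows)
    print(f"{total:.2f}% of the runtime in these three routines")
    for r in rows:
        # A large max/min spread across ranks can point to load imbalance or to
        # ranks waiting on messages from slower neighbours.
        print(f"{r['name']}: max/min time across ranks = {r['max'] / r['min']:.1f}x")
```

The same parser can be pointed at the full attached profiles to rank routines by their share of the runtime, as long as the rows keep this six-column layout.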


tinyprof_1gpu.txt
tinyprof_8gpu.txt
picmi.txt
warpx_inputs.txt