Comments (5)
Thanks for reporting this.
I think that the fact that the ES solver does not scale as well as the EM solver is indeed expected. The ES solver does require more MPI communications than the EM solver, and your observations are in line with what other WarpX users have seen when trying to scale the ES solver with multiple GPUs.
Nevertheless, it might still be possible to find ways to improve the scaling. One thing you could try is to set amrex.use_gpu_aware_mpi=1, as this could potentially speed up the GPU-to-GPU MPI communications. Ah, I just saw that you are already using this.
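For anyone else reading this thread: amrex.use_gpu_aware_mpi is a runtime AMReX parameter, so it can be toggled without rebuilding. A minimal sketch (the input-file, executable, and launcher names are placeholders, not taken from this issue):
# either set it in the inputs file:
#   amrex.use_gpu_aware_mpi = 1
# or override it on the command line; AMReX parses trailing key=value pairs
mpiexec -n 2 ./warpx.3d warpx_inputs amrex.use_gpu_aware_mpi=1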
Additionally, it could be helpful if you could post the TPROF output (printed at the end of the WarpX simulation) for e.g. the two-GPU simulation, just to confirm that most of the time is being spent in the Poisson solver.
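Capturing that table should just be a matter of keeping stdout, since TinyProfiler prints its summary when the run finishes. A sketch, with placeholder script and file names:
# keep the full log; the TinyProfiler tables appear at the end of stdout
mpiexec -n 2 python3 picmi_script.py > warpx_2gpu.out 2>&1
grep -A 30 "TinyProfiler" warpx_2gpu.out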
If you have the time, it could also be interesting to use the NVIDIA profiler to check where the code is spending most of its time.
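For the NVIDIA profiler, something along these lines should work (a sketch, not a prescribed recipe: the trace options and output naming are illustrative, and %q{OMPI_COMM_WORLD_RANK} assumes Open MPI):
# one Nsight Systems report per MPI rank; open the .nsys-rep files in the GUI
mpiexec -n 2 nsys profile -t cuda,nvtx \
    -o warpx_es.%q{OMPI_COMM_WORLD_RANK} python3 picmi_script.py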
I also know that @pmessmer is interested in speeding up the ES solver in WarpX; maybe he'd have some suggestions.
Btw, @archermarx when attempting to run the Python script that you posted (but with numprocs = [1,1,1]), I get:
MLMG: Iteration 197 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 198 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 199 Fine resid/bnorm = 0.9891505021
MLMG: Iteration 200 Fine resid/bnorm = 0.9891505021
MLMG: Failed to converge after 200 iterations. resid, resid/bnorm = 287945.6678, 0.9891505021
amrex::Abort::0::MLMG failed. !!!
at the first iteration.
Do you see the same thing? Or am I missing something (e.g. are you compiling a modified/older version of WarpX, or are you using non-default compiler flags)?
Hi Remi,
No, running on one proc, this runs to completion on my end. My compiler options are listed below. The only non-default thing I'm using (I think) is single-precision particles. I'm running WarpX v24.07.
# Build warpx
cmake -S . -B build \
-DWarpX_LIB=ON \
-DWarpX_APP=OFF \
-DWarpX_MPI=ON \
-DWarpX_COMPUTE=CUDA \
-DWarpX_DIMS="1;2;3" \
-DWarpX_PYTHON=ON \
-DWarpX_PRECISION=DOUBLE \
-DWarpX_PARTICLE_PRECISION=SINGLE
cmake --build build --target pip_install -j 20
EDIT: issue resolved
After resolving some issues, I have more realistic scaling results. Not nearly as bad as before, but still suboptimal. First, I show the speedup over 1 GPU for different workloads on 1, 2, 4, and 8 GPUs:
Next, I show how the speedup grows as a function of workload.
TinyProf insights
I've attached the TinyProfiler output for 1 GPU and 8 GPUs below. Here are some of the main insights:
- With 1 GPU, one Evolve step spends about 30% of its time on the field solve, 45% on gather and push, and 22% on deposition.
- With 8 GPUs, we spend 60% of the time on the field solve, 20% on gather and push, and 20% on deposition.
- With 8 GPUs, we spend nearly 45% of the total runtime on the following three functions, which barely register at all in the 1-GPU case:
Function                         NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %
FillBoundary_nowait()            392555      8.867      10.11      14.23  17.20%
FabArray::ParallelCopy_finish()   31000      1.413      10.11      11.51  13.92%
FillBoundary_finish()            392555      9.539      10.58      11.33  13.69%
This is a huge fraction. Any idea how to speed this up?
tinyprof_1gpu.txt
tinyprof_8gpu.txt
picmi.txt
warpx_inputs.txt