Comments (10)
amrex::Gpu::streamSynchronize waits for asynchronous GPU kernels launched earlier to finish, and then checks whether the CUDA runtime has reported any errors from those earlier kernels.
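To illustrate why a crash can surface at amrex::Gpu::streamSynchronize rather than in the kernel that actually faulted, here is the same deferred-error pattern sketched with Python futures (an analogy only, not AMReX code):

```python
from concurrent.futures import ThreadPoolExecutor

def kernel():
    # Stand-in for a GPU kernel that hits an illegal memory access.
    raise RuntimeError("illegal memory access")

with ThreadPoolExecutor() as pool:
    fut = pool.submit(kernel)   # "launch" returns immediately; no error yet
    try:
        fut.result()            # "streamSynchronize": wait, then see the error
    except RuntimeError as exc:
        print("error surfaced only at synchronization:", exc)
```

The launch call succeeds even though the work will fail; the error only becomes visible at the synchronization point, which is why the backtrace points there.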
Could you provide an inputs file for C++ that I can test without using Python?
from warpx.
Certainly, here's the file generated by the Python interface, with some apparently duplicated outputs removed.
max_step = 1000
warpx.verbose = 1
warpx.const_dt = 5e-12
warpx.numprocs = 2 2 2
warpx.do_electrostatic = labframe
warpx.self_fields_required_precision = 1e-05
amr.n_cell = 128 128 128
amr.max_grid_size = 32
amr.blocking_factor = 1
amr.max_level = 0
amrex.use_gpu_aware_mpi = 1
geometry.dims = 3
geometry.prob_lo = -2e-05 -2e-05 -2e-05
geometry.prob_hi = 2e-05 2e-05 2e-05
geometry.is_periodic = 1 1 1
geometry.coord_sys = 0
boundary.field_lo = periodic periodic periodic
boundary.field_hi = periodic periodic periodic
boundary.particle_lo = periodic periodic periodic
boundary.particle_hi = periodic periodic periodic
algo.current_deposition = direct
algo.particle_shape = 1
particles.species_names = electrons
electrons.mass = m_e
electrons.charge = -q_e
electrons.injection_style = nuniformpercell
electrons.initialize_self_fields = 0
electrons.num_particles_per_cell_each_dim = 8 8 8
electrons.xmin = -2e-05
electrons.xmax = 2e-05
electrons.ymin = -2e-05
electrons.ymax = 2e-05
electrons.zmin = -2e-05
electrons.zmax = 2e-05
electrons.momentum_distribution_type = gaussian
electrons.ux_m = 0.0
electrons.uy_m = 0.0
electrons.uz_m = 0.0
electrons.ux_th = 0.01
electrons.uy_th = 0.01
electrons.uz_th = 0.01
electrons.profile = constant
electrons.density = 1e+25
amrex.abort_on_out_of_gpu_memory = 1
amrex.the_arena_is_managed = 0
amrex.omp_threads = nosmt
tiny_profiler.device_synchronize_around_region = 1
particles.do_tiling = 0
amrex.v = 1
amrex.verbose = 1
amrex.max_gpu_streams = 4
device.v = 0
device.verbose = 0
device.numThreads.x = 0
device.numThreads.y = 0
device.numThreads.z = 0
device.numBlocks.x = 0
device.numBlocks.y = 0
device.numBlocks.z = 0
device.graph_init = 0
device.graph_init_nodes = 10000
amrex.regtest_reduction = 0
amrex.signal_handling = 1
amrex.throw_exception = 0
amrex.call_addr2line = 1
amrex.abort_on_unused_inputs = 0
amrex.handle_sigsegv = 1
amrex.handle_sigterm = 0
amrex.handle_sigint = 1
amrex.handle_sigabrt = 1
amrex.handle_sigfpe = 1
amrex.handle_sigill = 1
amrex.fpe_trap_invalid = 0
amrex.fpe_trap_zero = 0
amrex.fpe_trap_overflow = 0
amrex.the_arena_init_size = 63697108992
amrex.the_device_arena_init_size = 8388608
amrex.the_managed_arena_init_size = 8388608
amrex.the_pinned_arena_init_size = 8388608
amrex.the_comms_arena_init_size = 8388608
amrex.the_arena_release_threshold = 9223372036854775807
amrex.the_device_arena_release_threshold = 9223372036854775807
amrex.the_managed_arena_release_threshold = 9223372036854775807
amrex.the_pinned_arena_release_threshold = 42464739328
amrex.the_comms_arena_release_threshold = 9223372036854775807
amrex.the_async_arena_release_threshold = 9223372036854775807
fab.init_snan = 0
DistributionMapping.v = 0
DistributionMapping.verbose = 0
DistributionMapping.efficiency = 0.90000000000000002
DistributionMapping.sfc_threshold = 0
DistributionMapping.node_size = 0
DistributionMapping.verbose_mapper = 0
fab.initval = nan
fab.do_initval = 0
fabarray.maxcomp = 25
amrex.mf.alloc_single_chunk = 0
vismf.v = 0
vismf.headerversion = 1
vismf.groupsets = 0
vismf.setbuf = 1
vismf.usesingleread = 0
vismf.usesinglewrite = 0
vismf.checkfilepositions = 0
vismf.usepersistentifstreams = 0
vismf.usesynchronousreads = 0
vismf.usedynamicsetselection = 1
vismf.iobuffersize = 2097152
vismf.allowsparsewrites = 1
amrex.async_out = 0
amrex.async_out_nfiles = 64
amrex.vector_growth_factor = 1.5
machine.verbose = 0
machine.very_verbose = 0
tiny_profiler.verbose = 0
tiny_profiler.v = 0
tiny_profiler.print_threshold = 1
amrex.use_profiler_syncs = 0
amr.v = 0
amr.n_proper = 1
amr.grid_eff = 0.69999999999999996
amr.refine_grid_layout = 1
amr.refine_grid_layout_x = 1
amr.refine_grid_layout_y = 1
amr.refine_grid_layout_z = 1
amr.check_input = 1
vismf.usesingleread = 1
vismf.usesinglewrite = 1
particles.particles_nfiles = 1024
particles.use_prepost = 0
particles.do_unlink = 1
particles.do_mem_efficient_sort = 1
lattice.reverse = 0
I can see an issue. For the first multigrid solve, the min and max of the rhs are 1.8095128179727603e+17 and 1.8095128179728016e+17. Since the problem has all-periodic boundaries, the matrix is singular, and the solver is having trouble with this singular problem whose rhs is almost constant.
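The singularity being described can be seen in a minimal 1D analogue (a NumPy sketch, not WarpX's actual solver): the periodic Laplacian annihilates constant vectors, so a solution exists only when the rhs has zero mean, and a nearly constant rhs is close to the worst case.

```python
import numpy as np

n = 16
# 1D finite-difference Laplacian with periodic boundaries
A = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
A[0, -1] = A[-1, 0] = 1.0

# The constant vector lies in the null space, so the matrix is singular.
assert np.allclose(A @ np.ones(n), 0.0)
print(np.linalg.matrix_rank(A))  # prints 15, i.e. n - 1

# Solvability requires the rhs to be orthogonal to the null space
# (zero mean); a constant rhs violates this maximally.
rhs = np.full(n, 1.0)
print(rhs.mean())  # 1.0, far from the required 0
```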
I don't know if this is the issue you are seeing. I only tested it with a much smaller setup on a single CPU core. Could you try with the following change? If the hack works, we can then discuss how to implement a real fix.
--- a/Source/ablastr/fields/PoissonSolver.H
+++ b/Source/ablastr/fields/PoissonSolver.H
@@ -265,8 +265,19 @@ computePhi (amrex::Vector<amrex::MultiFab*> const & rho,
mlmg.setAlwaysUseBNorm(always_use_bnorm);
// Solve Poisson equation at lev
- mlmg.solve( {phi[lev]}, {rho[lev]},
- relative_tolerance, absolute_tolerance );
+ auto rhomin = rho[lev]->min(0);
+ auto rhomax = rho[lev]->max(0);
+ {
+ amrex::Print().SetPrecision(17) << "xxxxx " << rhomin
+ << ", " << rhomax
+ << ", " << (rhomax-rhomin)/rhomin << std::endl;
+ }
+ if (std::abs(rhomax-rhomin) <= 1.e-12_rt * std::abs(rhomax+rhomin)) {
+ phi[lev]->setVal(0.0_rt);
+ } else {
+ mlmg.solve( {phi[lev]}, {rho[lev]},
+ relative_tolerance, absolute_tolerance );
+ }
// needed for solving the levels by levels:
// - coarser level is initial guess for finer level
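The guard in the patch above compares the spread of the rhs to its magnitude. A quick check (a Python sketch of the same condition) confirms that the min/max values reported earlier in the thread would trigger it:

```python
def rhs_nearly_constant(rmin, rmax, rel_tol=1e-12):
    # Same condition as in the patch: |max - min| <= tol * |max + min|
    return abs(rmax - rmin) <= rel_tol * abs(rmax + rmin)

# min/max of the rhs reported above
print(rhs_nearly_constant(1.8095128179727603e+17, 1.8095128179728016e+17))
# prints True: the hack would skip the solve and set phi to zero
```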
In this test, there are no initial fields because all particles are uniformly distributed. But the particles have ux_th = 0.01, uy_th = 0.01 and uz_th = 0.01, and the test uses warpx.const_dt = 5e-12. So in one step the rms velocity will move a particle by about 1.5e-5 m, while the domain size is only 4e-5 m. Some particles are probably so far out of bounds that one periodic shift will not bring them back into the domain. This results in out-of-bounds array accesses, and thus the invalid memory access.
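The displacement estimate can be checked in a few lines (a back-of-the-envelope sketch; u_th is the normalized rms momentum, which for u << 1 is approximately the velocity in units of c):

```python
c = 299_792_458.0     # speed of light, m/s
u_th = 0.01           # normalized rms momentum from the inputs file
dt = 5e-12            # warpx.const_dt, s
step = u_th * c * dt  # rms displacement per step
domain = 4e-5         # prob_hi - prob_lo, m
print(step, step / domain)  # ~1.5e-5 m, ~0.37 of the domain per step
```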
Maybe you need to use a smaller const_dt, or change the initial setup. Or maybe WarpX needs a way to start the simulation with a smaller dt and then gradually increase it to warpx.const_dt. @RemiLehe
Looks like reducing the timestep fixed the primary issue, thanks! I didn't notice much of a change by implementing your precision fix, but I have run into problems with this sort of thing before. It's quite common to initialize domains with uniform plasmas which may then become excited by a perturbation. It would be good if the ES solver could handle uniform plasmas gracefully. I noticed that in the first ten iterations of these uniform plasma tests, each timestep took between 2 and 10 seconds, versus 0.4 seconds per step once the simulation had progressed a bit. I suspect this may be related.
Unfortunately, I am still having issues with the simulation not finalizing. It hangs just before outputting the expected "AMReX finalized" when running the ES simulation, but not when running EM.
I cannot reproduce the hang before "AMReX finalized". Maybe it's in the Python part?
Unfortunately not. Running it with the binaries directly still exhibits this problem on my system. I will try running with CPU only later to see if that is the issue.
We need to figure out where it hangs. If your job is interactive, pressing Ctrl-C might produce backtrace files. If it's a batch job, you need to send a signal to the hanging job to terminate it, and hopefully backtrace files are then produced. With Slurm, you could use scancel --signal=INT, or submit with sbatch --signal=INT@300 (where 300 means SIGINT is sent 300 seconds before the time limit).
If you are using CMake to build the code, you probably want to build with RelWithDebInfo, not Release, so that some debug information gets into the executable.
I don't understand how Python handles signals. You may need to run the executable directly, without Python in the middle, for the signal handling to work properly.
OK, I found that reducing the timestep further, to 1e-14 seconds, seemingly fixes all of the problems. It might be nice to emit a warning if the user has picked a timestep that is likely to cause problems. I can make a PR for that, if that seems reasonable.
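A sanity check of the kind suggested could look like this (a hypothetical helper with made-up names, not WarpX's actual API): warn when the rms thermal motion crosses more than about one cell per step.

```python
def warn_if_dt_too_large(dt, dx, u_th, c=299_792_458.0, max_cells_per_step=1.0):
    """Warn when rms thermal motion crosses more than
    max_cells_per_step cells in a single step."""
    v_rms = u_th * c / (1.0 + u_th**2) ** 0.5  # convert u = gamma*beta to v
    cells = v_rms * dt / dx
    if cells > max_cells_per_step:
        print(f"Warning: rms thermal motion crosses {cells:.1f} cells/step; "
              f"consider dt <= {max_cells_per_step * dx / v_rms:.1e} s")
    return cells

# With the values from this thread: dx = 4e-5 / 128, dt = 5e-12, u_th = 0.01,
# the check fires (tens of cells per step); with dt = 1e-14 it passes silently.
cells = warn_if_dt_too_large(5e-12, 4e-5 / 128, 0.01)
```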