Comments (14)
I also tried explicitly with atom_modify map array and got a crash:
This has to crash. You don't have enough RAM, and indexing with signed 32-bit integers will fail. LAMMPS defaults to map style array up to about 1 million atoms and then switches to map style hash, unless the user makes a different selection with the atom_modify command...
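A rough sketch of the arithmetic behind that statement. It assumes an array-style map holds one 4-byte entry per possible tag; that layout is an assumption made for illustration, not a reading of the LAMMPS source. The sample tag is the largest one quoted elsewhere in this thread (the Frontier run), used only as an example value; the helper names are hypothetical.

```python
INT32_MAX = 2**31 - 1        # largest signed 32-bit integer
MAX_TAG = 25_769_799_246     # largest atom tag quoted in this thread (sample value)


def fits_int32(value: int) -> bool:
    """True if value can be stored in a signed 32-bit integer."""
    return -2**31 <= value <= INT32_MAX


def array_map_bytes(max_tag: int, bytes_per_entry: int = 4) -> int:
    """Size of a tag-indexed array with one entry per tag in 0..max_tag.

    Assumed layout for illustration only.
    """
    return (max_tag + 1) * bytes_per_entry


if __name__ == "__main__":
    print("tag fits in int32:", fits_int32(MAX_TAG))
    print("array map size in GB:", array_map_bytes(MAX_TAG) / 1e9)
```

Under these assumptions the tag cannot be used as a 32-bit index, and a tag-indexed array would need on the order of 100 GB per rank.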
from lammps.
@hagertnl I think this is more a suboptimal use of Kokkos::BinSort, where it becomes inefficient due to having large bins on rank 0 because of the huge difference between the min and max tags. Using sort_by_key would not have the same problem because there are no bins. I will need to modify the code to add a new package option to build the map on the CPU.
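To make the "large bins" point concrete, here is a toy sketch in plain Python. It is not Kokkos::BinSort; it only assumes bins of equal width spread over [min, max], and the function names and tag values are made up for illustration.

```python
from collections import Counter


def bin_counts(keys, nbins):
    """Count keys per bin when equal-width bins cover [min(keys), max(keys)]."""
    lo, hi = min(keys), max(keys)
    span = hi - lo + 1
    # Integer arithmetic keeps the bin index exact for very large tags.
    return Counter((k - lo) * nbins // span for k in keys)


def largest_bin(keys, nbins):
    """Occupancy of the fullest bin."""
    return max(bin_counts(keys, nbins).values())


# 10,000 contiguous tags: the bins fill evenly.
compact = list(range(1_000_000, 1_010_000))
# Same tags plus one very low and one very high tag (hypothetical values).
spread = compact + [1, 25_000_000_000]

if __name__ == "__main__":
    print("compact:", largest_bin(compact, 100))
    print("spread: ", largest_bin(spread, 100))
```

With the two outliers added, almost every key lands in a single bin, so the per-bin work is no longer small. A comparison sort on the keys has no bins and does not depend on the key range in this way.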
from lammps.
Here is a permalink to the code I referenced:
lammps/src/KOKKOS/atom_map_kokkos.cpp
Lines 198 to 200 in c6a8f1f
I also did some debugging within this map_set method and found that the values of min and max are drastically different between rank 0 and every other rank: rank 0's min/max is 1 and the largest atom tag in the global system, while every other rank seems to have a much smaller min/max range.
For example, in a run with 25 billion atoms spread across just 32 nodes x 56 ranks per node on Frontier (it fits in memory at this stage, I'm not endorsing this as a good idea though), which is 14 million atoms per rank, here are some reported min, max, and nall values:
Rank 0: nall=14377536, nmax=15826944, min=1, max=25769799246
Rank 1: nall=14377536, nmax=15826944, min=1207960849, max=3618372585
Rank 2: nall=14377536, nmax=15826944, min=2818573585, max=5228985321
nall seems to be responsible for the lengths of the maps being sorted, so I was checking whether it was different for rank 0. It is not, but max-min for ranks 1 and 2 is about 10x smaller than for rank 0.
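The "10x" figure can be checked directly from the three lines reported above. This is only arithmetic on the quoted numbers; the variable and function names are hypothetical.

```python
# (min, max) tag per rank, copied from the output quoted above
REPORTED = {
    0: (1, 25_769_799_246),
    1: (1_207_960_849, 3_618_372_585),
    2: (2_818_573_585, 5_228_985_321),
}
NALL = 14_377_536  # same on all three ranks


def tag_span(rank: int) -> int:
    """max - min of the atom tags reported for a rank."""
    lo, hi = REPORTED[rank]
    return hi - lo


if __name__ == "__main__":
    for rank in REPORTED:
        span = tag_span(rank)
        print(f"rank {rank}: span={span}, span/nall={span / NALL:.0f}")
    print("rank 0 vs rank 1:", round(tag_span(0) / tag_span(1), 2))
```

Ranks 1 and 2 have identical spans, and the rank 0 span is between 10 and 11 times larger while nall is the same, so the same number of keys is spread over a much wider range on rank 0.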
from lammps.
@hagertnl sorry to hear about the slowdown. Just to be sure, you are using a "hash" style atom map, correct? This looks like a bug; rank 0 should not be trying to sort 25 billion tags all by itself. I will take a look.
from lammps.
Hi @stanmoore1 , no problem, I know this is an edge case from the most common LAMMPS usage. I typically let the atom map style default, and with the base system at 98k atoms, it sounds like it probably defaults to array.
That said, I re-ran the 4096-node GPU-Kokkos problem that took 420 seconds with atom_modify map hash explicitly set after atom_style full and it took 420 seconds on the nose.
I also tried explicitly with atom_modify map array and got a crash:
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorIllegalAddress): an illegal memory access was encountered ../../lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:154
from lammps.
This issue would be well fixed by kokkos/kokkos#6801 bypassing Kokkos::BinSort, but it won't be released for a while. In the meantime, it may make sense to just have an option to build the atom map on the CPU instead of GPU.
from lammps.
What would be the best way to do that? Last time I tried, I got errors saying that the Kokkos-enabled atom map must be used with the Kokkos pair styles and fixes, so I couldn't, for example, turn the suffix off around atom_style. I may have been doing something wrong.
So this looks like a Kokkos performance bug, not any issue with LAMMPS source?
from lammps.
Do you have an idea of the timeline for when that might be available to try out? We have an internal deadline approaching, so I might just need to revert to the older source from before the changes and cherry-pick some of the integer overflow fixes we need.
Thanks for all your help on this!
from lammps.
Or, is there a patch we can put in place that forces the atom map to be on the CPU in the meantime until a nice package option becomes available?
from lammps.
I will look at the code today.
from lammps.
@hagertnl just curious, when you use 32 nodes x 56 ranks per node, what is your MPI decomposition output from LAMMPS? E.g. 2 by 2 by 2 MPI processor grid for 8 ranks.
from lammps.
@stanmoore1 8 by 14 by 16 MPI processor grid
from lammps.
I still don't understand why rank 0 has tags ranging from 1 to N with that decomposition at initialization, but I think the issue would be similar after running for a while, once atoms migrate around and a proc by chance gets atoms with both high and low tags, which seems to make Kokkos::BinSort run very inefficiently. There are some other Kokkos sorting options but none support AMD/HIP yet (see kokkos/kokkos#6793). For Frontier, I think just reverting back to the host code is the best temporary workaround. I will try to submit a PR in a bit.
from lammps.
The decomposition at 4096 Summit nodes is 24 by 32 by 32 MPI processor grid, if that's helpful to know.
I didn't understand that either. I was wondering if it had something to do with the way the replicate worked out, where rank 0 owned the lowest numbered atoms and happened to also own the highest numbered atoms, too. But I hadn't looked into it that much.
For the context of our current needs, just forcing all that back to the CPU is good with me.
from lammps.