[BUG] Major performance regression in Special.build() component of replicate at large scale (512 nodes+) about lammps HOT 14 CLOSED

hagertnl commented on June 18, 2024

[BUG] Major performance regression in Special.build() component of replicate at large scale (512 nodes+)

from lammps.

Comments (14)

akohlmey commented on June 18, 2024 2

I also tried explicitly with atom_modify map array and got a crash:

This has to crash. You don't have enough RAM, and indexing with signed 32-bit integers will fail. LAMMPS defaults to map style array up to about 1 million atoms and then switches to map style hash, unless a user makes a different selection with the atom_modify command...
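
To put rough numbers on the RAM and 32-bit indexing points, here is a minimal sketch. It assumes an array-style map keeps one slot per possible atom ID and that each slot is 4 bytes; both the helper names and the slot size are illustrative assumptions, not taken from the LAMMPS source.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Bytes needed for a lookup table with one slot per possible atom ID
// (IDs 0..max_tag). bytes_per_entry = 4 is an assumption for illustration.
std::int64_t array_map_bytes(std::int64_t max_tag, std::int64_t bytes_per_entry = 4) {
  assert(max_tag >= 0 && bytes_per_entry > 0);
  return (max_tag + 1) * bytes_per_entry;
}

// True if an atom ID can be used as a signed 32-bit index without overflow.
bool fits_in_int32(std::int64_t tag) {
  return tag >= 0 && tag <= std::numeric_limits<std::int32_t>::max();
}
```

Using the largest tag reported in this thread for the 25 billion atom run (25769799246) purely as an example input, `array_map_bytes` gives roughly 103 GB per rank under these assumptions, and `fits_in_int32` returns false for that tag.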

stanmoore1 commented on June 18, 2024 1

@hagertnl I think this is more a suboptimal use of Kokkos::BinSort, where it becomes inefficient due to having large bins on rank 0 because of the huge difference between the min and max tags. Using sort_by_key would not have the same problem because there are no bins. I will need to modify the code to add a new package option to build the map on the CPU.
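
As a host-side illustration of the "no bins" point, the sketch below sorts tags with a plain comparison sort and reuses the resulting permutation for a second per-atom array. It uses only the C++ standard library, not Kokkos, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Return the permutation that sorts `tags` ascending. A comparison sort
// never looks at (max - min), so widely spread tags cost the same as
// tightly packed ones.
std::vector<int> sort_permutation(const std::vector<std::int64_t>& tags) {
  std::vector<int> perm(tags.size());
  std::iota(perm.begin(), perm.end(), 0);
  std::stable_sort(perm.begin(), perm.end(),
                   [&tags](int a, int b) { return tags[a] < tags[b]; });
  return perm;
}

// Apply a permutation to any per-atom array (tags, local indices, ...).
template <typename T>
std::vector<T> apply_permutation(const std::vector<T>& in, const std::vector<int>& perm) {
  assert(in.size() == perm.size());
  std::vector<T> out(in.size());
  for (std::size_t k = 0; k < perm.size(); ++k) out[k] = in[perm[k]];
  return out;
}
```

Computing one permutation and applying it to both the tags and the local indices follows the same pattern as a create-permutation-then-sort-each-array flow, just without any dependence on the tag range.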

hagertnl commented on June 18, 2024

Here is a permalink to the code I referenced:

lammps/src/KOKKOS/atom_map_kokkos.cpp

Lines 198 to 200 in c6a8f1f

    mapSorter.create_permute_vector(LMPDeviceType());
    mapSorter.sort(LMPDeviceType(), d_tag_sorted, 0, nall);
    mapSorter.sort(LMPDeviceType(), d_i_sorted, 0, nall);

I also did some debugging within this map_set method and found that the values of min and max differ drastically between rank 0 and every other rank: rank 0's min/max are 1 and the largest atom tag in the global system, while every other rank seems to have a much narrower min/max range.
For example, in a run with 25 billion atoms spread across just 32 nodes x 56 ranks per node on Frontier (it fits in memory at this stage, though I'm not endorsing this as a good idea), which is about 14 million atoms per rank, here are some reported min/max and nall values:

Rank 0: nall=14377536, nmax=15826944, min=1, max=25769799246
Rank 1: nall=14377536, nmax=15826944, min=1207960849, max=3618372585
Rank 2: nall=14377536, nmax=15826944, min=2818573585, max=5228985321

nall seems to determine the length of the maps being sorted, so I checked whether it was different for rank 0; it is the same on all three ranks. But max-min for ranks 1 and 2 is roughly 10x smaller than for rank 0.
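
The comparison of max-min across ranks can be checked with a few lines of arithmetic. This is a small sketch with hypothetical helper names, fed only with the values reported above.

```cpp
#include <cassert>
#include <cstdint>

// Width of the tag interval [min_tag, max_tag] that a range-based sorter has to cover.
std::int64_t tag_span(std::int64_t min_tag, std::int64_t max_tag) {
  assert(max_tag >= min_tag);
  return max_tag - min_tag;
}

// Average spacing between tags if nall atoms were spread evenly over the span.
double span_per_atom(std::int64_t min_tag, std::int64_t max_tag, std::int64_t nall) {
  assert(nall > 0);
  return static_cast<double>(tag_span(min_tag, max_tag)) / static_cast<double>(nall);
}
```

With the reported numbers, the span is 25769799245 on rank 0 and 2410411736 on both rank 1 and rank 2, a ratio of about 10.7. Spread over nall=14377536, that is an average spacing of about 1792 tags per atom on rank 0 versus about 168 on the other two ranks.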

stanmoore1 commented on June 18, 2024

@hagertnl sorry to hear about the slowdown. Just to be sure, you are using a "hash" style atom map, correct? This looks like a bug; rank 0 should not be trying to sort 25 billion tags all by itself. I will take a look.

hagertnl commented on June 18, 2024

Hi @stanmoore1, no problem, I know this is an edge case relative to the most common LAMMPS usage. I typically let the atom map style default, and with the base system at 98k atoms, it sounds like it probably defaults to array.

That said, I re-ran the 4096-node GPU-Kokkos problem that took 420 seconds with atom_modify map hash explicitly set after atom_style full and it took 420 seconds on the nose.

I also tried explicitly with atom_modify map array and got a crash:

(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorIllegalAddress): an illegal memory access was encountered ../../lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:154

stanmoore1 commented on June 18, 2024

This issue would be fixed well by kokkos/kokkos#6801, bypassing Kokkos::BinSort, but that won't be released for a while. In the meantime, it may make sense to just have an option to build the atom map on the CPU instead of the GPU.

hagertnl commented on June 18, 2024

What would be the best way to do that? Last time I tried, I got errors saying that the Kokkos-enabled atom map must be used with the Kokkos pair styles and fixes, so I couldn't, for example, turn the suffix off around atom_style. I may have been doing something wrong.

So this looks like a Kokkos performance bug, not any issue with LAMMPS source?

hagertnl commented on June 18, 2024

Do you have an idea of the timeline for when that might be available to try out? We have an internal deadline approaching, so I might just need to revert to the older source from before the changes and cherry-pick some of the integer overflow fixes we need.

Thanks for all your help on this!

hagertnl commented on June 18, 2024

Or is there a patch we can put in place that forces the atom map to be on the CPU until a nice package option becomes available?

stanmoore1 commented on June 18, 2024

I will look at the code today.

stanmoore1 commented on June 18, 2024

@hagertnl just curious, when you use 32 nodes x 56 ranks per node what is your MPI decomposition output from LAMMPS? E.g. 2 by 2 by 2 MPI processor grid for 8 ranks.

hagertnl commented on June 18, 2024

@stanmoore1 8 by 14 by 16 MPI processor grid

stanmoore1 commented on June 18, 2024

I still don't understand why rank 0 has tags ranging from 1 to N with that decomposition at initialization. But I think the issue would be similar after running for a while, once atoms migrate around and a proc by chance gets atoms with both high and low tags, which seems to make Kokkos::BinSort run very inefficiently. There are some other Kokkos sorting options, but none support AMD/HIP yet (see kokkos/kokkos#6793). For Frontier, I think just reverting to the host code is the best temporary workaround. I will try to submit a PR in a bit.
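
To illustrate why a mix of high and low tags could hurt a binned sort, here is a toy sketch. It assumes equal-width bins spanning the [min, max] of the keys; that is an assumption made for illustration and not a description of the actual Kokkos::BinSort internals, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Count how many keys land in each of nbins equal-width bins spanning the
// [min, max] of the keys. Illustrative only; not the Kokkos implementation.
std::vector<int> bin_counts(const std::vector<std::int64_t>& keys, int nbins) {
  assert(!keys.empty() && nbins > 0);
  const auto mm = std::minmax_element(keys.begin(), keys.end());
  const std::int64_t min_key = *mm.first;
  const double width = static_cast<double>(*mm.second - min_key) / nbins;
  std::vector<int> counts(nbins, 0);
  for (std::int64_t k : keys) {
    const int b = (width > 0.0) ? static_cast<int>((k - min_key) / width) : 0;
    counts[std::min(b, nbins - 1)] += 1;  // clamp the max key into the last bin
  }
  return counts;
}

// Size of the fullest bin; any per-bin work scales with this.
int largest_bin(const std::vector<int>& counts) {
  return *std::max_element(counts.begin(), counts.end());
}
```

In this toy model, 1000 contiguous tags split into 100 bins land at about 10 per bin. Adding a single tag of 25769799246 stretches the range so far that all 1000 low tags collapse into one bin, and whatever work is done per bin then has to handle nearly the whole array at once.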

hagertnl commented on June 18, 2024

The decomposition at 4096 Summit nodes is 24 by 32 by 32 MPI processor grid, if that's helpful to know.

I didn't understand that either. I was wondering if it had something to do with the way the replicate worked out, where rank 0 owned the lowest-numbered atoms and happened to also own the highest-numbered atoms. But I hadn't looked into it that much.

For the context of our current needs, just forcing all that back to the CPU is good with me.
