Comments (14)

akohlmey avatar akohlmey commented on June 18, 2024 2

I also tried explicitly with atom_modify map array and got a crash:

This has to crash. You don't have enough RAM, and indexing with signed 32-bit integers will fail. LAMMPS defaults to map style array up to about 1 million atoms and then switches to map style hash, unless a user makes a different selection with the atom_modify command...

from lammps.

stanmoore1 avatar stanmoore1 commented on June 18, 2024 1

@hagertnl I think this is more a suboptimal use of Kokkos::BinSort, where it becomes inefficient because the huge difference between the min and max tags on rank 0 leads to large bins. Using sort_by_key would not have the same problem because there are no bins. I will need to modify the code to add a new package option to build the map on the CPU.

hagertnl avatar hagertnl commented on June 18, 2024

Here is a permalink to the code I referenced:

mapSorter.create_permute_vector(LMPDeviceType());
mapSorter.sort(LMPDeviceType(), d_tag_sorted, 0, nall);
mapSorter.sort(LMPDeviceType(), d_i_sorted, 0, nall);

I also did some debugging within this map_set method and found that the values of min and max are drastically different between rank 0 and every other rank -- rank 0's min/max is 1 and the largest atom tag in the global system, while every other rank seems to have a much smaller min/max range.
For example, in a run with 25 billion atoms spread across just 32 nodes x 56 ranks per node on Frontier (it fits in memory at this stage, though I'm not endorsing this as a good idea), which is 14 million atoms per rank, here are some reported min/max and nall values:

Rank 0: nall=14377536, nmax=15826944, min=1, max=25769799246
Rank 1: nall=14377536, nmax=15826944, min=1207960849, max=3618372585
Rank 2: nall=14377536, nmax=15826944, min=2818573585, max=5228985321

nall seems to determine the lengths of the maps being sorted, so I checked whether it was different for rank 0; it is not. But max-min for ranks 1 and 2 is about 10x smaller than for rank 0.

stanmoore1 avatar stanmoore1 commented on June 18, 2024

@hagertnl sorry to hear about the slowdown. Just to be sure, you are using a "hash" style atom map, correct? This looks like a bug; rank 0 should not be trying to sort 25 billion tags all by itself. I will take a look.

hagertnl avatar hagertnl commented on June 18, 2024

Hi @stanmoore1, no problem, I know this is an edge case compared to the most common LAMMPS usage. I typically let the atom map style default, and with the base system at 98k atoms, it sounds like it probably defaults to array.

That said, I re-ran the 4096-node GPU-Kokkos problem that took 420 seconds, this time with atom_modify map hash explicitly set after atom_style full, and it again took 420 seconds on the nose.

I also tried explicitly with atom_modify map array and got a crash:

(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorIllegalAddress): an illegal memory access was encountered ../../lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:154

stanmoore1 avatar stanmoore1 commented on June 18, 2024

This issue would be well fixed by kokkos/kokkos#6801 bypassing Kokkos::BinSort, but it won't be released for a while. In the meantime, it may make sense to just have an option to build the atom map on the CPU instead of GPU.

hagertnl avatar hagertnl commented on June 18, 2024

What would be the best way to do that? Last time I tried, I got errors saying that the Kokkos-enabled atom map must be used with the Kokkos pair styles and fixes, so I couldn't, for example, turn the suffix off around atom_style. I may have been doing something wrong.

So this looks like a Kokkos performance bug, not any issue with LAMMPS source?

hagertnl avatar hagertnl commented on June 18, 2024

Do you have an idea of when that might be available to try out? We have an internal deadline approaching, so I might just need to revert to the older source from before the changes and cherry-pick some of the integer overflow fixes we need.

Thanks for all your help on this!

hagertnl avatar hagertnl commented on June 18, 2024

Or, is there a patch we can put in place that forces the atom map to be on the CPU in the meantime until a nice package option becomes available?

stanmoore1 avatar stanmoore1 commented on June 18, 2024

I will look at the code today.

stanmoore1 avatar stanmoore1 commented on June 18, 2024

@hagertnl just curious, when you use 32 nodes x 56 ranks per node what is your MPI decomposition output from LAMMPS? E.g. 2 by 2 by 2 MPI processor grid for 8 ranks.

hagertnl avatar hagertnl commented on June 18, 2024

@stanmoore1 8 by 14 by 16 MPI processor grid

stanmoore1 avatar stanmoore1 commented on June 18, 2024

I still don't understand why rank 0 has tags ranging from 1 to N with that decomposition at initialization. But I think the issue would be similar after running for a while: atoms migrate around, and a proc can by chance end up with atoms with both high and low tags, which seems to make Kokkos::BinSort run very inefficiently. There are some other Kokkos sorting options, but none support AMD/HIP yet (see kokkos/kokkos#6793). For Frontier, I think just reverting to the host code is the best temporary workaround. I will try to submit a PR in a bit.

hagertnl avatar hagertnl commented on June 18, 2024

The decomposition at 4096 Summit nodes is 24 by 32 by 32 MPI processor grid, if that's helpful to know.

I didn't understand that either. I was wondering if it had something to do with the way the replicate worked out, where rank 0 owned the lowest numbered atoms and happened to also own the highest numbered atoms, too. But I hadn't looked into it that much.

For the context of our current needs, just forcing all that back to the CPU is good with me.
