Giter VIP home page Giter VIP logo

Comments (10)

garlick avatar garlick commented on September 26, 2024

The STAT graph is as follows (courtesy @lee218llnl)
Screenshot 2023-09-19 130905

ompi source ref (courtesy Tom)

from flux-core.

garlick avatar garlick commented on September 26, 2024

Hmm I seem to be able to reproduce this pretty consistently with a 3 node run of a simple MPI hello program.

[garlick@corona206:mpi]$ module list

Currently Loaded Modules:
  1) flux_wrappers/0.1       3) gcc/10.3.1-magic
  2) StdEnv            (S)   4) openmpi/4.1.2

  Where:
   S:  Module is Sticky, requires --force to unload or purge
[garlick@corona206:mpi]$ flux run -N3 -o verbose=2 ./hello
f2v32KFEw: completed MPI_Init in 1.379s.  There are 3 tasks
f2v32KFEw: completed first barrier in 0.017s
[snip]
 ^Cflux-job: one more ctrl-C within 2s to cancel or ctrl-Z to detach
^C293.877s: flux-shell[0]: DEBUG: signals: forwarding signal 15 to tasks
293.877s: flux-shell[1]: DEBUG: signals: forwarding signal 15 to tasks
293.877s: job.exception type=cancel severity=0 interrupted by ctrl-C
293.877s: flux-shell[2]: DEBUG: signals: forwarding signal 15 to tasks
293.913s: flux-shell[0]: TRACE: pmi-simple: 0: C: pmi EOF
293.915s: flux-shell[0]: DEBUG: task 0 complete status=143
293.915s: flux-shell[1]: TRACE: pmi-simple: 1: C: pmi EOF
293.918s: flux-shell[1]: DEBUG: task 1 complete status=143
293.927s: flux-shell[1]: DEBUG: exit 143
298.918s: flux-shell[2]: TRACE: pmi-simple: 2: C: pmi EOF
flux-job: task(s) exited with exit code 143

If I attach wtih gdb while it's in the hung state:

rank 0

(gdb) bt
#0  0x00001555544abe87 in epoll_wait () from /lib64/libc.so.6
#1  0x000015554a1e49d9 in ucs_event_set_wait () from /lib64/libucs.so.0
#2  0x00001555500f1e2b in uct_tcp_iface_progress () from /lib64/libuct.so.0
#3  0x000015554a673a2a in ucp_worker_progress () from /lib64/libucp.so.0
#4  0x000015555054d5c4 in opal_common_ucx_wait_all_requests ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#5  0x000015555054e091 in opal_common_ucx_del_procs_nofence ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#6  0x000015555054e0b9 in opal_common_ucx_del_procs ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#7  0x0000155550556889 in mca_pml_ucx_del_procs ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pml_ucx.so
#8  0x000015555546ad82 in ompi_mpi_finalize ()
   from /usr/tce/packages/openmpi/openmpi-4.1.2-gcc-10.3.1/lib/libmpi.so.40
#9  0x0000000000401151 in main (argc=<optimized out>, argv=<optimized out>)
    at mpi/hello.c:53

rank 1

(gdb) bt
#0  0x0000155554ad5ab4 in read () from /lib64/libpthread.so.0
#1  0x0000155552340ed2 in read (__nbytes=1, __buf=0x499370, __fd=17)
    at /usr/include/bits/unistd.h:36
#2  dgetline (fd=17, 
    buf=0x499370 "cmd=get_result rc=0 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCELYoGAA  -\n", len=1216) at dgetline.c:24
#3  0x000015555233f319 in pmi_simple_client_barrier (pmi=0x499330)
    at simple_client.c:167
#4  0x0000155552e23db9 in flux_fencenb ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pmix_flux.so
#5  0x000015555054defc in opal_common_ucx_mca_pmix_fence ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#6  0x0000155550556889 in mca_pml_ucx_del_procs ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pml_ucx.so
#7  0x000015555546ad82 in ompi_mpi_finalize ()
   from /usr/tce/packages/openmpi/openmpi-4.1.2-gcc-10.3.1/lib/libmpi.so.40
#8  0x0000000000401151 in main (argc=<optimized out>, argv=<optimized out>)
    at mpi/hello.c:53

rank 2

(gdb) bt
#0  0x0000155554ad5ab4 in read () from /lib64/libpthread.so.0
#1  0x0000155552340ed2 in read (__nbytes=1, __buf=0x4990c0, __fd=17)
    at /usr/include/bits/unistd.h:36
#2  dgetline (fd=17, 
    buf=0x4990c0 "cmd=get_result rc=0 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCE7bY9AA  -\n", len=1216) at dgetline.c:24
#3  0x000015555233f319 in pmi_simple_client_barrier (pmi=0x499080)
    at simple_client.c:167
#4  0x0000155552e23db9 in flux_fencenb ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pmix_flux.so
#5  0x000015555054defc in opal_common_ucx_mca_pmix_fence ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#6  0x0000155550556889 in mca_pml_ucx_del_procs ()
   from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pml_ucx.so
#7  0x000015555546ad82 in ompi_mpi_finalize ()
   from /usr/tce/packages/openmpi/openmpi-4.1.2-gcc-10.3.1/lib/libmpi.so.40
#8  0x0000000000401151 in main (argc=<optimized out>, argv=<optimized out>)
    at mpi/hello.c:53

from flux-core.

garlick avatar garlick commented on September 26, 2024

pmix trace of the same 3 node hello run but with -o pmi=pmix which works for some reason. Maybe there's some clue to be found here.

[garlick@corona206:mpi]$ flux run -o verbose=2 -o pmi=pmix -N3 ./hello
0.045s: flux-shell[2]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.045s: flux-shell[2]: TRACE: Successfully loaded flux.shell module
0.045s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.047s: flux-shell[2]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.048s: flux-shell[2]: DEBUG: pmix: server is enabled
0.048s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.051s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.051s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.052s: flux-shell[2]: DEBUG: mpibind: disabling cpu-affinity
0.052s: flux-shell[2]: DEBUG: mpibind: disabling gpu-affinity
0.052s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.053s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.053s: flux-shell[2]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.053s: flux-shell[2]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.053s: flux-shell[2]: DEBUG: pmix: jobid = 15511811260416
0.053s: flux-shell[2]: DEBUG: pmix: shell_rank = 2
0.053s: flux-shell[2]: DEBUG: pmix: local_nprocs = 1
0.053s: flux-shell[2]: DEBUG: pmix: total_nprocs = 3
0.058s: flux-shell[2]: DEBUG: pmix: local_peers = 2
0.058s: flux-shell[2]: DEBUG: pmix: node_map = corona206,corona207,corona214
0.058s: flux-shell[2]: DEBUG: pmix: proc_map = 0;1;2
0.059s: flux-shell[2]: DEBUG: mpibind: loaded topology from job shell
0.061s: flux-shell[2]: DEBUG: mpibind: flux given cores: 0-47
0.061s: flux-shell[2]: DEBUG: mpibind: 	derived pus: 0-95
0.061s: flux-shell[2]: DEBUG: mpibind: flux given gpus: 0-7
0.061s: flux-shell[2]: DEBUG: mpibind: total #cores: 48
0.061s: flux-shell[2]: DEBUG: mpibind: total #pus: 96
0.061s: flux-shell[2]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided 
0.061s: flux-shell[2]: DEBUG: mpibind: task  0: cpus 0-95
0.061s: flux-shell[2]: DEBUG: 2: task 2 on cores 0-47
0.064s: flux-shell[2]: DEBUG: mpibind: task  0: setting OMP_NUM_THREADS=96
0.064s: flux-shell[2]: DEBUG: mpibind: task  0: setting OMP_PLACES=threads
0.065s: flux-shell[2]: DEBUG: mpibind: task  0: setting OMP_PROC_BIND=spread
0.065s: flux-shell[2]: DEBUG: mpibind: task  0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.045s: flux-shell[1]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.046s: flux-shell[1]: TRACE: Successfully loaded flux.shell module
0.046s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.048s: flux-shell[1]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.049s: flux-shell[1]: DEBUG: pmix: server is enabled
0.049s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.053s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.053s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.054s: flux-shell[1]: DEBUG: mpibind: disabling cpu-affinity
0.054s: flux-shell[1]: DEBUG: mpibind: disabling gpu-affinity
0.054s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.054s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.054s: flux-shell[1]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.054s: flux-shell[1]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.054s: flux-shell[1]: DEBUG: pmix: jobid = 15511811260416
0.054s: flux-shell[1]: DEBUG: pmix: shell_rank = 1
0.054s: flux-shell[1]: DEBUG: pmix: local_nprocs = 1
0.054s: flux-shell[1]: DEBUG: pmix: total_nprocs = 3
0.060s: flux-shell[1]: DEBUG: pmix: local_peers = 1
0.060s: flux-shell[1]: DEBUG: pmix: node_map = corona206,corona207,corona214
0.060s: flux-shell[1]: DEBUG: pmix: proc_map = 0;1;2
0.061s: flux-shell[1]: DEBUG: mpibind: loaded topology from job shell
0.063s: flux-shell[1]: DEBUG: mpibind: flux given cores: 0-47
0.063s: flux-shell[1]: DEBUG: mpibind: 	derived pus: 0-95
0.063s: flux-shell[1]: DEBUG: mpibind: flux given gpus: 0-7
0.063s: flux-shell[1]: DEBUG: mpibind: total #cores: 48
0.063s: flux-shell[1]: DEBUG: mpibind: total #pus: 96
0.063s: flux-shell[1]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided 
0.063s: flux-shell[1]: DEBUG: mpibind: task  0: cpus 0-95
0.063s: flux-shell[1]: DEBUG: 1: task 1 on cores 0-47
0.064s: flux-shell[1]: DEBUG: mpibind: task  0: setting OMP_NUM_THREADS=96
0.065s: flux-shell[1]: DEBUG: mpibind: task  0: setting OMP_PLACES=threads
0.065s: flux-shell[1]: DEBUG: mpibind: task  0: setting OMP_PROC_BIND=spread
0.065s: flux-shell[1]: DEBUG: mpibind: task  0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.043s: flux-shell[0]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.043s: flux-shell[0]: TRACE: Successfully loaded flux.shell module
0.043s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.045s: flux-shell[0]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.046s: flux-shell[0]: DEBUG: pmix: server is enabled
0.046s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.050s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.050s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.051s: flux-shell[0]: DEBUG: mpibind: disabling cpu-affinity
0.051s: flux-shell[0]: DEBUG: mpibind: disabling gpu-affinity
0.051s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.051s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.052s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.052s: flux-shell[0]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.052s: flux-shell[0]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.052s: flux-shell[0]: DEBUG: pmix: jobid = 15511811260416
0.052s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.052s: flux-shell[0]: DEBUG: pmix: local_nprocs = 1
0.052s: flux-shell[0]: DEBUG: pmix: total_nprocs = 3
0.052s: flux-shell[0]: DEBUG: pmix: server outsourced to 3.2.5a1
0.058s: flux-shell[0]: DEBUG: pmix: local_peers = 0
0.058s: flux-shell[0]: DEBUG: pmix: node_map = corona206,corona207,corona214
0.058s: flux-shell[0]: DEBUG: pmix: proc_map = 0;1;2
0.059s: flux-shell[0]: DEBUG: mpibind: loaded topology from job shell
0.061s: flux-shell[0]: DEBUG: mpibind: flux given cores: 0-47
0.061s: flux-shell[0]: DEBUG: mpibind: 	derived pus: 0-95
0.061s: flux-shell[0]: DEBUG: mpibind: flux given gpus: 0-7
0.061s: flux-shell[0]: DEBUG: mpibind: total #cores: 48
0.061s: flux-shell[0]: DEBUG: mpibind: total #pus: 96
0.061s: flux-shell[0]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided 
0.061s: flux-shell[0]: DEBUG: mpibind: task  0: cpus 0-95
0.061s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=3 cores_per_slot=1 slots_per_node=1
0.061s: flux-shell[0]: DEBUG: 0: task 0 on cores 0-47
0.065s: flux-shell[0]: DEBUG: mpibind: task  0: setting OMP_NUM_THREADS=96
0.065s: flux-shell[0]: DEBUG: mpibind: task  0: setting OMP_PLACES=threads
0.065s: flux-shell[0]: DEBUG: mpibind: task  0: setting OMP_PROC_BIND=spread
0.065s: flux-shell[0]: DEBUG: mpibind: task  0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
1.424s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"AAAAAQAPAAAAAAAAAbAAAAABAQAAAAEADwAAAAAAAAGdAAAAAQAAAAEAAAABAAAAD2J0bC5vcGVuaWIuNC4xAAAbAA8AAAAAAAAANgEAAAAAAACA/lsAAAAFAAAAyQIAABsQAAAAAQM/GACbAABbAAEA/oAAAAAAAAAMQqEDAFqtvAAAAAEAAAAMbXRsLm9maS40LjEAABsADwAAAAAAAAAgA1ybAFsAAAAAAAAAAACA/rytWgADoUIMAAAAAAAAAAAAAAABAAAADHBtbC51Y3guNC4xAAAbAA8AAAAAAAAA3CBmaBrU154oJCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAgoy/IQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCxHshBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKC1ySDAFsAfjl3zCsz70g4UNsqMTUywyMABV6bADEEJ4qwD6Ez70g4ULAPITUy3yIAAHrUd8wrM+9IOFCwDyE1Mt8iAAEAMdFcfOEz70g4UEUdKTUyAycAA2qbALMPd8yrM+9IOFBFHSk1MgMnAINtmwA=","cbfunc":23455932359280,"cbdata":23455756017600}
1.424s: flux-shell[1]: TRACE: pmix: starting pmix exchange 0: size 446
1.440s: flux-shell[2]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"AAAAAQAPAAAAAAAAAbAAAAABAQAAAAEADwAAAAAAAAGdAAAAAQAAAAIAAAABAAAAD2J0bC5vcGVuaWIuNC4xAAAbAA8AAAAAAAAANgEAAAAAAACA/mIAAAAFAAAAyQIAABsQAAAAAQM/GAJmAQBiAAEA/oAAAAAAAAAMQqEDAFqs4AAAAAEAAAAMbXRsLm9maS40LjEAABsADwAAAAAAAAAgA15mAWIAAAAAAAAAAACA/uCsWgADoUIMAAAAAAAAAAAAAAABAAAADHBtbC51Y3guNC4xAAAbAA8AAAAAAAAA3CCYh16fZtn0WiEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgot7IQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCC7hEhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKOJySDAGIAfjl3zCsz70g4UNsqMTUywyMABWFmAQgEJ4qwD6Ez70g4ULAPITUy3yIAAHrUd8wrM+9IOFCwDyE1Mt8iAAEAMdFcfOEz70g4UEUdKTUyAycAA2tmAbMPd8yrM+9IOFBFHSk1MgMnAINvZgE=","cbfunc":23455932359280,"cbdata":23455756017600}
1.440s: flux-shell[2]: TRACE: pmix: starting pmix exchange 0: size 446
1.445s: flux-shell[2]: TRACE: pmix: completed pmix exchange 0: size 1375 SUCCESS
1.444s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"AAAAAQAPAAAAAAAAAdUAAAABAQAAAAEADwAAAAAAAAHCAAAAAQAAAAAAAAABAAAAD2J0bC5vcGVuaWIuNC4xAAAbAA8AAAAAAAAANgEAAAAAAACA/oEAAAAFAAAAyQIAABsQAAAAAQM/GEBuAACBAAEA/oAAAAAAAAAMQqEDAFqtwAAAAAEAAAAMbXRsLm9maS40LjEAABsADwAAAAAAAAAgA5xuAIEAAAAAAAAAAACA/sCtWgADoUIMAAAAAAAAAAAAAAABAAAADHBtbC51Y3guNC4xAAAbAA8AAAAAAAAA3CA0gG7ZUEHjqSEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgp49IQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCphkhEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKBeySDAIEAfjl3zCsz70g4UNsqMTUywyMABZ5uABgEJ4qwD6Ez70g4ULAPITUy3yIAAHrUd8wrM+9IOFCwDyE1Mt8iAAEAMdFcfOEz70g4UEUdKTUyAycAA6puALMPd8yrM+9IOFBFHSk1MgMnAIOtbgAAAAABAAAADXBtbC5iYXNlLjIuMAAAGwAPAAAAAAAAAAR1Y3gA","cbfunc":23455932359280,"cbdata":23455756017600}
1.444s: flux-shell[0]: TRACE: pmix: starting pmix exchange 0: size 483
1.445s: flux-shell[0]: TRACE: pmix: completed pmix exchange 0: size 1375 SUCCESS
1.445s: flux-shell[1]: TRACE: pmix: completed pmix exchange 0: size 1375 SUCCESS
1.459s: flux-shell[2]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.459s: flux-shell[2]: TRACE: pmix: starting pmix exchange 1: size 0
1.460s: flux-shell[2]: TRACE: pmix: completed pmix exchange 1: size 0 SUCCESS
1.459s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.459s: flux-shell[1]: TRACE: pmix: starting pmix exchange 1: size 0
1.460s: flux-shell[1]: TRACE: pmix: completed pmix exchange 1: size 0 SUCCESS
1.460s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.460s: flux-shell[0]: TRACE: pmix: starting pmix exchange 1: size 0
1.460s: flux-shell[0]: TRACE: pmix: completed pmix exchange 1: size 0 SUCCESS
1.472s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.472s: flux-shell[1]: TRACE: pmix: starting pmix exchange 2: size 0
1.476s: flux-shell[1]: TRACE: pmix: completed pmix exchange 2: size 0 SUCCESS
1.476s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.476s: flux-shell[1]: TRACE: pmix: starting pmix exchange 3: size 0
1.476s: flux-shell[1]: TRACE: pmix: completed pmix exchange 3: size 0 SUCCESS
1.475s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.475s: flux-shell[0]: TRACE: pmix: starting pmix exchange 2: size 0
1.476s: flux-shell[0]: TRACE: pmix: completed pmix exchange 2: size 0 SUCCESS
1.476s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.476s: flux-shell[0]: TRACE: pmix: starting pmix exchange 3: size 0
1.476s: flux-shell[0]: TRACE: pmix: completed pmix exchange 3: size 0 SUCCESS
1.475s: flux-shell[2]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.475s: flux-shell[2]: TRACE: pmix: starting pmix exchange 2: size 0
1.476s: flux-shell[2]: TRACE: pmix: completed pmix exchange 2: size 0 SUCCESS
1.476s: flux-shell[2]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.476s: flux-shell[2]: TRACE: pmix: starting pmix exchange 3: size 0
1.476s: flux-shell[2]: TRACE: pmix: completed pmix exchange 3: size 0 SUCCESS
1.560s: flux-shell[1]: DEBUG: task 1 complete status=0
1.560s: flux-shell[2]: DEBUG: task 2 complete status=0
1.570s: flux-shell[1]: DEBUG: exit 0
1.571s: flux-shell[2]: DEBUG: exit 0
1.562s: flux-shell[0]: DEBUG: task 0 complete status=0
f82UCeKLK: completed MPI_Init in 1.385s.  There are 3 tasks
f82UCeKLK: completed first barrier in 0.009s
f82UCeKLK: completed MPI_Finalize in 0.083s
1.573s: flux-shell[0]: DEBUG: exit 0

from flux-core.

garlick avatar garlick commented on September 26, 2024

And the failing one with simple pmi

[garlick@corona206:mpi]$ flux run --label-io -o verbose=2 -N3 ./hello
0.066s: flux-shell[2]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.066s: flux-shell[2]: TRACE: Successfully loaded flux.shell module
0.066s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.069s: flux-shell[2]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.070s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.073s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.073s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.077s: flux-shell[2]: DEBUG: mpibind: disabling cpu-affinity
0.077s: flux-shell[2]: DEBUG: mpibind: disabling gpu-affinity
0.077s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.077s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.077s: flux-shell[2]: DEBUG: pmi-simple: simple wire protocol is enabled
0.077s: flux-shell[2]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.077s: flux-shell[2]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.078s: flux-shell[2]: DEBUG: mpibind: loaded topology from job shell
0.080s: flux-shell[2]: DEBUG: mpibind: flux given cores: 0-47
0.080s: flux-shell[2]: DEBUG: mpibind: 	derived pus: 0-95
0.080s: flux-shell[2]: DEBUG: mpibind: flux given gpus: 0-7
0.080s: flux-shell[2]: DEBUG: mpibind: total #cores: 48
0.080s: flux-shell[2]: DEBUG: mpibind: total #pus: 96
0.080s: flux-shell[2]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided 
0.080s: flux-shell[2]: DEBUG: mpibind: task  0: cpus 0-95
0.080s: flux-shell[2]: DEBUG: 2: task 2 on cores 0-47
0.083s: flux-shell[2]: DEBUG: mpibind: task  0: setting OMP_NUM_THREADS=96
0.084s: flux-shell[2]: DEBUG: mpibind: task  0: setting OMP_PLACES=threads
0.084s: flux-shell[2]: DEBUG: mpibind: task  0: setting OMP_PROC_BIND=spread
0.084s: flux-shell[2]: DEBUG: mpibind: task  0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.068s: flux-shell[1]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.068s: flux-shell[1]: TRACE: Successfully loaded flux.shell module
0.068s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.071s: flux-shell[1]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.072s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.076s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.076s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.079s: flux-shell[1]: DEBUG: mpibind: disabling cpu-affinity
0.079s: flux-shell[1]: DEBUG: mpibind: disabling gpu-affinity
0.079s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.079s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.079s: flux-shell[1]: DEBUG: pmi-simple: simple wire protocol is enabled
0.079s: flux-shell[1]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.080s: flux-shell[1]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.080s: flux-shell[1]: DEBUG: mpibind: loaded topology from job shell
0.082s: flux-shell[1]: DEBUG: mpibind: flux given cores: 0-47
0.082s: flux-shell[1]: DEBUG: mpibind: 	derived pus: 0-95
0.082s: flux-shell[1]: DEBUG: mpibind: flux given gpus: 0-7
0.082s: flux-shell[1]: DEBUG: mpibind: total #cores: 48
0.082s: flux-shell[1]: DEBUG: mpibind: total #pus: 96
0.082s: flux-shell[1]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided 
0.082s: flux-shell[1]: DEBUG: mpibind: task  0: cpus 0-95
0.082s: flux-shell[1]: DEBUG: 1: task 1 on cores 0-47
0.083s: flux-shell[1]: DEBUG: mpibind: task  0: setting OMP_NUM_THREADS=96
0.083s: flux-shell[1]: DEBUG: mpibind: task  0: setting OMP_PLACES=threads
0.083s: flux-shell[1]: DEBUG: mpibind: task  0: setting OMP_PROC_BIND=spread
0.083s: flux-shell[1]: DEBUG: mpibind: task  0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.043s: flux-shell[0]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.043s: flux-shell[0]: TRACE: Successfully loaded flux.shell module
0.043s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.046s: flux-shell[0]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.047s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.050s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.050s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.054s: flux-shell[0]: DEBUG: mpibind: disabling cpu-affinity
0.054s: flux-shell[0]: DEBUG: mpibind: disabling gpu-affinity
0.054s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.054s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.054s: flux-shell[0]: DEBUG: pmi-simple: simple wire protocol is enabled
0.054s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.055s: flux-shell[0]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.055s: flux-shell[0]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.056s: flux-shell[0]: DEBUG: mpibind: loaded topology from job shell
0.057s: flux-shell[0]: DEBUG: mpibind: flux given cores: 0-47
0.058s: flux-shell[0]: DEBUG: mpibind: 	derived pus: 0-95
0.058s: flux-shell[0]: DEBUG: mpibind: flux given gpus: 0-7
0.058s: flux-shell[0]: DEBUG: mpibind: total #cores: 48
0.058s: flux-shell[0]: DEBUG: mpibind: total #pus: 96
0.058s: flux-shell[0]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided 
0.058s: flux-shell[0]: DEBUG: mpibind: task  0: cpus 0-95
0.058s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=3 cores_per_slot=1 slots_per_node=1
0.058s: flux-shell[0]: DEBUG: 0: task 0 on cores 0-47
0.084s: flux-shell[0]: DEBUG: mpibind: task  0: setting OMP_NUM_THREADS=96
0.084s: flux-shell[0]: DEBUG: mpibind: task  0: setting OMP_PLACES=threads
0.084s: flux-shell[0]: DEBUG: mpibind: task  0: setting OMP_PROC_BIND=spread
0.084s: flux-shell[0]: DEBUG: mpibind: task  0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.129s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=init pmi_version=1 pmi_subversion=1
0.129s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0.129s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_maxes
0.129s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_my_kvsname
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_my_kvsname
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=flux.taskmap
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value={"version":1,"map":[[0,3,1,1]]}
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_universe_size
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=universe_size rc=0 size=3
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_appnum
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=appnum rc=0 appnum=0
0.133s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key0
0.133s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=4
0.130s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=init pmi_version=1 pmi_subversion=1
0.130s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0.130s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_maxes
0.130s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_my_kvsname
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_my_kvsname
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=flux.taskmap
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value={"version":1,"map":[[0,3,1,1]]}
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_universe_size
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=universe_size rc=0 size=3
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_appnum
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=appnum rc=0 appnum=0
0.134s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key0
0.134s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=4
0.131s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=init pmi_version=1 pmi_subversion=1
0.131s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0.131s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_maxes
0.131s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_my_kvsname
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_my_kvsname
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=flux.taskmap
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value={"version":1,"map":[[0,3,1,1]]}
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_universe_size
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=universe_size rc=0 size=3
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_appnum
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=appnum rc=0 appnum=0
0.135s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key0
0.135s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=4
1.427s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=put kvsname=fEQLAnHv7 key=0-0-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDYAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP6BAAAABQAAAMkCAAAbEAAAAAEDPxgkbwAAgQABAP6AAAAAAAAADEKhAwBarcBtdGwub2ZpLjQuMQAxNAAwMDIwAAOAbwCBAAAAAAAAAAAAgP7ArVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgWUDHX2lXpu8hBgACwKiAzs8ZF7dROAoaLlAjra42ABMjAIKIySEGAALAqEDOzxkXt1E4v1PiTHbPwTYBEyMAgosNIRIBAm8Rsgo9cBu+mAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCnLckgwCBAH45d8wrM+9IOFDbKjE1MsMjAAWDbwAYBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAAOObwCzD3fMqzPvSDhQRR0pNTIDJwCDkW8AcG1sLnVjeC40LjEAMTQAMDE2NAAgWUDHX2lXpu8gAGN1d8wrMgD4108AAAAAAE8DAIihSrGsXNBN3yEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgojJIQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCiw0hEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKct0IIbxGyCj1wGz6NU3fMKzIAAHBQd8yrMwBPIwCIHwAHAAAAAABDCG8Rsgo9cBs+p9N3zCsyAABwUHfMqzMATyMAiAixP8AJAADAJAMAgQB+OXfMKzPvSDhQ2yoxNTLDIwAFg28AGAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADjm8Asw93zKsz70g4UEUdKTUyAycAg5FvACW
1.427s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=put_result rc=0
1.434s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=put kvsname=fEQLAnHv7 key=0-0-key1 value=IbxGyCj1wGz5Hi703BjYAJC5Qd8yrMwCSIQCECLE/AHBtbC5iYXNlLjIuMAAxNAAwMDA0AHVjeAA -
1.434s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=put_result rc=0
1.434s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=barrier_in
1.452s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fEQLAnHv7 key=0-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.452s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0
1.462s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fEQLAnHv7 key=0-1-key1 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCELLk9AA  -
1.462s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0
1.460s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=put kvsname=fEQLAnHv7 key=0-2-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMTQAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5iAAAABQAAAMkCAAAbEAAAAAEDPxjkZgEAYgABAP6AAAAAAAAADEKhAwBarOBtdGwub2ZpLjQuMQAxNAAwMDIwAANAZwFiAAAAAAAAAAAAgP7grFoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgTyjnvBJrLgchBgACwKiA1s8ZF7dROAoaLlAjra42ABMjAIKycSEGAALAqEDWzxkXt1E4v1PiTHbPwTYBEyMAgqazIRIBAszr2at66x2CmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCyYEkgwBiAH45d8wrM+9IOFDbKjE1MsMjAAVCZwEIBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANOZwGzD3fMqzPvSDhQRR0pNTIDJwCDUWcBcG1sLnVjeC40LjEAMTQAMDE2NAAgTyjnvBJrLgcgAGN1d8wrMgD4108AAAAAAE8DAIi9VzmKHgS14yEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgrJxIQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCCprMhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILJgUIIzOvZq3rrHQKNU3fMKzIAAHBQd8yrMwBPIwCIMABuAAAAAABDCMzr2at66x0Cp9N3zCsyAABwUHfMqzMATyMAiH6MBsAJAADAJAMAYgB+OXfMKzPvSDhQ2yoxNTLDIwAFQmcBCAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTmcBsw93zKsz70g4UEUdKTUyAycAg1FnASW
1.460s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=put_result rc=0
1.466s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=put kvsname=fEQLAnHv7 key=0-2-key1 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCEfowGAA  -
1.466s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=put_result rc=0
1.467s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=barrier_in
1.467s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=barrier_out rc=0
1.468s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key0
1.468s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMTQAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5iAAAABQAAAMkCAAAbEAAAAAEDPxjkZgEAYgABAP6AAAAAAAAADEKhAwBarOBtdGwub2ZpLjQuMQAxNAAwMDIwAANAZwFiAAAAAAAAAAAAgP7grFoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgTyjnvBJrLgchBgACwKiA1s8ZF7dROAoaLlAjra42ABMjAIKycSEGAALAqEDWzxkXt1E4v1PiTHbPwTYBEyMAgqazIRIBAszr2at66x2CmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCyYEkgwBiAH45d8wrM+9IOFDbKjE1MsMjAAVCZwEIBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANOZwGzD3fMqzPvSDhQRR0pNTIDJwCDUWcBcG1sLnVjeC40LjEAMTQAMDE2NAAgTyjnvBJrLgcgAGN1d8wrMgD4108AAAAAAE8DAIi9VzmKHgS14yEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgrJxIQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCCprMhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILJgUIIzOvZq3rrHQKNU3fMKzIAAHBQd8yrMwBPIwCIMABuAAAAAABDCMzr2at66x0Cp9N3zCsyAABwUHfMqzMATyMAiH6MBsAJAADAJAMAYgB+OXfMKzPvSDhQ2yoxNTLDIwAFQmcBCAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTmcBsw93zKsz70g4UEUdKTUyAycAg1FnASW
1.468s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key1
1.468s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCEfowGAA  -
1.462s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=barrier_in
1.467s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=barrier_out rc=0
1.468s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key0
1.468s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.468s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key1
1.468s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCELLk9AA  -
1.467s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=barrier_out rc=0
1.467s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key0
1.467s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDYAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP6BAAAABQAAAMkCAAAbEAAAAAEDPxgkbwAAgQABAP6AAAAAAAAADEKhAwBarcBtdGwub2ZpLjQuMQAxNAAwMDIwAAOAbwCBAAAAAAAAAAAAgP7ArVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgWUDHX2lXpu8hBgACwKiAzs8ZF7dROAoaLlAjra42ABMjAIKIySEGAALAqEDOzxkXt1E4v1PiTHbPwTYBEyMAgosNIRIBAm8Rsgo9cBu+mAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCnLckgwCBAH45d8wrM+9IOFDbKjE1MsMjAAWDbwAYBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAAOObwCzD3fMqzPvSDhQRR0pNTIDJwCDkW8AcG1sLnVjeC40LjEAMTQAMDE2NAAgWUDHX2lXpu8gAGN1d8wrMgD4108AAAAAAE8DAIihSrGsXNBN3yEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgojJIQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCiw0hEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKct0IIbxGyCj1wGz6NU3fMKzIAAHBQd8yrMwBPIwCIHwAHAAAAAABDCG8Rsgo9cBs+p9N3zCsyAABwUHfMqzMATyMAiAixP8AJAADAJAMAgQB+OXfMKzPvSDhQ2yoxNTLDIwAFg28AGAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADjm8Asw93zKsz70g4UEUdKTUyAycAg5FvACW
1.468s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key1
1.468s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=IbxGyCj1wGz5Hi703BjYAJC5Qd8yrMwCSIQCECLE/AHBtbC5iYXNlLjIuMAAxNAAwMDA0AHVjeAA -
1.482s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=barrier_in
1.483s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=barrier_out rc=0
1.483s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key0
1.483s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDYAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP6BAAAABQAAAMkCAAAbEAAAAAEDPxgkbwAAgQABAP6AAAAAAAAADEKhAwBarcBtdGwub2ZpLjQuMQAxNAAwMDIwAAOAbwCBAAAAAAAAAAAAgP7ArVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgWUDHX2lXpu8hBgACwKiAzs8ZF7dROAoaLlAjra42ABMjAIKIySEGAALAqEDOzxkXt1E4v1PiTHbPwTYBEyMAgosNIRIBAm8Rsgo9cBu+mAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCnLckgwCBAH45d8wrM+9IOFDbKjE1MsMjAAWDbwAYBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAAOObwCzD3fMqzPvSDhQRR0pNTIDJwCDkW8AcG1sLnVjeC40LjEAMTQAMDE2NAAgWUDHX2lXpu8gAGN1d8wrMgD4108AAAAAAE8DAIihSrGsXNBN3yEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgojJIQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCiw0hEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKct0IIbxGyCj1wGz6NU3fMKzIAAHBQd8yrMwBPIwCIHwAHAAAAAABDCG8Rsgo9cBs+p9N3zCsyAABwUHfMqzMATyMAiAixP8AJAADAJAMAgQB+OXfMKzPvSDhQ2yoxNTLDIwAFg28AGAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADjm8Asw93zKsz70g4UEUdKTUyAycAg5FvACW
1.484s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key1
1.484s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=IbxGyCj1wGz5Hi703BjYAJC5Qd8yrMwCSIQCECLE/AHBtbC5iYXNlLjIuMAAxNAAwMDA0AHVjeAA -
1.482s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=barrier_in
1.483s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=barrier_out rc=0
1.482s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=barrier_in
1.483s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=barrier_out rc=0
1.483s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key0
1.483s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDYAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP6BAAAABQAAAMkCAAAbEAAAAAEDPxgkbwAAgQABAP6AAAAAAAAADEKhAwBarcBtdGwub2ZpLjQuMQAxNAAwMDIwAAOAbwCBAAAAAAAAAAAAgP7ArVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgWUDHX2lXpu8hBgACwKiAzs8ZF7dROAoaLlAjra42ABMjAIKIySEGAALAqEDOzxkXt1E4v1PiTHbPwTYBEyMAgosNIRIBAm8Rsgo9cBu+mAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCnLckgwCBAH45d8wrM+9IOFDbKjE1MsMjAAWDbwAYBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAAOObwCzD3fMqzPvSDhQRR0pNTIDJwCDkW8AcG1sLnVjeC40LjEAMTQAMDE2NAAgWUDHX2lXpu8gAGN1d8wrMgD4108AAAAAAE8DAIihSrGsXNBN3yEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgojJIQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCiw0hEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKct0IIbxGyCj1wGz6NU3fMKzIAAHBQd8yrMwBPIwCIHwAHAAAAAABDCG8Rsgo9cBs+p9N3zCsyAABwUHfMqzMATyMAiAixP8AJAADAJAMAgQB+OXfMKzPvSDhQ2yoxNTLDIwAFg28AGAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADjm8Asw93zKsz70g4UEUdKTUyAycAg5FvACW
1.483s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key1
1.484s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=IbxGyCj1wGz5Hi703BjYAJC5Qd8yrMwCSIQCECLE/AHBtbC5iYXNlLjIuMAAxNAAwMDA0AHVjeAA -
1.493s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key0
1.493s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.494s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key1
1.494s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCELLk9AA  -
1.496s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key0
1.496s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMTQAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5iAAAABQAAAMkCAAAbEAAAAAEDPxjkZgEAYgABAP6AAAAAAAAADEKhAwBarOBtdGwub2ZpLjQuMQAxNAAwMDIwAANAZwFiAAAAAAAAAAAAgP7grFoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgTyjnvBJrLgchBgACwKiA1s8ZF7dROAoaLlAjra42ABMjAIKycSEGAALAqEDWzxkXt1E4v1PiTHbPwTYBEyMAgqazIRIBAszr2at66x2CmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCyYEkgwBiAH45d8wrM+9IOFDbKjE1MsMjAAVCZwEIBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANOZwGzD3fMqzPvSDhQRR0pNTIDJwCDUWcBcG1sLnVjeC40LjEAMTQAMDE2NAAgTyjnvBJrLgcgAGN1d8wrMgD4108AAAAAAE8DAIi9VzmKHgS14yEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgrJxIQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCCprMhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILJgUIIzOvZq3rrHQKNU3fMKzIAAHBQd8yrMwBPIwCIMABuAAAAAABDCMzr2at66x0Cp9N3zCsyAABwUHfMqzMATyMAiH6MBsAJAADAJAMAYgB+OXfMKzPvSDhQ2yoxNTLDIwAFQmcBCAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTmcBsw93zKsz70g4UEUdKTUyAycAg1FnASW
1.496s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key1
1.496s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCEfowGAA  -
1.499s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=barrier_in
1.499s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=barrier_out rc=0
1.496s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=barrier_in
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=barrier_out rc=0
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key0
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMTQAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5iAAAABQAAAMkCAAAbEAAAAAEDPxjkZgEAYgABAP6AAAAAAAAADEKhAwBarOBtdGwub2ZpLjQuMQAxNAAwMDIwAANAZwFiAAAAAAAAAAAAgP7grFoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgTyjnvBJrLgchBgACwKiA1s8ZF7dROAoaLlAjra42ABMjAIKycSEGAALAqEDWzxkXt1E4v1PiTHbPwTYBEyMAgqazIRIBAszr2at66x2CmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCyYEkgwBiAH45d8wrM+9IOFDbKjE1MsMjAAVCZwEIBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANOZwGzD3fMqzPvSDhQRR0pNTIDJwCDUWcBcG1sLnVjeC40LjEAMTQAMDE2NAAgTyjnvBJrLgcgAGN1d8wrMgD4108AAAAAAE8DAIi9VzmKHgS14yEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgrJxIQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCCprMhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILJgUIIzOvZq3rrHQKNU3fMKzIAAHBQd8yrMwBPIwCIMABuAAAAAABDCMzr2at66x0Cp9N3zCsyAABwUHfMqzMATyMAiH6MBsAJAADAJAMAYgB+OXfMKzPvSDhQ2yoxNTLDIwAFQmcBCAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTmcBsw93zKsz70g4UEUdKTUyAycAg1FnASW
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key1
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCEfowGAA  -
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=barrier_in
1.499s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=barrier_in
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=barrier_out rc=0
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key0
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key1
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCELLk9AA  -
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=barrier_in
0: fEQLAnHv7: completed MPI_Init in 1.379s.  There are 3 tasks
0: fEQLAnHv7: completed first barrier in 0.016s
Sep 20 14:36:09.614316 broker.err[0]: rc2.0: /bin/bash Hangup (rc=129) 1796.2s
1800.980s: job.exception type=timeout severity=0 resource allocation expired
[detached: session exiting]

from flux-core.

garlick avatar garlick commented on September 26, 2024

Ah one red flag. The values being retrieved from the PMI KVS are exactly 1024 bytes in length. In simple PMI we have

#define SIMPLE_KVS_VAL_MAX         1024

Looking back to the puts, I see values of size 1053. Take this example, and note that the return code is 0 (success).

1.452s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fEQLAnHv7 key=0-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.452s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0

from flux-core.

grondo avatar grondo commented on September 26, 2024

Wow, nice find!

from flux-core.

garlick avatar garlick commented on September 26, 2024

Eh except I must've shoved too much into wc when I got 1053. The puts are actually logged as 1024 also. There might still be truncation elsewhere though as that's a mighty suspicious number.

from flux-core.

garlick avatar garlick commented on September 26, 2024

I'm likely barking up the wrong tree here. I think what openmpi is doing is checking the value of the max value and splitting up large values over multiple keys. For example on successful runs it commonly does stuff like this where the first value is 1024 (the max) and the second one is less.

2.176s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fFTnifWX key=0-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTQ3AHBtaXguaG5hbWUAMDMAMDAwOQBjb3JvbmE4MgBidGwub3BlbmliLjQuMQAxNAAwMDUwAAIAAAAAAACA/vIAAAAFAAAAyQIAABsQAAAAAQM/GArTAADyAAEA/oAAAAAAAACYA5sDAGevDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbXRsLm9maS40LjEAMTQAMDAyMAADHdQA8gAAAAAAAAAAAID+DK9nAAObA5gAAAAAAAAAAHBtbC51Y3guNC4xADE0ADAxNzkAIKuI7ZFSG8YuIRIBAi62t6Wf/VHdmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCC4lUhBgACwKiAUs8ZF7dROAoaLlAjra42ARMjAILnryEGAAKGCUcUzxkXt1E4bdGwUGaUrjYAEyMAgsPXIQYAAsCoQFLPGRe3UTi/U+JMds/BNgETIwCCvOUkAwDyAH45d8wrM+9IOFDbKjE1MsMjAAUz1AAzBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANO1ACzD3fMqzPvSDhQRR0pNTIDJwCDUdQAJdICSgAAAAAAAAAAAAD//4YJRxQDfjl3zCszXSBSUMDaZjUywyMABYrkAHAEJ4qwD6EzXSBSUJW/VjUy3yIAAHrUd8wrM10gUlCVv1Y1Mt8iAAEAMdFcfOEzXSBSUCrNXjUyAycAA7rkALMPd8yrM10gUlAqzV41MgMnAIO95ABwbWwudWN4LjQuMQAxNAAwMjAxACCriO2RUhvGLiAAY3V3zCsyAPjXTwAAAAAATwMAiOtsPc4eezIuIRIBAi62t6Wf/VHdmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCC4lUhBgACwKiAUs8ZF7dROAoaLlAjra42ARMjAILnryEGAAKGCUcUzxkXt1E4bdGwUGaUrjYAEyMAgsPXIQYAAsC
2.176s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0
2.183s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fFTnifWX key=0-1-key1 value=oQFLPGRe3UTi/U+JMds/BNgETIwCCvOVCCC62t6Wf/VFdjVN3zCsyAABwUHfMqzMATyMAiAYAKAAAAAAAQwgutreln/1RXafTd8wrMgAAcFB3zKszAE8jAIhOchdADAAAwCQDAPIAfjl3zCsz70g4UNsqMTUywyMABTPUADMEJ4qwD6Ez70g4ULAPITUy3yIAAHrUd8wrM+9IOFCwDyE1Mt8iAAEAMdFcfOEz70g4UEUdKTUyAycAA07UALMPd8yrM+9IOFBFHSk1MgMnAINR1AAlUgJKAAAAAAAAAAAAAP//hglHFAN+OXfMKzNdIFJQwNpmNTLDIwAFiuQAcAQnirAPoTNdIFJQlb9WNTLfIgAAetR3zCszXSBSUJW/VjUy3yIAAQAx0Vx84TNdIFJQKs1eNTIDJwADuuQAsw93zKszXSBSUCrNXjUyAycAg73kACaILra3pZ/9UV1Hi703BjYAJC5Qd8yrMwCSIQCETnIXAA  -
2.183s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0

from flux-core.

garlick avatar garlick commented on September 26, 2024

Looking at the ompi source referenced above it seems that opal_common_ucx_del_procs() first calls opal_common_ucx_del_procs_nofence(), then calls the opal_common_ucx_mca_pmix_fence() (which executes the pmi barrier). Traces show one rank that made it all the way to the PMI barrier, and the other ranks stuck in opal_common_ucx_del_procs_nofence() -> opal_common_ucx_wait_all_requests().

So basically if the latter ever finishes, the straggler rank will enter the PMI barrier.

Turning on some debug and reproducing the hang (and snipping out all the ucx debug except the disconnect stuff):

$ flux run --label-io -N3 --env=OMPI_MCA_pml_ucx_verbose=3 ./hello
0: f2AZvtYab: completed MPI_Init in 1.384s.  There are 3 tasks
0: f2AZvtYab: completed first barrier in 0.015s
0: [corona188:3967920] common_ucx.c:440 disconnecting from rank 0
0: [corona188:3967920] common_ucx.c:440 disconnecting from rank 1
0: [corona188:3967920] common_ucx.c:404 waiting for 1 disconnect requests
2: [corona198:65489] common_ucx.c:440 disconnecting from rank 2
2: [corona198:65489] common_ucx.c:440 disconnecting from rank 0
0: [corona188:3967920] common_ucx.c:440 disconnecting from rank 2
2: [corona198:65489] common_ucx.c:404 waiting for 1 disconnect requests
0: [corona188:3967920] common_ucx.c:404 waiting for 1 disconnect requests
1: [corona189:3811294] common_ucx.c:440 disconnecting from rank 0
2: [corona198:65489] common_ucx.c:404 waiting for 0 disconnect requests
1: [corona189:3811294] common_ucx.c:404 waiting for 1 disconnect requests
1: [corona189:3811294] common_ucx.c:440 disconnecting from rank 1
1: [corona189:3811294] common_ucx.c:404 waiting for 0 disconnect requests
[hang]

Looking at the ranks with gdb:

  • rank 0 is stuck in opal_common_ucx_wait_all_requests ()
  • rank 1 is stuck in pmi_simple_client_barrier ()
  • rank 2 is stuck in pmi_simple_client_barrier ()

Here's the same test run successfully with pmix, just the disconnect messages

0: [corona188:3968360] common_ucx.c:440 disconnecting from rank 0
0: [corona188:3968360] common_ucx.c:440 disconnecting from rank 1
0: [corona188:3968360] common_ucx.c:404 waiting for 1 disconnect requests
2: [corona198:65924] common_ucx.c:440 disconnecting from rank 2
0: [corona188:3968360] common_ucx.c:440 disconnecting from rank 2
0: [corona188:3968360] common_ucx.c:404 waiting for 1 disconnect requests
1: [corona189:3811731] common_ucx.c:440 disconnecting from rank 0
0: [corona188:3968360] common_ucx.c:404 waiting for 0 disconnect requests
2: [corona198:65924] common_ucx.c:440 disconnecting from rank 0
1: [corona189:3811731] common_ucx.c:404 waiting for 1 disconnect requests
2: [corona198:65924] common_ucx.c:404 waiting for 1 disconnect requests
1: [corona189:3811731] common_ucx.c:440 disconnecting from rank 1
2: [corona198:65924] common_ucx.c:404 waiting for 0 disconnect requests
1: [corona189:3811731] common_ucx.c:404 waiting for 0 disconnect requests

As far as I can tell, the only difference is that in the failing trace, rank 0 never gets down to "waiting for 0 disconnect requests" even though it would appear that the other two ranks have sent them.

from flux-core.

garlick avatar garlick commented on September 26, 2024

As noted above, each rank calls opal_common_ucx_del_procs(), which sequentially calls

  • opal_common_ucx_del_procs_nofence()
  • opal_common_ucx_mca_pmix_fence()

Looking a bit deeper

  • opal_common_ucx_del_procs_nofence() sends disconnect requests with ucp_disconnect_nb()
  • opal_common_ucx_mca_pmix_fence() enters the PMI barrier by calling opal_pmix.fence_nb()
  • after calling fence_nb() a "progress loop" is entered.

The _nb versions of these functions are supposed to be non blocking, and UCX is making full use of this to concurrently wait for the barrier and get the disconnect requests sent in the progress loop.

However, the flux MCA plugin in openmpi implements fence_nb() in a blocking manner because it's based on PMI-1 and PMI-1 does not offer non-blocking interfaces. So a theory is that some disconnect messages don't make it out because the (unfortunately blocking) fence_nb() is called before the progress loop can process a queue of outbound messages.

Reworking the flux MCA plugin in openmpi to make the fence_nb() truly non-blocking is a nontrivial effort, and once complete would require rebuilding all the ompis. Given that and the fact that ompi dropped the flux MCA plugin in v5 and requires all launchers to implement pmix, running ompi + UCX with under flux with -o pmi=pmix really is the preferred solution to this.

from flux-core.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.