Comments (10)
The STAT graph is as follows (courtesy @lee218llnl)
ompi source ref (courtesy Tom)
from flux-core.
Hmm I seem to be able to reproduce this pretty consistently with a 3 node run of a simple MPI hello program.
[garlick@corona206:mpi]$ module list
Currently Loaded Modules:
1) flux_wrappers/0.1 3) gcc/10.3.1-magic
2) StdEnv (S) 4) openmpi/4.1.2
Where:
S: Module is Sticky, requires --force to unload or purge
[garlick@corona206:mpi]$ flux run -N3 -o verbose=2 ./hello
f2v32KFEw: completed MPI_Init in 1.379s. There are 3 tasks
f2v32KFEw: completed first barrier in 0.017s
[snip]
^Cflux-job: one more ctrl-C within 2s to cancel or ctrl-Z to detach
^C293.877s: flux-shell[0]: DEBUG: signals: forwarding signal 15 to tasks
293.877s: flux-shell[1]: DEBUG: signals: forwarding signal 15 to tasks
293.877s: job.exception type=cancel severity=0 interrupted by ctrl-C
293.877s: flux-shell[2]: DEBUG: signals: forwarding signal 15 to tasks
293.913s: flux-shell[0]: TRACE: pmi-simple: 0: C: pmi EOF
293.915s: flux-shell[0]: DEBUG: task 0 complete status=143
293.915s: flux-shell[1]: TRACE: pmi-simple: 1: C: pmi EOF
293.918s: flux-shell[1]: DEBUG: task 1 complete status=143
293.927s: flux-shell[1]: DEBUG: exit 143
298.918s: flux-shell[2]: TRACE: pmi-simple: 2: C: pmi EOF
flux-job: task(s) exited with exit code 143
If I attach wtih gdb while it's in the hung state:
rank 0
(gdb) bt
#0 0x00001555544abe87 in epoll_wait () from /lib64/libc.so.6
#1 0x000015554a1e49d9 in ucs_event_set_wait () from /lib64/libucs.so.0
#2 0x00001555500f1e2b in uct_tcp_iface_progress () from /lib64/libuct.so.0
#3 0x000015554a673a2a in ucp_worker_progress () from /lib64/libucp.so.0
#4 0x000015555054d5c4 in opal_common_ucx_wait_all_requests ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#5 0x000015555054e091 in opal_common_ucx_del_procs_nofence ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#6 0x000015555054e0b9 in opal_common_ucx_del_procs ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#7 0x0000155550556889 in mca_pml_ucx_del_procs ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pml_ucx.so
#8 0x000015555546ad82 in ompi_mpi_finalize ()
from /usr/tce/packages/openmpi/openmpi-4.1.2-gcc-10.3.1/lib/libmpi.so.40
#9 0x0000000000401151 in main (argc=<optimized out>, argv=<optimized out>)
at mpi/hello.c:53
rank 1
(gdb) bt
#0 0x0000155554ad5ab4 in read () from /lib64/libpthread.so.0
#1 0x0000155552340ed2 in read (__nbytes=1, __buf=0x499370, __fd=17)
at /usr/include/bits/unistd.h:36
#2 dgetline (fd=17,
buf=0x499370 "cmd=get_result rc=0 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCELYoGAA -\n", len=1216) at dgetline.c:24
#3 0x000015555233f319 in pmi_simple_client_barrier (pmi=0x499330)
at simple_client.c:167
#4 0x0000155552e23db9 in flux_fencenb ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pmix_flux.so
#5 0x000015555054defc in opal_common_ucx_mca_pmix_fence ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#6 0x0000155550556889 in mca_pml_ucx_del_procs ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pml_ucx.so
#7 0x000015555546ad82 in ompi_mpi_finalize ()
from /usr/tce/packages/openmpi/openmpi-4.1.2-gcc-10.3.1/lib/libmpi.so.40
#8 0x0000000000401151 in main (argc=<optimized out>, argv=<optimized out>)
at mpi/hello.c:53
rank 2
(gdb) bt
#0 0x0000155554ad5ab4 in read () from /lib64/libpthread.so.0
#1 0x0000155552340ed2 in read (__nbytes=1, __buf=0x4990c0, __fd=17)
at /usr/include/bits/unistd.h:36
#2 dgetline (fd=17,
buf=0x4990c0 "cmd=get_result rc=0 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCE7bY9AA -\n", len=1216) at dgetline.c:24
#3 0x000015555233f319 in pmi_simple_client_barrier (pmi=0x499080)
at simple_client.c:167
#4 0x0000155552e23db9 in flux_fencenb ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pmix_flux.so
#5 0x000015555054defc in opal_common_ucx_mca_pmix_fence ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/libmca_common_ucx.so.40
#6 0x0000155550556889 in mca_pml_ucx_del_procs ()
from /usr/tce/backend/installations/linux-rhel8-x86_64/gcc-10.3.1/openmpi-4.1.2-66kpmbgreoruj2exhnizo3iwcd2boybx/lib/openmpi/mca_pml_ucx.so
#7 0x000015555546ad82 in ompi_mpi_finalize ()
from /usr/tce/packages/openmpi/openmpi-4.1.2-gcc-10.3.1/lib/libmpi.so.40
#8 0x0000000000401151 in main (argc=<optimized out>, argv=<optimized out>)
at mpi/hello.c:53
from flux-core.
pmix trace of the same 3 node hello run but with -o pmi=pmix
which works for some reason. Maybe there's some clue to be found here.
[garlick@corona206:mpi]$ flux run -o verbose=2 -o pmi=pmix -N3 ./hello
0.045s: flux-shell[2]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.045s: flux-shell[2]: TRACE: Successfully loaded flux.shell module
0.045s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.047s: flux-shell[2]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.048s: flux-shell[2]: DEBUG: pmix: server is enabled
0.048s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.051s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.051s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.052s: flux-shell[2]: DEBUG: mpibind: disabling cpu-affinity
0.052s: flux-shell[2]: DEBUG: mpibind: disabling gpu-affinity
0.052s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.053s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.053s: flux-shell[2]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.053s: flux-shell[2]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.053s: flux-shell[2]: DEBUG: pmix: jobid = 15511811260416
0.053s: flux-shell[2]: DEBUG: pmix: shell_rank = 2
0.053s: flux-shell[2]: DEBUG: pmix: local_nprocs = 1
0.053s: flux-shell[2]: DEBUG: pmix: total_nprocs = 3
0.058s: flux-shell[2]: DEBUG: pmix: local_peers = 2
0.058s: flux-shell[2]: DEBUG: pmix: node_map = corona206,corona207,corona214
0.058s: flux-shell[2]: DEBUG: pmix: proc_map = 0;1;2
0.059s: flux-shell[2]: DEBUG: mpibind: loaded topology from job shell
0.061s: flux-shell[2]: DEBUG: mpibind: flux given cores: 0-47
0.061s: flux-shell[2]: DEBUG: mpibind: derived pus: 0-95
0.061s: flux-shell[2]: DEBUG: mpibind: flux given gpus: 0-7
0.061s: flux-shell[2]: DEBUG: mpibind: total #cores: 48
0.061s: flux-shell[2]: DEBUG: mpibind: total #pus: 96
0.061s: flux-shell[2]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided
0.061s: flux-shell[2]: DEBUG: mpibind: task 0: cpus 0-95
0.061s: flux-shell[2]: DEBUG: 2: task 2 on cores 0-47
0.064s: flux-shell[2]: DEBUG: mpibind: task 0: setting OMP_NUM_THREADS=96
0.064s: flux-shell[2]: DEBUG: mpibind: task 0: setting OMP_PLACES=threads
0.065s: flux-shell[2]: DEBUG: mpibind: task 0: setting OMP_PROC_BIND=spread
0.065s: flux-shell[2]: DEBUG: mpibind: task 0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.045s: flux-shell[1]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.046s: flux-shell[1]: TRACE: Successfully loaded flux.shell module
0.046s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.048s: flux-shell[1]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.049s: flux-shell[1]: DEBUG: pmix: server is enabled
0.049s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.053s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.053s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.054s: flux-shell[1]: DEBUG: mpibind: disabling cpu-affinity
0.054s: flux-shell[1]: DEBUG: mpibind: disabling gpu-affinity
0.054s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.054s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.054s: flux-shell[1]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.054s: flux-shell[1]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.054s: flux-shell[1]: DEBUG: pmix: jobid = 15511811260416
0.054s: flux-shell[1]: DEBUG: pmix: shell_rank = 1
0.054s: flux-shell[1]: DEBUG: pmix: local_nprocs = 1
0.054s: flux-shell[1]: DEBUG: pmix: total_nprocs = 3
0.060s: flux-shell[1]: DEBUG: pmix: local_peers = 1
0.060s: flux-shell[1]: DEBUG: pmix: node_map = corona206,corona207,corona214
0.060s: flux-shell[1]: DEBUG: pmix: proc_map = 0;1;2
0.061s: flux-shell[1]: DEBUG: mpibind: loaded topology from job shell
0.063s: flux-shell[1]: DEBUG: mpibind: flux given cores: 0-47
0.063s: flux-shell[1]: DEBUG: mpibind: derived pus: 0-95
0.063s: flux-shell[1]: DEBUG: mpibind: flux given gpus: 0-7
0.063s: flux-shell[1]: DEBUG: mpibind: total #cores: 48
0.063s: flux-shell[1]: DEBUG: mpibind: total #pus: 96
0.063s: flux-shell[1]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided
0.063s: flux-shell[1]: DEBUG: mpibind: task 0: cpus 0-95
0.063s: flux-shell[1]: DEBUG: 1: task 1 on cores 0-47
0.064s: flux-shell[1]: DEBUG: mpibind: task 0: setting OMP_NUM_THREADS=96
0.065s: flux-shell[1]: DEBUG: mpibind: task 0: setting OMP_PLACES=threads
0.065s: flux-shell[1]: DEBUG: mpibind: task 0: setting OMP_PROC_BIND=spread
0.065s: flux-shell[1]: DEBUG: mpibind: task 0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.043s: flux-shell[0]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.043s: flux-shell[0]: TRACE: Successfully loaded flux.shell module
0.043s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.045s: flux-shell[0]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.046s: flux-shell[0]: DEBUG: pmix: server is enabled
0.046s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.050s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.050s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.051s: flux-shell[0]: DEBUG: mpibind: disabling cpu-affinity
0.051s: flux-shell[0]: DEBUG: mpibind: disabling gpu-affinity
0.051s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.051s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.052s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.052s: flux-shell[0]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.052s: flux-shell[0]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.052s: flux-shell[0]: DEBUG: pmix: jobid = 15511811260416
0.052s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.052s: flux-shell[0]: DEBUG: pmix: local_nprocs = 1
0.052s: flux-shell[0]: DEBUG: pmix: total_nprocs = 3
0.052s: flux-shell[0]: DEBUG: pmix: server outsourced to 3.2.5a1
0.058s: flux-shell[0]: DEBUG: pmix: local_peers = 0
0.058s: flux-shell[0]: DEBUG: pmix: node_map = corona206,corona207,corona214
0.058s: flux-shell[0]: DEBUG: pmix: proc_map = 0;1;2
0.059s: flux-shell[0]: DEBUG: mpibind: loaded topology from job shell
0.061s: flux-shell[0]: DEBUG: mpibind: flux given cores: 0-47
0.061s: flux-shell[0]: DEBUG: mpibind: derived pus: 0-95
0.061s: flux-shell[0]: DEBUG: mpibind: flux given gpus: 0-7
0.061s: flux-shell[0]: DEBUG: mpibind: total #cores: 48
0.061s: flux-shell[0]: DEBUG: mpibind: total #pus: 96
0.061s: flux-shell[0]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided
0.061s: flux-shell[0]: DEBUG: mpibind: task 0: cpus 0-95
0.061s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=3 cores_per_slot=1 slots_per_node=1
0.061s: flux-shell[0]: DEBUG: 0: task 0 on cores 0-47
0.065s: flux-shell[0]: DEBUG: mpibind: task 0: setting OMP_NUM_THREADS=96
0.065s: flux-shell[0]: DEBUG: mpibind: task 0: setting OMP_PLACES=threads
0.065s: flux-shell[0]: DEBUG: mpibind: task 0: setting OMP_PROC_BIND=spread
0.065s: flux-shell[0]: DEBUG: mpibind: task 0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
1.424s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"AAAAAQAPAAAAAAAAAbAAAAABAQAAAAEADwAAAAAAAAGdAAAAAQAAAAEAAAABAAAAD2J0bC5vcGVuaWIuNC4xAAAbAA8AAAAAAAAANgEAAAAAAACA/lsAAAAFAAAAyQIAABsQAAAAAQM/GACbAABbAAEA/oAAAAAAAAAMQqEDAFqtvAAAAAEAAAAMbXRsLm9maS40LjEAABsADwAAAAAAAAAgA1ybAFsAAAAAAAAAAACA/rytWgADoUIMAAAAAAAAAAAAAAABAAAADHBtbC51Y3guNC4xAAAbAA8AAAAAAAAA3CBmaBrU154oJCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAgoy/IQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCxHshBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKC1ySDAFsAfjl3zCsz70g4UNsqMTUywyMABV6bADEEJ4qwD6Ez70g4ULAPITUy3yIAAHrUd8wrM+9IOFCwDyE1Mt8iAAEAMdFcfOEz70g4UEUdKTUyAycAA2qbALMPd8yrM+9IOFBFHSk1MgMnAINtmwA=","cbfunc":23455932359280,"cbdata":23455756017600}
1.424s: flux-shell[1]: TRACE: pmix: starting pmix exchange 0: size 446
1.440s: flux-shell[2]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"AAAAAQAPAAAAAAAAAbAAAAABAQAAAAEADwAAAAAAAAGdAAAAAQAAAAIAAAABAAAAD2J0bC5vcGVuaWIuNC4xAAAbAA8AAAAAAAAANgEAAAAAAACA/mIAAAAFAAAAyQIAABsQAAAAAQM/GAJmAQBiAAEA/oAAAAAAAAAMQqEDAFqs4AAAAAEAAAAMbXRsLm9maS40LjEAABsADwAAAAAAAAAgA15mAWIAAAAAAAAAAACA/uCsWgADoUIMAAAAAAAAAAAAAAABAAAADHBtbC51Y3guNC4xAAAbAA8AAAAAAAAA3CCYh16fZtn0WiEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgot7IQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCC7hEhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKOJySDAGIAfjl3zCsz70g4UNsqMTUywyMABWFmAQgEJ4qwD6Ez70g4ULAPITUy3yIAAHrUd8wrM+9IOFCwDyE1Mt8iAAEAMdFcfOEz70g4UEUdKTUyAycAA2tmAbMPd8yrM+9IOFBFHSk1MgMnAINvZgE=","cbfunc":23455932359280,"cbdata":23455756017600}
1.440s: flux-shell[2]: TRACE: pmix: starting pmix exchange 0: size 446
1.445s: flux-shell[2]: TRACE: pmix: completed pmix exchange 0: size 1375 SUCCESS
1.444s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"AAAAAQAPAAAAAAAAAdUAAAABAQAAAAEADwAAAAAAAAHCAAAAAQAAAAAAAAABAAAAD2J0bC5vcGVuaWIuNC4xAAAbAA8AAAAAAAAANgEAAAAAAACA/oEAAAAFAAAAyQIAABsQAAAAAQM/GEBuAACBAAEA/oAAAAAAAAAMQqEDAFqtwAAAAAEAAAAMbXRsLm9maS40LjEAABsADwAAAAAAAAAgA5xuAIEAAAAAAAAAAACA/sCtWgADoUIMAAAAAAAAAAAAAAABAAAADHBtbC51Y3guNC4xAAAbAA8AAAAAAAAA3CA0gG7ZUEHjqSEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgp49IQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCphkhEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKBeySDAIEAfjl3zCsz70g4UNsqMTUywyMABZ5uABgEJ4qwD6Ez70g4ULAPITUy3yIAAHrUd8wrM+9IOFCwDyE1Mt8iAAEAMdFcfOEz70g4UEUdKTUyAycAA6puALMPd8yrM+9IOFBFHSk1MgMnAIOtbgAAAAABAAAADXBtbC5iYXNlLjIuMAAAGwAPAAAAAAAAAAR1Y3gA","cbfunc":23455932359280,"cbdata":23455756017600}
1.444s: flux-shell[0]: TRACE: pmix: starting pmix exchange 0: size 483
1.445s: flux-shell[0]: TRACE: pmix: completed pmix exchange 0: size 1375 SUCCESS
1.445s: flux-shell[1]: TRACE: pmix: completed pmix exchange 0: size 1375 SUCCESS
1.459s: flux-shell[2]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.459s: flux-shell[2]: TRACE: pmix: starting pmix exchange 1: size 0
1.460s: flux-shell[2]: TRACE: pmix: completed pmix exchange 1: size 0 SUCCESS
1.459s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.459s: flux-shell[1]: TRACE: pmix: starting pmix exchange 1: size 0
1.460s: flux-shell[1]: TRACE: pmix: completed pmix exchange 1: size 0 SUCCESS
1.460s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.460s: flux-shell[0]: TRACE: pmix: starting pmix exchange 1: size 0
1.460s: flux-shell[0]: TRACE: pmix: completed pmix exchange 1: size 0 SUCCESS
1.472s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.472s: flux-shell[1]: TRACE: pmix: starting pmix exchange 2: size 0
1.476s: flux-shell[1]: TRACE: pmix: completed pmix exchange 2: size 0 SUCCESS
1.476s: flux-shell[1]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.476s: flux-shell[1]: TRACE: pmix: starting pmix exchange 3: size 0
1.476s: flux-shell[1]: TRACE: pmix: completed pmix exchange 3: size 0 SUCCESS
1.475s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.475s: flux-shell[0]: TRACE: pmix: starting pmix exchange 2: size 0
1.476s: flux-shell[0]: TRACE: pmix: completed pmix exchange 2: size 0 SUCCESS
1.476s: flux-shell[0]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.476s: flux-shell[0]: TRACE: pmix: starting pmix exchange 3: size 0
1.476s: flux-shell[0]: TRACE: pmix: completed pmix exchange 3: size 0 SUCCESS
1.475s: flux-shell[2]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.475s: flux-shell[2]: TRACE: pmix: starting pmix exchange 2: size 0
1.476s: flux-shell[2]: TRACE: pmix: completed pmix exchange 2: size 0 SUCCESS
1.476s: flux-shell[2]: TRACE: pmix: pmix server fence_upcall {"procs":[{"nspace":"f82UCeKLK","rank":-2}],"info":[],"data":"","cbfunc":23455932359280,"cbdata":23455756017600}
1.476s: flux-shell[2]: TRACE: pmix: starting pmix exchange 3: size 0
1.476s: flux-shell[2]: TRACE: pmix: completed pmix exchange 3: size 0 SUCCESS
1.560s: flux-shell[1]: DEBUG: task 1 complete status=0
1.560s: flux-shell[2]: DEBUG: task 2 complete status=0
1.570s: flux-shell[1]: DEBUG: exit 0
1.571s: flux-shell[2]: DEBUG: exit 0
1.562s: flux-shell[0]: DEBUG: task 0 complete status=0
f82UCeKLK: completed MPI_Init in 1.385s. There are 3 tasks
f82UCeKLK: completed first barrier in 0.009s
f82UCeKLK: completed MPI_Finalize in 0.083s
1.573s: flux-shell[0]: DEBUG: exit 0
from flux-core.
And the failing one with simple pmi
[garlick@corona206:mpi]$ flux run --label-io -o verbose=2 -N3 ./hello
0.066s: flux-shell[2]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.066s: flux-shell[2]: TRACE: Successfully loaded flux.shell module
0.066s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.069s: flux-shell[2]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.070s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.073s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.073s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.077s: flux-shell[2]: DEBUG: mpibind: disabling cpu-affinity
0.077s: flux-shell[2]: DEBUG: mpibind: disabling gpu-affinity
0.077s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.077s: flux-shell[2]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.077s: flux-shell[2]: DEBUG: pmi-simple: simple wire protocol is enabled
0.077s: flux-shell[2]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.077s: flux-shell[2]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.078s: flux-shell[2]: DEBUG: mpibind: loaded topology from job shell
0.080s: flux-shell[2]: DEBUG: mpibind: flux given cores: 0-47
0.080s: flux-shell[2]: DEBUG: mpibind: derived pus: 0-95
0.080s: flux-shell[2]: DEBUG: mpibind: flux given gpus: 0-7
0.080s: flux-shell[2]: DEBUG: mpibind: total #cores: 48
0.080s: flux-shell[2]: DEBUG: mpibind: total #pus: 96
0.080s: flux-shell[2]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided
0.080s: flux-shell[2]: DEBUG: mpibind: task 0: cpus 0-95
0.080s: flux-shell[2]: DEBUG: 2: task 2 on cores 0-47
0.083s: flux-shell[2]: DEBUG: mpibind: task 0: setting OMP_NUM_THREADS=96
0.084s: flux-shell[2]: DEBUG: mpibind: task 0: setting OMP_PLACES=threads
0.084s: flux-shell[2]: DEBUG: mpibind: task 0: setting OMP_PROC_BIND=spread
0.084s: flux-shell[2]: DEBUG: mpibind: task 0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.068s: flux-shell[1]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.068s: flux-shell[1]: TRACE: Successfully loaded flux.shell module
0.068s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.071s: flux-shell[1]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.072s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.076s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.076s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.079s: flux-shell[1]: DEBUG: mpibind: disabling cpu-affinity
0.079s: flux-shell[1]: DEBUG: mpibind: disabling gpu-affinity
0.079s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.079s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.079s: flux-shell[1]: DEBUG: pmi-simple: simple wire protocol is enabled
0.079s: flux-shell[1]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.080s: flux-shell[1]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.080s: flux-shell[1]: DEBUG: mpibind: loaded topology from job shell
0.082s: flux-shell[1]: DEBUG: mpibind: flux given cores: 0-47
0.082s: flux-shell[1]: DEBUG: mpibind: derived pus: 0-95
0.082s: flux-shell[1]: DEBUG: mpibind: flux given gpus: 0-7
0.082s: flux-shell[1]: DEBUG: mpibind: total #cores: 48
0.082s: flux-shell[1]: DEBUG: mpibind: total #pus: 96
0.082s: flux-shell[1]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided
0.082s: flux-shell[1]: DEBUG: mpibind: task 0: cpus 0-95
0.082s: flux-shell[1]: DEBUG: 1: task 1 on cores 0-47
0.083s: flux-shell[1]: DEBUG: mpibind: task 0: setting OMP_NUM_THREADS=96
0.083s: flux-shell[1]: DEBUG: mpibind: task 0: setting OMP_PLACES=threads
0.083s: flux-shell[1]: DEBUG: mpibind: task 0: setting OMP_PROC_BIND=spread
0.083s: flux-shell[1]: DEBUG: mpibind: task 0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.043s: flux-shell[0]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.043s: flux-shell[0]: TRACE: Successfully loaded flux.shell module
0.043s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.046s: flux-shell[0]: DEBUG: data-staging: Jobspec does not contain data-staging attributes. No staging necessary.
0.047s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/ascii.lua
0.050s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.050s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/mpibind.lua
0.054s: flux-shell[0]: DEBUG: mpibind: disabling cpu-affinity
0.054s: flux-shell[0]: DEBUG: mpibind: disabling gpu-affinity
0.054s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.054s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/rocr.lua
0.054s: flux-shell[0]: DEBUG: pmi-simple: simple wire protocol is enabled
0.054s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.055s: flux-shell[0]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.055s: flux-shell[0]: DEBUG: gpu-affinity: disabling affinity due to gpu-affinity=off
0.056s: flux-shell[0]: DEBUG: mpibind: loaded topology from job shell
0.057s: flux-shell[0]: DEBUG: mpibind: flux given cores: 0-47
0.058s: flux-shell[0]: DEBUG: mpibind: derived pus: 0-95
0.058s: flux-shell[0]: DEBUG: mpibind: flux given gpus: 0-7
0.058s: flux-shell[0]: DEBUG: mpibind: total #cores: 48
0.058s: flux-shell[0]: DEBUG: mpibind: total #pus: 96
0.058s: flux-shell[0]: DEBUG: mpibind: user opts: ntasks=1 nthreads=0 restrict=0-95 greedy=-1 smt=-1 gpu_optim=-1 verbose=0 master=0 corespec_first=0 corespec_numa=0 corespec_bal=0 visible_devices=0 omp_proc_bind=0 omp_places=0 xml=shell-provided
0.058s: flux-shell[0]: DEBUG: mpibind: task 0: cpus 0-95
0.058s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=3 cores_per_slot=1 slots_per_node=1
0.058s: flux-shell[0]: DEBUG: 0: task 0 on cores 0-47
0.084s: flux-shell[0]: DEBUG: mpibind: task 0: setting OMP_NUM_THREADS=96
0.084s: flux-shell[0]: DEBUG: mpibind: task 0: setting OMP_PLACES=threads
0.084s: flux-shell[0]: DEBUG: mpibind: task 0: setting OMP_PROC_BIND=spread
0.084s: flux-shell[0]: DEBUG: mpibind: task 0: setting ROCR_VISIBLE_DEVICES=2,3,1,0,6,7,5,4
0.129s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=init pmi_version=1 pmi_subversion=1
0.129s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0.129s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_maxes
0.129s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_my_kvsname
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_my_kvsname
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=flux.taskmap
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value={"version":1,"map":[[0,3,1,1]]}
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_universe_size
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=universe_size rc=0 size=3
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get_appnum
0.130s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=appnum rc=0 appnum=0
0.133s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key0
0.133s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=4
0.130s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=init pmi_version=1 pmi_subversion=1
0.130s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0.130s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_maxes
0.130s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_my_kvsname
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_my_kvsname
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=flux.taskmap
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value={"version":1,"map":[[0,3,1,1]]}
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_universe_size
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=universe_size rc=0 size=3
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get_appnum
0.131s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=appnum rc=0 appnum=0
0.134s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key0
0.134s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=4
0.131s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=init pmi_version=1 pmi_subversion=1
0.131s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
0.131s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_maxes
0.131s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=maxes rc=0 kvsname_max=64 keylen_max=64 vallen_max=1024
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_my_kvsname
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_my_kvsname
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=my_kvsname rc=0 kvsname=fEQLAnHv7
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=flux.taskmap
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value={"version":1,"map":[[0,3,1,1]]}
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_universe_size
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=universe_size rc=0 size=3
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get_appnum
0.132s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=appnum rc=0 appnum=0
0.135s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key0
0.135s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=4
1.427s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=put kvsname=fEQLAnHv7 key=0-0-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDYAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP6BAAAABQAAAMkCAAAbEAAAAAEDPxgkbwAAgQABAP6AAAAAAAAADEKhAwBarcBtdGwub2ZpLjQuMQAxNAAwMDIwAAOAbwCBAAAAAAAAAAAAgP7ArVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgWUDHX2lXpu8hBgACwKiAzs8ZF7dROAoaLlAjra42ABMjAIKIySEGAALAqEDOzxkXt1E4v1PiTHbPwTYBEyMAgosNIRIBAm8Rsgo9cBu+mAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCnLckgwCBAH45d8wrM+9IOFDbKjE1MsMjAAWDbwAYBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAAOObwCzD3fMqzPvSDhQRR0pNTIDJwCDkW8AcG1sLnVjeC40LjEAMTQAMDE2NAAgWUDHX2lXpu8gAGN1d8wrMgD4108AAAAAAE8DAIihSrGsXNBN3yEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgojJIQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCiw0hEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKct0IIbxGyCj1wGz6NU3fMKzIAAHBQd8yrMwBPIwCIHwAHAAAAAABDCG8Rsgo9cBs+p9N3zCsyAABwUHfMqzMATyMAiAixP8AJAADAJAMAgQB+OXfMKzPvSDhQ2yoxNTLDIwAFg28AGAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADjm8Asw93zKsz70g4UEUdKTUyAycAg5FvACW
1.427s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=put_result rc=0
1.434s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=put kvsname=fEQLAnHv7 key=0-0-key1 value=IbxGyCj1wGz5Hi703BjYAJC5Qd8yrMwCSIQCECLE/AHBtbC5iYXNlLjIuMAAxNAAwMDA0AHVjeAA -
1.434s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=put_result rc=0
1.434s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=barrier_in
1.452s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fEQLAnHv7 key=0-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.452s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0
1.462s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fEQLAnHv7 key=0-1-key1 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCELLk9AA -
1.462s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0
1.460s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=put kvsname=fEQLAnHv7 key=0-2-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMTQAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5iAAAABQAAAMkCAAAbEAAAAAEDPxjkZgEAYgABAP6AAAAAAAAADEKhAwBarOBtdGwub2ZpLjQuMQAxNAAwMDIwAANAZwFiAAAAAAAAAAAAgP7grFoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgTyjnvBJrLgchBgACwKiA1s8ZF7dROAoaLlAjra42ABMjAIKycSEGAALAqEDWzxkXt1E4v1PiTHbPwTYBEyMAgqazIRIBAszr2at66x2CmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCyYEkgwBiAH45d8wrM+9IOFDbKjE1MsMjAAVCZwEIBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANOZwGzD3fMqzPvSDhQRR0pNTIDJwCDUWcBcG1sLnVjeC40LjEAMTQAMDE2NAAgTyjnvBJrLgcgAGN1d8wrMgD4108AAAAAAE8DAIi9VzmKHgS14yEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgrJxIQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCCprMhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILJgUIIzOvZq3rrHQKNU3fMKzIAAHBQd8yrMwBPIwCIMABuAAAAAABDCMzr2at66x0Cp9N3zCsyAABwUHfMqzMATyMAiH6MBsAJAADAJAMAYgB+OXfMKzPvSDhQ2yoxNTLDIwAFQmcBCAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTmcBsw93zKsz70g4UEUdKTUyAycAg1FnASW
1.460s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=put_result rc=0
1.466s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=put kvsname=fEQLAnHv7 key=0-2-key1 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCEfowGAA -
1.466s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=put_result rc=0
1.467s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=barrier_in
1.467s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=barrier_out rc=0
1.468s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key0
1.468s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMTQAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5iAAAABQAAAMkCAAAbEAAAAAEDPxjkZgEAYgABAP6AAAAAAAAADEKhAwBarOBtdGwub2ZpLjQuMQAxNAAwMDIwAANAZwFiAAAAAAAAAAAAgP7grFoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgTyjnvBJrLgchBgACwKiA1s8ZF7dROAoaLlAjra42ABMjAIKycSEGAALAqEDWzxkXt1E4v1PiTHbPwTYBEyMAgqazIRIBAszr2at66x2CmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCyYEkgwBiAH45d8wrM+9IOFDbKjE1MsMjAAVCZwEIBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANOZwGzD3fMqzPvSDhQRR0pNTIDJwCDUWcBcG1sLnVjeC40LjEAMTQAMDE2NAAgTyjnvBJrLgcgAGN1d8wrMgD4108AAAAAAE8DAIi9VzmKHgS14yEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgrJxIQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCCprMhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILJgUIIzOvZq3rrHQKNU3fMKzIAAHBQd8yrMwBPIwCIMABuAAAAAABDCMzr2at66x0Cp9N3zCsyAABwUHfMqzMATyMAiH6MBsAJAADAJAMAYgB+OXfMKzPvSDhQ2yoxNTLDIwAFQmcBCAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTmcBsw93zKsz70g4UEUdKTUyAycAg1FnASW
1.468s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key1
1.468s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCEfowGAA -
1.462s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=barrier_in
1.467s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=barrier_out rc=0
1.468s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key0
1.468s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.468s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key1
1.468s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCELLk9AA -
1.467s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=barrier_out rc=0
1.467s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key0
1.467s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDYAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP6BAAAABQAAAMkCAAAbEAAAAAEDPxgkbwAAgQABAP6AAAAAAAAADEKhAwBarcBtdGwub2ZpLjQuMQAxNAAwMDIwAAOAbwCBAAAAAAAAAAAAgP7ArVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgWUDHX2lXpu8hBgACwKiAzs8ZF7dROAoaLlAjra42ABMjAIKIySEGAALAqEDOzxkXt1E4v1PiTHbPwTYBEyMAgosNIRIBAm8Rsgo9cBu+mAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCnLckgwCBAH45d8wrM+9IOFDbKjE1MsMjAAWDbwAYBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAAOObwCzD3fMqzPvSDhQRR0pNTIDJwCDkW8AcG1sLnVjeC40LjEAMTQAMDE2NAAgWUDHX2lXpu8gAGN1d8wrMgD4108AAAAAAE8DAIihSrGsXNBN3yEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgojJIQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCiw0hEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKct0IIbxGyCj1wGz6NU3fMKzIAAHBQd8yrMwBPIwCIHwAHAAAAAABDCG8Rsgo9cBs+p9N3zCsyAABwUHfMqzMATyMAiAixP8AJAADAJAMAgQB+OXfMKzPvSDhQ2yoxNTLDIwAFg28AGAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADjm8Asw93zKsz70g4UEUdKTUyAycAg5FvACW
1.468s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key1
1.468s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=IbxGyCj1wGz5Hi703BjYAJC5Qd8yrMwCSIQCECLE/AHBtbC5iYXNlLjIuMAAxNAAwMDA0AHVjeAA -
1.482s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=barrier_in
1.483s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=barrier_out rc=0
1.483s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key0
1.483s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDYAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP6BAAAABQAAAMkCAAAbEAAAAAEDPxgkbwAAgQABAP6AAAAAAAAADEKhAwBarcBtdGwub2ZpLjQuMQAxNAAwMDIwAAOAbwCBAAAAAAAAAAAAgP7ArVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgWUDHX2lXpu8hBgACwKiAzs8ZF7dROAoaLlAjra42ABMjAIKIySEGAALAqEDOzxkXt1E4v1PiTHbPwTYBEyMAgosNIRIBAm8Rsgo9cBu+mAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCnLckgwCBAH45d8wrM+9IOFDbKjE1MsMjAAWDbwAYBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAAOObwCzD3fMqzPvSDhQRR0pNTIDJwCDkW8AcG1sLnVjeC40LjEAMTQAMDE2NAAgWUDHX2lXpu8gAGN1d8wrMgD4108AAAAAAE8DAIihSrGsXNBN3yEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgojJIQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCiw0hEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKct0IIbxGyCj1wGz6NU3fMKzIAAHBQd8yrMwBPIwCIHwAHAAAAAABDCG8Rsgo9cBs+p9N3zCsyAABwUHfMqzMATyMAiAixP8AJAADAJAMAgQB+OXfMKzPvSDhQ2yoxNTLDIwAFg28AGAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADjm8Asw93zKsz70g4UEUdKTUyAycAg5FvACW
1.484s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key1
1.484s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=IbxGyCj1wGz5Hi703BjYAJC5Qd8yrMwCSIQCECLE/AHBtbC5iYXNlLjIuMAAxNAAwMDA0AHVjeAA -
1.482s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=barrier_in
1.483s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=barrier_out rc=0
1.482s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=barrier_in
1.483s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=barrier_out rc=0
1.483s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key0
1.483s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDYAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP6BAAAABQAAAMkCAAAbEAAAAAEDPxgkbwAAgQABAP6AAAAAAAAADEKhAwBarcBtdGwub2ZpLjQuMQAxNAAwMDIwAAOAbwCBAAAAAAAAAAAAgP7ArVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgWUDHX2lXpu8hBgACwKiAzs8ZF7dROAoaLlAjra42ABMjAIKIySEGAALAqEDOzxkXt1E4v1PiTHbPwTYBEyMAgosNIRIBAm8Rsgo9cBu+mAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCnLckgwCBAH45d8wrM+9IOFDbKjE1MsMjAAWDbwAYBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAAOObwCzD3fMqzPvSDhQRR0pNTIDJwCDkW8AcG1sLnVjeC40LjEAMTQAMDE2NAAgWUDHX2lXpu8gAGN1d8wrMgD4108AAAAAAE8DAIihSrGsXNBN3yEGAALAqIDOzxkXt1E4ChouUCOtrjYAEyMAgojJIQYAAsCoQM7PGRe3UTi/U+JMds/BNgETIwCCiw0hEgECbxGyCj1wG76YAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAIKct0IIbxGyCj1wGz6NU3fMKzIAAHBQd8yrMwBPIwCIHwAHAAAAAABDCG8Rsgo9cBs+p9N3zCsyAABwUHfMqzMATyMAiAixP8AJAADAJAMAgQB+OXfMKzPvSDhQ2yoxNTLDIwAFg28AGAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADjm8Asw93zKsz70g4UEUdKTUyAycAg5FvACW
1.483s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-0-key1
1.484s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=IbxGyCj1wGz5Hi703BjYAJC5Qd8yrMwCSIQCECLE/AHBtbC5iYXNlLjIuMAAxNAAwMDA0AHVjeAA -
1.493s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key0
1.493s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.494s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key1
1.494s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCELLk9AA -
1.496s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key0
1.496s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMTQAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5iAAAABQAAAMkCAAAbEAAAAAEDPxjkZgEAYgABAP6AAAAAAAAADEKhAwBarOBtdGwub2ZpLjQuMQAxNAAwMDIwAANAZwFiAAAAAAAAAAAAgP7grFoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgTyjnvBJrLgchBgACwKiA1s8ZF7dROAoaLlAjra42ABMjAIKycSEGAALAqEDWzxkXt1E4v1PiTHbPwTYBEyMAgqazIRIBAszr2at66x2CmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCyYEkgwBiAH45d8wrM+9IOFDbKjE1MsMjAAVCZwEIBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANOZwGzD3fMqzPvSDhQRR0pNTIDJwCDUWcBcG1sLnVjeC40LjEAMTQAMDE2NAAgTyjnvBJrLgcgAGN1d8wrMgD4108AAAAAAE8DAIi9VzmKHgS14yEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgrJxIQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCCprMhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILJgUIIzOvZq3rrHQKNU3fMKzIAAHBQd8yrMwBPIwCIMABuAAAAAABDCMzr2at66x0Cp9N3zCsyAABwUHfMqzMATyMAiH6MBsAJAADAJAMAYgB+OXfMKzPvSDhQ2yoxNTLDIwAFQmcBCAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTmcBsw93zKsz70g4UEUdKTUyAycAg1FnASW
1.496s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key1
1.496s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=get_result rc=0 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCEfowGAA -
1.499s: flux-shell[0]: TRACE: pmi-simple: 0: C: cmd=barrier_in
1.499s: flux-shell[0]: TRACE: pmi-simple: 0: S: cmd=barrier_out rc=0
1.496s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=barrier_in
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=barrier_out rc=0
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key0
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMTQAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5iAAAABQAAAMkCAAAbEAAAAAEDPxjkZgEAYgABAP6AAAAAAAAADEKhAwBarOBtdGwub2ZpLjQuMQAxNAAwMDIwAANAZwFiAAAAAAAAAAAAgP7grFoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAgTyjnvBJrLgchBgACwKiA1s8ZF7dROAoaLlAjra42ABMjAIKycSEGAALAqEDWzxkXt1E4v1PiTHbPwTYBEyMAgqazIRIBAszr2at66x2CmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCCyYEkgwBiAH45d8wrM+9IOFDbKjE1MsMjAAVCZwEIBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANOZwGzD3fMqzPvSDhQRR0pNTIDJwCDUWcBcG1sLnVjeC40LjEAMTQAMDE2NAAgTyjnvBJrLgcgAGN1d8wrMgD4108AAAAAAE8DAIi9VzmKHgS14yEGAALAqIDWzxkXt1E4ChouUCOtrjYAEyMAgrJxIQYAAsCoQNbPGRe3UTi/U+JMds/BNgETIwCCprMhEgECzOvZq3rrHYKYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILJgUIIzOvZq3rrHQKNU3fMKzIAAHBQd8yrMwBPIwCIMABuAAAAAABDCMzr2at66x0Cp9N3zCsyAABwUHfMqzMATyMAiH6MBsAJAADAJAMAYgB+OXfMKzPvSDhQ2yoxNTLDIwAFQmcBCAQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTmcBsw93zKsz70g4UEUdKTUyAycAg1FnASW
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=get kvsname=fEQLAnHv7 key=0-2-key1
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=get_result rc=0 value=IzOvZq3rrHQJHi703BjYAJC5Qd8yrMwCSIQCEfowGAA -
1.500s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=barrier_in
1.499s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=barrier_in
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=barrier_out rc=0
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key0
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=get kvsname=fEQLAnHv7 key=0-1-key1
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: S: cmd=get_result rc=0 value=IEZTtisloWTRHi703BjYAJC5Qd8yrMwCSIQCELLk9AA -
1.500s: flux-shell[2]: TRACE: pmi-simple: 2: C: cmd=barrier_in
0: fEQLAnHv7: completed MPI_Init in 1.379s. There are 3 tasks
0: fEQLAnHv7: completed first barrier in 0.016s
Sep 20 14:36:09.614316 broker.err[0]: rc2.0: /bin/bash Hangup (rc=129) 1796.2s
1800.980s: job.exception type=timeout severity=0 resource allocation expired
[detached: session exiting]
from flux-core.
Ah one red flag. The values being retrieved from the PMI KVS are exactly 1024 bytes in length. In simple PMI we have
#define SIMPLE_KVS_VAL_MAX 1024
Looking back to the puts, I see values of size 1053. Take this example, and note that the return code is 0 (success).
1.452s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fEQLAnHv7 key=0-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTk1AHBtaXguaG5hbWUAMDMAMDAwYQBjb3JvbmEyMDcAYnRsLm9wZW5pYi40LjEAMTQAMDAzNgABAAAAAAAAgP5bAAAABQAAAMkCAAAbEAAAAAEDPxjimwAAWwABAP6AAAAAAAAADEKhAwBarbxtdGwub2ZpLjQuMQAxNAAwMDIwAAM+nABbAAAAAAAAAAAAgP68rVoAA6FCDAAAAAAAAAAAcG1sLnVjeC40LjEAMTQAMDBkYwAg9gECSPg9FVIhEgECEZTtisloWbSYAADwAAAAAM8ZF7dROFOePkvX4Dc3ARMjAILmKyEGAALAqIDPzxkXt1E4ChouUCOtrjYAEyMAgpjjIQYAAsCoQM/PGRe3UTi/U+JMds/BNgETIwCCpPskgwBbAH45d8wrM+9IOFDbKjE1MsMjAAVAnAAxBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANMnACzD3fMqzPvSDhQRR0pNTIDJwCDT5wAcG1sLnVjeC40LjEAMTQAMDE2NAAg9gECSPg9FVIgAGN1d8wrMgD4108AAAAAAE8DAIjub/N2VuY6fCESAQIRlO2KyWhZtJgAAPAAAAAAzxkXt1E4U54+S9fgNzcBEyMAguYrIQYAAsCogM/PGRe3UTgKGi5QI62uNgATIwCCmOMhBgACwKhAz88ZF7dROL9T4kx2z8E2ARMjAIKk+0IIEZTtisloWTSNU3fMKzIAAHBQd8yrMwBPIwCIOgANAAAAAABDCBGU7YrJaFk0p9N3zCsyAABwUHfMqzMATyMAiCy5PcAJAADAJAMAWwB+OXfMKzPvSDhQ2yoxNTLDIwAFQJwAMQQnirAPoTPvSDhQsA8hNTLfIgAAetR3zCsz70g4ULAPITUy3yIAAQAx0Vx84TPvSDhQRR0pNTIDJwADTJwAsw93zKsz70g4UEUdKTUyAycAg0+cACW
1.452s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0
from flux-core.
Wow, nice find!
from flux-core.
Eh except I must've shoved too much into wc
when I got 1053. The puts are actually logged as 1024 also. There might still be truncation elsewhere though as that's a mighty suspicious number.
from flux-core.
I'm likely barking up the wrong tree here. I think what openmpi is doing is checking the value of the max value and splitting up large values over multiple keys. For example on successful runs it commonly does stuff like this where the first value is 1024 (the max) and the second one is less.
2.176s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fFTnifWX key=0-1-key0 value=cG1peC5jcHVzZXQAMDMAMDAwNQAwLTQ3AHBtaXguaG5hbWUAMDMAMDAwOQBjb3JvbmE4MgBidGwub3BlbmliLjQuMQAxNAAwMDUwAAIAAAAAAACA/vIAAAAFAAAAyQIAABsQAAAAAQM/GArTAADyAAEA/oAAAAAAAACYA5sDAGevDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbXRsLm9maS40LjEAMTQAMDAyMAADHdQA8gAAAAAAAAAAAID+DK9nAAObA5gAAAAAAAAAAHBtbC51Y3guNC4xADE0ADAxNzkAIKuI7ZFSG8YuIRIBAi62t6Wf/VHdmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCC4lUhBgACwKiAUs8ZF7dROAoaLlAjra42ARMjAILnryEGAAKGCUcUzxkXt1E4bdGwUGaUrjYAEyMAgsPXIQYAAsCoQFLPGRe3UTi/U+JMds/BNgETIwCCvOUkAwDyAH45d8wrM+9IOFDbKjE1MsMjAAUz1AAzBCeKsA+hM+9IOFCwDyE1Mt8iAAB61HfMKzPvSDhQsA8hNTLfIgABADHRXHzhM+9IOFBFHSk1MgMnAANO1ACzD3fMqzPvSDhQRR0pNTIDJwCDUdQAJdICSgAAAAAAAAAAAAD//4YJRxQDfjl3zCszXSBSUMDaZjUywyMABYrkAHAEJ4qwD6EzXSBSUJW/VjUy3yIAAHrUd8wrM10gUlCVv1Y1Mt8iAAEAMdFcfOEzXSBSUCrNXjUyAycAA7rkALMPd8yrM10gUlAqzV41MgMnAIO95ABwbWwudWN4LjQuMQAxNAAwMjAxACCriO2RUhvGLiAAY3V3zCsyAPjXTwAAAAAATwMAiOtsPc4eezIuIRIBAi62t6Wf/VHdmAAA8AAAAADPGRe3UThTnj5L1+A3NwETIwCC4lUhBgACwKiAUs8ZF7dROAoaLlAjra42ARMjAILnryEGAAKGCUcUzxkXt1E4bdGwUGaUrjYAEyMAgsPXIQYAAsC
2.176s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0
2.183s: flux-shell[1]: TRACE: pmi-simple: 1: C: cmd=put kvsname=fFTnifWX key=0-1-key1 value=oQFLPGRe3UTi/U+JMds/BNgETIwCCvOVCCC62t6Wf/VFdjVN3zCsyAABwUHfMqzMATyMAiAYAKAAAAAAAQwgutreln/1RXafTd8wrMgAAcFB3zKszAE8jAIhOchdADAAAwCQDAPIAfjl3zCsz70g4UNsqMTUywyMABTPUADMEJ4qwD6Ez70g4ULAPITUy3yIAAHrUd8wrM+9IOFCwDyE1Mt8iAAEAMdFcfOEz70g4UEUdKTUyAycAA07UALMPd8yrM+9IOFBFHSk1MgMnAINR1AAlUgJKAAAAAAAAAAAAAP//hglHFAN+OXfMKzNdIFJQwNpmNTLDIwAFiuQAcAQnirAPoTNdIFJQlb9WNTLfIgAAetR3zCszXSBSUJW/VjUy3yIAAQAx0Vx84TNdIFJQKs1eNTIDJwADuuQAsw93zKszXSBSUCrNXjUyAycAg73kACaILra3pZ/9UV1Hi703BjYAJC5Qd8yrMwCSIQCETnIXAA -
2.183s: flux-shell[1]: TRACE: pmi-simple: 1: S: cmd=put_result rc=0
from flux-core.
Looking at the ompi source referenced above it seems that opal_common_ucx_del_procs()
first calls opal_common_ucx_del_procs_nofence()
, then calls the opal_common_ucx_mca_pmix_fence()
(which executes the pmi barrier). Traces show one rank that made it all the way to the PMI barrier, and the other ranks stuck in opal_common_ucx_del_procs_nofence()
-> opal_common_ucx_wait_all_requests()
.
So basically if the latter ever finishes, the straggler rank will enter the PMI barrier.
Turning on some debug and reproducing the hang (and snipping out all the ucx debug except the disconnect stuff):
$ flux run --label-io -N3 --env=OMPI_MCA_pml_ucx_verbose=3 ./hello
0: f2AZvtYab: completed MPI_Init in 1.384s. There are 3 tasks
0: f2AZvtYab: completed first barrier in 0.015s
0: [corona188:3967920] common_ucx.c:440 disconnecting from rank 0
0: [corona188:3967920] common_ucx.c:440 disconnecting from rank 1
0: [corona188:3967920] common_ucx.c:404 waiting for 1 disconnect requests
2: [corona198:65489] common_ucx.c:440 disconnecting from rank 2
2: [corona198:65489] common_ucx.c:440 disconnecting from rank 0
0: [corona188:3967920] common_ucx.c:440 disconnecting from rank 2
2: [corona198:65489] common_ucx.c:404 waiting for 1 disconnect requests
0: [corona188:3967920] common_ucx.c:404 waiting for 1 disconnect requests
1: [corona189:3811294] common_ucx.c:440 disconnecting from rank 0
2: [corona198:65489] common_ucx.c:404 waiting for 0 disconnect requests
1: [corona189:3811294] common_ucx.c:404 waiting for 1 disconnect requests
1: [corona189:3811294] common_ucx.c:440 disconnecting from rank 1
1: [corona189:3811294] common_ucx.c:404 waiting for 0 disconnect requests
[hang]
Looking at the ranks with gdb:
- rank 0 is stuck in
opal_common_ucx_wait_all_requests ()
- rank 1 is stuck in
pmi_simple_client_barrier ()
- rank 2 is stuck in
pmi_simple_client_barrier ()
Here's the same test run successfully with pmix, just the disconnect messages
0: [corona188:3968360] common_ucx.c:440 disconnecting from rank 0
0: [corona188:3968360] common_ucx.c:440 disconnecting from rank 1
0: [corona188:3968360] common_ucx.c:404 waiting for 1 disconnect requests
2: [corona198:65924] common_ucx.c:440 disconnecting from rank 2
0: [corona188:3968360] common_ucx.c:440 disconnecting from rank 2
0: [corona188:3968360] common_ucx.c:404 waiting for 1 disconnect requests
1: [corona189:3811731] common_ucx.c:440 disconnecting from rank 0
0: [corona188:3968360] common_ucx.c:404 waiting for 0 disconnect requests
2: [corona198:65924] common_ucx.c:440 disconnecting from rank 0
1: [corona189:3811731] common_ucx.c:404 waiting for 1 disconnect requests
2: [corona198:65924] common_ucx.c:404 waiting for 1 disconnect requests
1: [corona189:3811731] common_ucx.c:440 disconnecting from rank 1
2: [corona198:65924] common_ucx.c:404 waiting for 0 disconnect requests
1: [corona189:3811731] common_ucx.c:404 waiting for 0 disconnect requests
As far as I can tell, the only difference is that in the failing trace, rank 0 never gets down to "waiting for 0 disconnect requests" even though it would appear that the other two ranks have sent them.
from flux-core.
As noted above, each rank calls opal_common_ucx_del_procs()
, which sequentially calls
opal_common_ucx_del_procs_nofence()
opal_common_ucx_mca_pmix_fence()
Looking a bit deeper
opal_common_ucx_del_procs_nofence()
sends disconnect requests withucp_disconnect_nb()
opal_common_ucx_mca_pmix_fence()
enters the PMI barrier by callingopal_pmix.fence_nb()
- after calling
fence_nb()
a "progress loop" is entered.
The _nb versions of these functions are supposed to be non blocking, and UCX is making full use of this to concurrently wait for the barrier and get the disconnect requests sent in the progress loop.
However, the flux MCA plugin in openmpi implements fence_nb()
in a blocking manner because it's based on PMI-1 and PMI-1 does not offer non-blocking interfaces. So a theory is that some disconnect messages don't make it out because the (unfortunately blocking) fence_nb()
is called before the progress loop can process a queue of outbound messages.
Reworking the flux MCA plugin in openmpi to make the fence_nb()
truly non-blocking is a nontrivial effort, and once complete would require rebuilding all the ompis. Given that and the fact that ompi dropped the flux MCA plugin in v5 and requires all launchers to implement pmix, running ompi + UCX with under flux with -o pmi=pmix
really is the preferred solution to this.
from flux-core.
Related Issues (20)
- valgrind reports new leaks with hwloc 2.10 HOT 2
- broker crash in `content_cache_destroy` HOT 2
- `job-ingest: fluid_init failed` when launching 16 brokers per node on 1040 nodes HOT 7
- discussion: go bindings for flux-core HOT 3
- Cray MPI launches as singletons on frontier HOT 10
- shell: support dumping hwloc XML to `HWLOC_XMLFILE`
- Need a bulk submission tool for `flux batch`
- shell taskmap `block` scheme ignores its arguments
- flux-job: support `MPIR_executable_path` and `MPIR_server_arguments` in attach HOT 1
- Flux RADIUSS Tutorial Discussion Issue HOT 3
- liboptparse segfaults with duplicate subcommand option table entries
- docker-run-systest.sh does not work anymore
- nodes are drained when a user aborts a run request with prolog running HOT 11
- flux-shell: ERROR: output: shell_output_write: Function not implemented
- when a user aborts a job early, the prolog script may get SIGTERM
- make all jobs "waitable" HOT 1
- tracking issue: standby/preemptible jobs HOT 2
- idea: use host constraint for queues instead of properties
- TOSS 4 non-TCE openmpi: Failed to open drm root directory /sys/class/drm.: No such file or directory HOT 3
- job shell blocks at exit in degraded job HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from flux-core.