Comments (15)

wenduwan commented on August 17, 2024

@sdonoso We have ingested multiple runtime fixes since 5.0.1. Can you reproduce on 5.0.3?

For example, we fixed this a while ago in #12064.

from ompi.

sdonoso commented on August 17, 2024

I am using version 5.0.3

rene@puente:~/nccl-tests$ mpirun --version
mpirun (Open MPI) 5.0.3

wenduwan commented on August 17, 2024

Thanks. I updated the issue title.

janjust commented on August 17, 2024

@sdonoso Any chance your LD_LIBRARY_PATH isn't propagated to the other node?
Could you add your MPI libs to the path and forward it via -x LD_LIBRARY_PATH?
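
A minimal sketch of the suggestion above, for readers following along. The helper name `prepend_lib_dir` is hypothetical, and the library directory is an assumption based on the install prefix that appears in the logs of this thread; adjust both to your setup. The script only prints the launch command, so it is safe to run on a machine without Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to a colon-separated search
# path without leaving a trailing ':' when the existing path is empty.
prepend_lib_dir() {
    dir=$1
    current=$2
    case $dir in
        /*) ;;  # absolute path, nothing to warn about
        *)  echo "warning: '$dir' is not an absolute path" >&2 ;;
    esac
    if [ -n "$current" ]; then
        printf '%s:%s\n' "$dir" "$current"
    else
        printf '%s\n' "$dir"
    fi
}

# Assumed Open MPI library directory; change it to match your install.
OMPI_LIB=/usr/local/openmpi/lib
new_path=$(prepend_lib_dir "$OMPI_LIB" "${LD_LIBRARY_PATH:-}")

# Print the command instead of executing it. The -x option asks mpirun
# to export the named variable to the launched processes.
echo "mpirun -x LD_LIBRARY_PATH=$new_path -hostfile hostfile -np 2 ./hello_world"
```

The absolute-path warning is there because a relative entry in LD_LIBRARY_PATH is normally resolved against the current working directory, which is rarely what is intended.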

sdonoso commented on August 17, 2024

What do you mean by the LD_LIBRARY_PATH not being propagated?

rene@puente:~/nccl-tests$ mpirun -x  LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -hostfile hostfile -np 2 --mca pml ucx  --map-by ppr:1:node ./hello_world
Hello world from rank 1 out of 2 processors

I have the same result

rene@puente:~/nccl-tests$ mpirun -x  LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[puente:218375] mca: base: component_find: searching NULL for plm components
[puente:218375] mca: base: find_dyn_components: checking NULL for plm components
[puente:218375] pmix:mca: base: components_register: registering framework plm components
[puente:218375] pmix:mca: base: components_register: found loaded component slurm
[puente:218375] pmix:mca: base: components_register: component slurm register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component ssh
[puente:218375] pmix:mca: base: components_register: component ssh register function successful
[puente:218375] mca: base: components_open: opening plm components
[puente:218375] mca: base: components_open: found loaded component slurm
[puente:218375] mca: base: components_open: component slurm open function successful
[puente:218375] mca: base: components_open: found loaded component ssh
[puente:218375] mca: base: components_open: component ssh open function successful
[puente:218375] mca:base:select: Auto-selecting plm components
[puente:218375] mca:base:select:(  plm) Querying component [slurm]
[puente:218375] mca:base:select:(  plm) Querying component [ssh]
[puente:218375] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:218375] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[puente:218375] mca:base:select:(  plm) Selected component [ssh]
[puente:218375] mca: base: close: component slurm closed
[puente:218375] mca: base: close: unloading component slurm
[puente:218375] [prterun-puente-218375@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive start comm
[puente:218375] mca: base: component_find: searching NULL for rmaps components
[puente:218375] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:218375] pmix:mca: base: components_register: registering framework rmaps components
[puente:218375] pmix:mca: base: components_register: found loaded component ppr
[puente:218375] pmix:mca: base: components_register: component ppr register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component rank_file
[puente:218375] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:218375] pmix:mca: base: components_register: found loaded component round_robin
[puente:218375] pmix:mca: base: components_register: component round_robin register function successful
[puente:218375] pmix:mca: base: components_register: found loaded component seq
[puente:218375] pmix:mca: base: components_register: component seq register function successful
[puente:218375] mca: base: components_open: opening rmaps components
[puente:218375] mca: base: components_open: found loaded component ppr
[puente:218375] mca: base: components_open: component ppr open function successful
[puente:218375] mca: base: components_open: found loaded component rank_file
[puente:218375] mca: base: components_open: found loaded component round_robin
[puente:218375] mca: base: components_open: component round_robin open function successful
[puente:218375] mca: base: components_open: found loaded component seq
[puente:218375] mca: base: components_open: component seq open function successful
[puente:218375] mca:rmaps:select: checking available component ppr
[puente:218375] mca:rmaps:select: Querying component [ppr]
[puente:218375] mca:rmaps:select: checking available component rank_file
[puente:218375] mca:rmaps:select: Querying component [rank_file]
[puente:218375] mca:rmaps:select: checking available component round_robin
[puente:218375] mca:rmaps:select: Querying component [round_robin]
[puente:218375] mca:rmaps:select: checking available component seq
[puente:218375] mca:rmaps:select: Querying component [seq]
[puente:218375] [prterun-puente-218375@0,0]: Final mapper priorities
[puente:218375] 	Mapper: rank_file Priority: 100
[puente:218375] 	Mapper: ppr Priority: 90
[puente:218375] 	Mapper: seq Priority: 60
[puente:218375] 	Mapper: round_robin Priority: 10
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm creating map
[puente:218375] [prterun-puente-218375@0,0] setup:vm: working unmanaged allocation
[puente:218375] [prterun-puente-218375@0,0] using default hostfile /usr/local/openmpi/etc/prte-default-hostfile
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm only HNP in allocation
[puente:218375] [prterun-puente-218375@0,0] plm:base:setting slots for node puente by core

======================   ALLOCATED NODES   ======================
    puente: slots=1 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
	aliases: puente
=================================================================

======================   ALLOCATED NODES   ======================
    puente: slots=128 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: puente
=================================================================
[puente:218375] [prterun-puente-218375@0,0] rmaps:base set policy with ppr:1:node
[puente:218375] [prterun-puente-218375@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive processing msg
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive job launch command from [prterun-puente-218375@0,0]
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive adding hosts
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive calling spawn
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive done processing commands
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_job
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm
[puente:218375] [prterun-puente-218375@0,0] plm:base:setup_vm no new daemons required
[puente:218375] mca:rmaps: mapping job prterun-puente-218375@1
[puente:218375] mca:rmaps: setting mapping policies for job prterun-puente-218375@1 inherit TRUE hwtcpus FALSE
[puente:218375] [prterun-puente-218375@0,0] using known nodes
[puente:218375] [prterun-puente-218375@0,0] Starting with 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] Filtering thru apps
[puente:218375] [prterun-puente-218375@0,0] Retained 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] node puente has 128 slots available
[puente:218375] AVAILABLE NODES FOR MAPPING:
[puente:218375]     node: puente daemon: 0 slots_available: 128

[puente:218375] setdefaultbinding[366] binding not given - using bycore
======================   ALLOCATED NODES   ======================
[puente:218375] mca:rmaps:rf: job prterun-puente-218375@1 not using rankfile policy
    puente: slots=128 max_slots=0 slots_inuse=0 state=UP
[puente:218375] mca:rmaps:ppr: mapping job prterun-puente-218375@1 with ppr 1:node
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:218375] mca:rmaps:ppr: job prterun-puente-218375@1 assigned policy BYNODE:SLOT
	aliases: puente
[puente:218375] [prterun-puente-218375@0,0] using known nodes
=================================================================
[puente:218375] [prterun-puente-218375@0,0] Starting with 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] Filtering thru apps
[puente:218375] [prterun-puente-218375@0,0] Retained 1 nodes in list
[puente:218375] [prterun-puente-218375@0,0] node puente has 128 slots available
[puente:218375] AVAILABLE NODES FOR MAPPING:
[puente:218375]     node: puente daemon: 0 slots_available: 128
[puente:218375] [prterun-puente-218375@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:218375] mca:rmaps: compute bindings for job prterun-puente-218375@1 with policy CORE:IF-SUPPORTED[1007]
[puente:218375] mca:rmaps: bind [prterun-puente-218375@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:218375] [prterun-puente-218375@0,0] BOUND PROC [prterun-puente-218375@1,INVALID][puente] TO package[0][core:0]
[puente:218375] [prterun-puente-218375@0,0] complete_setup on job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:launch_apps for job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:send launch msg for job prterun-puente-218375@1
[puente:218375] [prterun-puente-218375@0,0] plm:base:launch wiring up iof for job prterun-puente-218375@1
puente
[puente:218375] [prterun-puente-218375@0,0] plm:base:prted_cmd sending prted_exit commands
[puente:218375] [prterun-puente-218375@0,0] plm:base:receive stop comm
[puente:218375] mca: base: close: component ssh closed
[puente:218375] mca: base: close: unloading component ssh

janjust commented on August 17, 2024

Sometimes (I don't remember the exact circumstances), if LD_LIBRARY_PATH is not forwarded to the other nodes, hangs and other weird behavior are possible, so ruling that out is usually the first thing I try.
Looks like your issue is something else.

rhc54 commented on August 17, 2024

The problem is here: -np 2 --mca pml ucx --map-by ppr:1:node

You only have one node in your system, and you tell us to launch 1 process/node - but ask us to launch TWO procs. Logically impossible. We should have immediately error'd out, so that's the bug - but this cmd cannot succeed.
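
The arithmetic behind this can be sketched as a small shell check. This is only an illustration of the reasoning, not how the mapper is implemented: the helper name `ppr_fits` is hypothetical, and it assumes the capacity of a `ppr:N:node` mapping is simply N times the number of nodes, with no oversubscription.

```shell
#!/bin/sh
# Hypothetical helper: succeed when <np> processes fit under a
# "ppr:<ppn>:node" mapping on <nodes> nodes (capacity = nodes * ppn).
ppr_fits() {
    np=$1
    nodes=$2
    ppn=$3
    [ "$np" -le $((nodes * ppn)) ]
}

# Report whether a given combination can be mapped.
check() {
    if ppr_fits "$1" "$2" "$3"; then
        echo "np=$1 nodes=$2 ppr:$3:node -> fits"
    else
        echo "np=$1 nodes=$2 ppr:$3:node -> cannot be mapped"
    fi
}

check 2 1 1   # one node, one process per node, two processes requested
check 2 2 1   # two nodes, one process per node, two processes requested
```

With one node the first case cannot be mapped, while the same request on two nodes fits.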

janjust commented on August 17, 2024

He has a hostfile in his previous command; I assumed two nodes are listed in it.

rhc54 commented on August 17, 2024

Yeah, it's nearly impossible to triage this one. The cmds keep varying, some are inconsistent with the reported output, etc. Probably need to ask that the user be more careful in what they are reporting.

sdonoso commented on August 17, 2024

I have two nodes connected by InfiniBand, and I can also ssh between the nodes without a password prompt.

rhc54 commented on August 17, 2024

Your reported debug output shows only ONE node in your allocation:

======================   ALLOCATED NODES   ======================
    puente: slots=1 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
	aliases: puente
=================================================================

Hence the confusion. I think you are perhaps not being careful in showing the results from what is probably a bunch of runs, and the output doesn't always match the posted cmd.
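
One way to avoid this kind of mismatch is to check how many hosts the hostfile actually lists before posting a run. The sketch below is an assumption-laden illustration: `count_hosts` is a hypothetical helper, and the sample hostfile contents are a guess built from the addresses and slot counts seen elsewhere in this thread, not the user's real file.

```shell
#!/bin/sh
# Hypothetical helper: count the distinct hosts named in a hostfile.
# Comments (starting with '#') and blank lines are ignored; the host
# name is taken to be the first field of each remaining line.
count_hosts() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }' | sort -u | wc -l | tr -d ' '
}

# Assumed example hostfile, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# example hostfile (contents assumed for illustration)
146.155.155.83 slots=8
146.155.155.84 slots=8
EOF

echo "hosts listed: $(count_hosts "$tmp")"
rm -f "$tmp"
```

If the count is lower than the number of nodes expected in the ALLOCATED NODES block, the hostfile was probably not passed or does not list every node.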

sdonoso commented on August 17, 2024

Sorry, I forgot to pass the hostfile.

mpirun -x  LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -hostfile nccl-tests/hostfile  --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc hostname
[puente:220398] mca: base: component_find: searching NULL for plm components
[puente:220398] mca: base: find_dyn_components: checking NULL for plm components
[puente:220398] pmix:mca: base: components_register: registering framework plm components
[puente:220398] pmix:mca: base: components_register: found loaded component slurm
[puente:220398] pmix:mca: base: components_register: component slurm register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component ssh
[puente:220398] pmix:mca: base: components_register: component ssh register function successful
[puente:220398] mca: base: components_open: opening plm components
[puente:220398] mca: base: components_open: found loaded component slurm
[puente:220398] mca: base: components_open: component slurm open function successful
[puente:220398] mca: base: components_open: found loaded component ssh
[puente:220398] mca: base: components_open: component ssh open function successful
[puente:220398] mca:base:select: Auto-selecting plm components
[puente:220398] mca:base:select:(  plm) Querying component [slurm]
[puente:220398] mca:base:select:(  plm) Querying component [ssh]
[puente:220398] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:220398] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[puente:220398] mca:base:select:(  plm) Selected component [ssh]
[puente:220398] mca: base: close: component slurm closed
[puente:220398] mca: base: close: unloading component slurm
[puente:220398] [prterun-puente-220398@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive start comm
[puente:220398] mca: base: component_find: searching NULL for rmaps components
[puente:220398] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:220398] pmix:mca: base: components_register: registering framework rmaps components
[puente:220398] pmix:mca: base: components_register: found loaded component ppr
[puente:220398] pmix:mca: base: components_register: component ppr register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component rank_file
[puente:220398] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:220398] pmix:mca: base: components_register: found loaded component round_robin
[puente:220398] pmix:mca: base: components_register: component round_robin register function successful
[puente:220398] pmix:mca: base: components_register: found loaded component seq
[puente:220398] pmix:mca: base: components_register: component seq register function successful
[puente:220398] mca: base: components_open: opening rmaps components
[puente:220398] mca: base: components_open: found loaded component ppr
[puente:220398] mca: base: components_open: component ppr open function successful
[puente:220398] mca: base: components_open: found loaded component rank_file
[puente:220398] mca: base: components_open: found loaded component round_robin
[puente:220398] mca: base: components_open: component round_robin open function successful
[puente:220398] mca: base: components_open: found loaded component seq
[puente:220398] mca: base: components_open: component seq open function successful
[puente:220398] mca:rmaps:select: checking available component ppr
[puente:220398] mca:rmaps:select: Querying component [ppr]
[puente:220398] mca:rmaps:select: checking available component rank_file
[puente:220398] mca:rmaps:select: Querying component [rank_file]
[puente:220398] mca:rmaps:select: checking available component round_robin
[puente:220398] mca:rmaps:select: Querying component [round_robin]
[puente:220398] mca:rmaps:select: checking available component seq
[puente:220398] mca:rmaps:select: Querying component [seq]
[puente:220398] [prterun-puente-220398@0,0]: Final mapper priorities
[puente:220398] 	Mapper: rank_file Priority: 100
[puente:220398] 	Mapper: ppr Priority: 90
[puente:220398] 	Mapper: seq Priority: 60
[puente:220398] 	Mapper: round_robin Priority: 10
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm creating map
[puente:220398] [prterun-puente-220398@0,0] setup:vm: working unmanaged allocation
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:220398] [prterun-puente-220398@0,0] checking node 146.155.155.83
[puente:220398] [prterun-puente-220398@0,0] ignoring myself
[puente:220398] [prterun-puente-220398@0,0] checking node 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm add new daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm assigning new daemon [prterun-puente-220398@0,1] to node 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: launching vm

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 146.155.155.83
    146.155.155.84: slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: SLOTS_GIVEN
	aliases: NONE
=================================================================
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: local shell: 0 (bash)
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: assuming same remote shell as local shell
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: remote shell: 0 (bash)
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: final template argv:
	/usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-220398@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://146.155.155.83:42011:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://146.155.155.83:42011:28"
[puente:220398] [prterun-puente-220398@0,0] plm:ssh:launch daemon 0 not a child of mine
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: adding node 146.155.155.84 to launch list
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: activating launch event
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: recording launch of daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 146.155.155.84 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-220398@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://146.155.155.83:42011:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://146.155.155.83:42011:28"]
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch from daemon [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch from daemon [prterun-puente-220398@0,1] on node kalila
[puente:220398] ALIASES FOR NODE kalila (kalila)
[puente:220398] 	ALIAS: 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] RECEIVED TOPOLOGY SIG 2N:2S:16L3:128L2:128L1:128C:255H:0-254:0-255:x86_64:le FROM NODE kalila
[puente:220398] [prterun-puente-220398@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch completed for daemon [prterun-puente-220398@0,1] at contact [email protected];tcp://146.155.155.84:42405:28
[puente:220398] [prterun-puente-220398@0,0] plm:base:orted_report_launch job prterun-puente-220398@0 recvd 2 of 2 reported daemons

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 146.155.155.83
    kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 146.155.155.84
=================================================================
[puente:220398] [prterun-puente-220398@0,0] rmaps:base set policy with ppr:1:node
[puente:220398] [prterun-puente-220398@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive job launch command from [prterun-puente-220398@0,0]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive adding hosts
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive calling spawn
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_job
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm
[puente:220398] [prterun-puente-220398@0,0] plm_base:setup_vm NODE kalila WAS NOT ADDED
[puente:220398] [prterun-puente-220398@0,0] plm:base:setup_vm no new daemons required
[puente:220398] mca:rmaps: mapping job prterun-puente-220398@1
[puente:220398] mca:rmaps: setting mapping policies for job prterun-puente-220398@1 inherit TRUE hwtcpus FALSE
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
[puente:220398] NODE puente DOESNT MATCH NODE 146.155.155.84
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:220398] [prterun-puente-220398@0,0] node puente has 8 slots available
	aliases: 146.155.155.83
[puente:220398] [prterun-puente-220398@0,0] node kalila has 8 slots available
    kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
[puente:220398] AVAILABLE NODES FOR MAPPING:
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:220398]     node: puente daemon: 0 slots_available: 8
	aliases: 146.155.155.84
[puente:220398]     node: kalila daemon: 1 slots_available: 8
=================================================================
[puente:220398] setdefaultbinding[366] binding not given - using bycore
[puente:220398] mca:rmaps:rf: job prterun-puente-220398@1 not using rankfile policy
[puente:220398] mca:rmaps:ppr: mapping job prterun-puente-220398@1 with ppr 1:node
[puente:220398] mca:rmaps:ppr: job prterun-puente-220398@1 assigned policy BYNODE:SLOT
[puente:220398] [prterun-puente-220398@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:220398] NODE puente DOESNT MATCH NODE 146.155.155.84
[puente:220398] [prterun-puente-220398@0,0] node puente has 8 slots available
[puente:220398] [prterun-puente-220398@0,0] node kalila has 8 slots available
[puente:220398] AVAILABLE NODES FOR MAPPING:
[puente:220398]     node: puente daemon: 0 slots_available: 8
[puente:220398]     node: kalila daemon: 1 slots_available: 8
[puente:220398] [prterun-puente-220398@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:220398] mca:rmaps: compute bindings for job prterun-puente-220398@1 with policy CORE:IF-SUPPORTED[1007]
[puente:220398] mca:rmaps: bind [prterun-puente-220398@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:220398] [prterun-puente-220398@0,0] BOUND PROC [prterun-puente-220398@1,INVALID][puente] TO package[0][core:0]
[puente:220398] [prterun-puente-220398@0,0] get_avail_ncpus: node kalila has 0 procs on it
[puente:220398] mca:rmaps: compute bindings for job prterun-puente-220398@1 with policy CORE:IF-SUPPORTED[1007]
[puente:220398] mca:rmaps: bind [prterun-puente-220398@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:220398] [prterun-puente-220398@0,0] BOUND PROC [prterun-puente-220398@1,INVALID][kalila] TO package[0][core:0]
[puente:220398] [prterun-puente-220398@0,0] complete_setup on job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:launch_apps for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:send launch msg for job prterun-puente-220398@1
puente
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive local launch complete command from [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for vpid 1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:launch wiring up iof for job prterun-puente-220398@1
kalila
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive processing msg
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive update proc state command from [prterun-puente-220398@0,1]
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got update_proc_state for job prterun-puente-220398@1
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive got update_proc_state for vpid 1 pid 577327 state NORMALLY TERMINATED exit_code 0
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive done processing commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:prted_cmd sending prted_exit commands
[puente:220398] [prterun-puente-220398@0,0] plm:base:receive stop comm
[puente:220398] mca: base: close: component ssh closed
[puente:220398] mca: base: close: unloading component ssh

sdonoso commented on August 17, 2024

And here is the output of hello_world:

rene@puente:~/nccl-tests$ mpirun -x  LD_LIBRARY_PATH='usr/local/openmpi/lib':$LD_LIBRARY_PATH -np 2 -hostfile hostfile  --map-by ppr:1:node --prtemca plm_base_verbose 100 --prtemca rmaps_base_verbose 100 --display alloc ./hello_world
[puente:222235] mca: base: component_find: searching NULL for plm components
[puente:222235] mca: base: find_dyn_components: checking NULL for plm components
[puente:222235] pmix:mca: base: components_register: registering framework plm components
[puente:222235] pmix:mca: base: components_register: found loaded component slurm
[puente:222235] pmix:mca: base: components_register: component slurm register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component ssh
[puente:222235] pmix:mca: base: components_register: component ssh register function successful
[puente:222235] mca: base: components_open: opening plm components
[puente:222235] mca: base: components_open: found loaded component slurm
[puente:222235] mca: base: components_open: component slurm open function successful
[puente:222235] mca: base: components_open: found loaded component ssh
[puente:222235] mca: base: components_open: component ssh open function successful
[puente:222235] mca:base:select: Auto-selecting plm components
[puente:222235] mca:base:select:(  plm) Querying component [slurm]
[puente:222235] mca:base:select:(  plm) Querying component [ssh]
[puente:222235] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[puente:222235] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[puente:222235] mca:base:select:(  plm) Selected component [ssh]
[puente:222235] mca: base: close: component slurm closed
[puente:222235] mca: base: close: unloading component slurm
[puente:222235] [prterun-puente-222235@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive start comm
[puente:222235] mca: base: component_find: searching NULL for rmaps components
[puente:222235] mca: base: find_dyn_components: checking NULL for rmaps components
[puente:222235] pmix:mca: base: components_register: registering framework rmaps components
[puente:222235] pmix:mca: base: components_register: found loaded component ppr
[puente:222235] pmix:mca: base: components_register: component ppr register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component rank_file
[puente:222235] pmix:mca: base: components_register: component rank_file has no register or open function
[puente:222235] pmix:mca: base: components_register: found loaded component round_robin
[puente:222235] pmix:mca: base: components_register: component round_robin register function successful
[puente:222235] pmix:mca: base: components_register: found loaded component seq
[puente:222235] pmix:mca: base: components_register: component seq register function successful
[puente:222235] mca: base: components_open: opening rmaps components
[puente:222235] mca: base: components_open: found loaded component ppr
[puente:222235] mca: base: components_open: component ppr open function successful
[puente:222235] mca: base: components_open: found loaded component rank_file
[puente:222235] mca: base: components_open: found loaded component round_robin
[puente:222235] mca: base: components_open: component round_robin open function successful
[puente:222235] mca: base: components_open: found loaded component seq
[puente:222235] mca: base: components_open: component seq open function successful
[puente:222235] mca:rmaps:select: checking available component ppr
[puente:222235] mca:rmaps:select: Querying component [ppr]
[puente:222235] mca:rmaps:select: checking available component rank_file
[puente:222235] mca:rmaps:select: Querying component [rank_file]
[puente:222235] mca:rmaps:select: checking available component round_robin
[puente:222235] mca:rmaps:select: Querying component [round_robin]
[puente:222235] mca:rmaps:select: checking available component seq
[puente:222235] mca:rmaps:select: Querying component [seq]
[puente:222235] [prterun-puente-222235@0,0]: Final mapper priorities
[puente:222235] 	Mapper: rank_file Priority: 100
[puente:222235] 	Mapper: ppr Priority: 90
[puente:222235] 	Mapper: seq Priority: 60
[puente:222235] 	Mapper: round_robin Priority: 10
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm creating map
[puente:222235] [prterun-puente-222235@0,0] setup:vm: working unmanaged allocation
[puente:222235] [prterun-puente-222235@0,0] using hostfile /home/rene/nccl-tests/hostfile

[puente:222235] [prterun-puente-222235@0,0] checking node 146.155.155.83
======================   ALLOCATED NODES   ======================
[puente:222235] [prterun-puente-222235@0,0] ignoring myself
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
[puente:222235] [prterun-puente-222235@0,0] checking node 146.155.155.84
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm add new daemon [prterun-puente-222235@0,1]
	aliases: 146.155.155.83
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm assigning new daemon [prterun-puente-222235@0,1] to node 146.155.155.84
    146.155.155.84: slots=8 max_slots=0 slots_inuse=0 state=UNKNOWN
	Flags: SLOTS_GIVEN
	aliases: NONE
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: launching vm
=================================================================
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: local shell: 0 (bash)
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: assuming same remote shell as local shell
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: remote shell: 0 (bash)
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: final template argv:
	/usr/bin/ssh <template> PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-222235@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://146.155.155.83:39027:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://146.155.155.83:39027:28"
[puente:222235] [prterun-puente-222235@0,0] plm:ssh:launch daemon 0 not a child of mine
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: adding node 146.155.155.84 to launch list
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: activating launch event
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: recording launch of daemon [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh 146.155.155.84 PRTE_PREFIX=/usr/local/openmpi;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/openmpi/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/openmpi/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-puente-222235@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://146.155.155.83:39027:28" --prtemca plm_base_verbose "100" --prtemca rmaps_base_verbose "100" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://146.155.155.83:39027:28"]
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch from daemon [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch from daemon [prterun-puente-222235@0,1] on node kalila
[puente:222235] ALIASES FOR NODE kalila (kalila)
[puente:222235] 	ALIAS: 146.155.155.84
[puente:222235] [prterun-puente-222235@0,0] RECEIVED TOPOLOGY SIG 2N:2S:16L3:128L2:128L1:128C:255H:0-254:0-255:x86_64:le FROM NODE kalila
[puente:222235] [prterun-puente-222235@0,0] NEW TOPOLOGY - ADDING SIGNATURE
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch completed for daemon [prterun-puente-222235@0,1] at contact [email protected];tcp://146.155.155.84:33749:28
[puente:222235] [prterun-puente-222235@0,0] plm:base:orted_report_launch job prterun-puente-222235@0 recvd 2 of 2 reported daemons

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 146.155.155.83
    kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 146.155.155.84
=================================================================
[puente:222235] [prterun-puente-222235@0,0] rmaps:base set policy with ppr:1:node
[puente:222235] [prterun-puente-222235@0,0] rmaps:base policy ppr modifiers 1:node provided
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive job launch command from [prterun-puente-222235@0,0]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive adding hosts
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive calling spawn
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_job
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm
[puente:222235] [prterun-puente-222235@0,0] plm_base:setup_vm NODE kalila WAS NOT ADDED
[puente:222235] [prterun-puente-222235@0,0] plm:base:setup_vm no new daemons required
[puente:222235] mca:rmaps: mapping job prterun-puente-222235@1
[puente:222235] mca:rmaps: setting mapping policies for job prterun-puente-222235@1 inherit TRUE hwtcpus FALSE
[puente:222235] setdefaultbinding[366] binding not given - using bycore
[puente:222235] mca:rmaps:rf: job prterun-puente-222235@1 not using rankfile policy
[puente:222235] mca:rmaps:ppr: mapping job prterun-puente-222235@1 with ppr 1:node
[puente:222235] mca:rmaps:ppr: job prterun-puente-222235@1 assigned policy BYNODE:SLOT
[puente:222235] [prterun-puente-222235@0,0] using hostfile /home/rene/nccl-tests/hostfile
[puente:222235] NODE puente DOESNT MATCH NODE 146.155.155.84

======================   ALLOCATED NODES   ======================
    puente: slots=8 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 146.155.155.83
    kalila: slots=8 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
	aliases: 146.155.155.84
=================================================================
[puente:222235] [prterun-puente-222235@0,0] node puente has 8 slots available
[puente:222235] [prterun-puente-222235@0,0] node kalila has 8 slots available
[puente:222235] AVAILABLE NODES FOR MAPPING:
[puente:222235]     node: puente daemon: 0 slots_available: 8
[puente:222235]     node: kalila daemon: 1 slots_available: 8
[puente:222235] [prterun-puente-222235@0,0] get_avail_ncpus: node puente has 0 procs on it
[puente:222235] mca:rmaps: compute bindings for job prterun-puente-222235@1 with policy CORE:IF-SUPPORTED[1007]
[puente:222235] mca:rmaps: bind [prterun-puente-222235@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:222235] [prterun-puente-222235@0,0] BOUND PROC [prterun-puente-222235@1,INVALID][puente] TO package[0][core:0]
[puente:222235] [prterun-puente-222235@0,0] get_avail_ncpus: node kalila has 0 procs on it
[puente:222235] mca:rmaps: compute bindings for job prterun-puente-222235@1 with policy CORE:IF-SUPPORTED[1007]
[puente:222235] mca:rmaps: bind [prterun-puente-222235@1,INVALID] with policy CORE:IF-SUPPORTED
[puente:222235] [prterun-puente-222235@0,0] BOUND PROC [prterun-puente-222235@1,INVALID][kalila] TO package[0][core:0]
[puente:222235] [prterun-puente-222235@0,0] complete_setup on job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch_apps for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:send launch msg for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive local launch complete command from [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for vpid 1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch wiring up iof for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive processing msg
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive registered command from [prterun-puente-222235@0,1]
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got registered for job prterun-puente-222235@1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got registered for vpid 1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive done processing commands
[puente:222235] [prterun-puente-222235@0,0] plm:base:launch prterun-puente-222235@1 registered
Hello world from rank 1 out of 2 processors
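
For anyone reading the verbose output above, here is a minimal sketch for listing which vpids show up in the `plm:base:receive got ...` lines. It is not part of Open MPI or PRRTE: the helper name `reported_vpids` and the regular expression are assumptions based only on the log lines pasted in this thread. In this log, only vpid 1 appears in the "local launch complete" and "registered" lines, which is consistent with the single "Hello world from rank 1" line. The log alone does not establish whether rank 0 on the head node would be reported through the same path.

```python
import re
from collections import defaultdict

# Pattern assumed from the log lines in this thread, for example:
#   "plm:base:receive got local launch complete for vpid 1"
#   "plm:base:receive got registered for vpid 1"
EVENT_RE = re.compile(
    r"plm:base:receive got (local launch complete|registered) for vpid (\d+)"
)


def reported_vpids(log_text):
    """Return {event name: sorted list of vpids} for events found in the log."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            seen[match.group(1)].add(int(match.group(2)))
    return {event: sorted(vpids) for event, vpids in seen.items()}


# Three lines copied from the log above, used as sample input.
sample = """\
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for vpid 1
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got local launch complete for vpid 1 state RUNNING
[puente:222235] [prterun-puente-222235@0,0] plm:base:receive got registered for vpid 1
"""

print(reported_vpids(sample))
```

To run it on a full capture, save the `mpirun` output to a file and pass its contents to `reported_vpids`.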


janjust commented on August 17, 2024

I'm not sure how to debug this further: I cannot reproduce it locally or on any of our other machines.


janjust commented on August 17, 2024

@sdonoso Just curious: is this specific to v5.0.x? Does it work with v4.1.x?

