azure / cyclecloud-slurm
Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
License: MIT License
With the currently supported version of Slurm, the slurmrestd service can be built by adding a few dependencies and an extra flag to the rpmbuild stage:
RUN yum install -y http-parser-devel json-c-devel
RUN rpmbuild --define "_with_slurmrestd 1" -ta ${SLURM_PKG}
How this would then be added to chef/cloud-init is unfortunately beyond me!
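Purely as an illustrative sketch (not something the project does today), the rebuilt slurmrestd sub-package could then be installed from a cluster-init script; the package name, dependency names, and blob path below are assumptions rather than confirmed details:

#!/bin/bash
# Hypothetical cluster-init sketch: install the slurmrestd runtime dependencies and the
# rebuilt slurm-slurmrestd RPM, assumed to have been staged in this project's files/ blobs.
set -e
yum install -y http-parser json-c
yum localinstall -y $CYCLECLOUD_SPEC_PATH/files/slurm-slurmrestd-*.rpm
# slurmrestd still needs an authentication mechanism (e.g. JWT) configured before use.
systemctl enable --now slurmrestd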
There are also several cloud-focused developments (such as burstbuffer/lua for staging of files, which is useful when using cloud compute resources) and commonly required features (job script capturing, parsable accounting queries, etc.) which are unavailable because they were only added in Slurm 21.08.
Is there a timeline for when the Slurm version will be bumped to the latest, and is there any scope for enabling additional features like slurmrestd?
I am using cyclecloud with a SLURM queueing system in combination with dask
(a Python package that manages the bookkeeping and task distribution such that one doesn't have to write jobscripts and collect the data.)
It's possible to use the adaptive scaling feature of dask, which means that as the load on my nodes becomes high, it automatically creates new jobs using (in my case) the following job script:
#!/bin/bash
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -e TMPDIR/dask-worker-%J.err
#SBATCH -o TMPDIR/dask-worker-%J.out
#SBATCH -p debug
#SBATCH -A WAL
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=15G
#SBATCH -t 240:00:00
JOB_ID=${SLURM_JOB_ID%;*}
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
/gscratch/home/t-banij/miniconda3/envs/py37_min/bin/python -m distributed.cli.dask_worker tcp://10.75.64.5:43267 --nthreads 1 --memory-limit 16.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60 --local-directory TMPDIR
This results in many PENDING jobs. Unfortunately, no new nodes are started automatically; I still need to go into the web interface and start new nodes by hand, rendering the autoscale feature I am using useless.
Nodes that do not start within ResumeTimeout (default is 10 minutes) enter the down~ state and do not recover on their own. I recommend enabling return_to_idle.sh on all versions of Slurm, not just <= 18.
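Until that happens, the manual recovery is to mark the affected nodes idle again so Slurm will retry them; a minimal sketch (the node range is just an example):

# bring nodes that went down~ after a failed resume back into service
scontrol update NodeName=htc-[1-2] State=IDLE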
The following files need to be updated to recognize RHEL images:
slurm/recipes/execute.rb
Line 68: change to when 'centos', 'rhel'
slurm/recipes/default.rb
Line 89: change to when 'centos', 'rhel'
slurm/attributes/default.rb
Line 11: change to when 'centos', 'rhel'
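If those three case branches literally read when 'centos' (an assumption; check the files first), the change could be applied in one pass from the repository root, for example:

# sketch: extend the platform case statements to also match 'rhel'
sed -i "s/when 'centos'/when 'centos', 'rhel'/" \
    slurm/recipes/execute.rb slurm/recipes/default.rb slurm/attributes/default.rb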
Hello,
It is common to use gres for GPU nodes. Such a setting should be added to the "pre-built" cyclecloud.conf configuration.
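As a rough sketch of what that could look like for a single-GPU VM size (the node names, counts, and device path here are illustrative assumptions, not generated output):

# cyclecloud.conf / slurm.conf node definition
Nodename=gpu-[1-2] Feature=cloud STATE=CLOUD CPUs=6 Gres=gpu:1 RealMemory=54476
# gres.conf on the nodes
Nodename=gpu-[1-2] Name=gpu Count=1 File=/dev/nvidia0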
Thanks
CycleCloud Version - 8.1.0-1275
Slurm - 19.05.8-1
Scenario:
{{{
sinfo doesn't show the newly added node, and it seems slurmctld is stuck in a restart loop:
[root@ip-0A060009 slurm]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
hpc* up infinite 2 alloc hpc-pg0-[1-2]
htc up infinite 2 idle~ htc-[1-2]
[root@ip-0A060009 slurm]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: failed (Result: exit-code) since Thu 2021-02-18 20:42:28 UTC; 3s ago
Process: 11278 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 11280 (code=exited, status=1/FAILURE)
Feb 18 20:42:28 ip-0A060009 systemd[1]: Starting Slurm controller daemon...
Feb 18 20:42:28 ip-0A060009 systemd[1]: Started Slurm controller daemon.
Feb 18 20:42:28 ip-0A060009 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Feb 18 20:42:28 ip-0A060009 systemd[1]: Unit slurmctld.service entered failed state.
Feb 18 20:42:28 ip-0A060009 systemd[1]: slurmctld.service failed.
}}}
The default slurm.conf uses AccountingStorageHost="localhost" instead of the resolvable name of the slurmdbd host (i.e. scheduler).
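In other words, on a cluster whose slurmdbd runs on the scheduler node, the generated file would be expected to contain something along these lines (a sketch; the hostname must resolve from all nodes):

AccountingStorageHost=scheduler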
There are occasions where slurmd times out checking in with slurmctld because of how long the node takes to come up before slurmd starts. Adding -b to the startup of slurmd works around the issue. We should include it by default.
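A minimal sketch of the workaround, assuming the packaged slurmd unit reads SLURMD_OPTIONS from /etc/sysconfig/slurmd (the analogue on Debian/Ubuntu would be /etc/default/slurmd):

# report node state to slurmctld immediately on startup
echo 'SLURMD_OPTIONS="-b"' >> /etc/sysconfig/slurmd
systemctl restart slurmd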
Slurm proj: 2.4.4
Slurm ver: 20.11.4-1
CC ver: 8.1.0-1275
A Slurm cluster connected to a MariaDB service in Azure will fail to restart cleanly when terminated/restarted. The following error is seen:
================================================================================
Error executing action `run` on resource 'bash[Add cluster to slurmdbd]'
================================================================================
Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of "bash" "/tmp/chef-script20210401-1390-h8kq8x" ----
STDOUT: This cluster ucla-slurm-docker-grafana already exists. Not adding.
STDERR:
---- End output of "bash" "/tmp/chef-script20210401-1390-h8kq8x" ----
Ran "bash" "/tmp/chef-script20210401-1390-h8kq8x" returned 1
Resource Declaration:
---------------------
# In /opt/cycle/jetpack/system/chef/cache/cookbooks/slurm/recipes/accounting.rb
77: bash 'Add cluster to slurmdbd' do
78: code <<-EOH
79: sacctmgr -i add cluster #{clustername} && touch /etc/slurmdbd.configured
80: EOH
81: not_if { ::File.exist?('/etc/slurmdbd.configured') }
82: not_if "sleep 5 && sacctmgr show cluster | grep #{clustername}"
83: end
84:
Compiled Resource:
------------------
# Declared in /opt/cycle/jetpack/system/chef/cache/cookbooks/slurm/recipes/accounting.rb:77:in `from_file'
bash("Add cluster to slurmdbd") do
action [:run]
default_guard_interpreter :default
command nil
backup 5
returns 0
user nil
interpreter "bash"
declared_type :bash
cookbook_name "slurm"
recipe_name "accounting"
code " sacctmgr -i add cluster UCLA-slurm-docker-grafana && touch /etc/slurmdbd.configured \n"
domain nil
not_if { #code block }
not_if "sleep 5 && sacctmgr show cluster | grep UCLA-slurm-docker-grafana"
end
System Info:
------------
chef_version=13.12.14
platform=centos
platform_version=7.7.1908
ruby=ruby 2.5.7p206 (2019-10-01 revision 67816) [x86_64-linux]
program_name=chef-solo worker: ppid=1385;start=17:48:00;
executable=/opt/cycle/jetpack/system/embedded/bin/chef-solo
Running handlers:
- CycleCloud::ExceptionHandler
- CycleCloud::ExceptionHandler
Running handlers complete
Chef Client failed. 9 resources updated in 16 seconds
Error: A problem occurred while running Chef, check chef-client.log for details
On the scheduler node I see the cluster already exists in sacctmgr:
[cyclecloud@glinc5zuux6 ~]$ sacctmgr list cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
ucla-slur+ 0 9216 1 normal
I suspect accounting.rb line 77 needs a check for whether the cluster already exists before trying to add it.
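One possible contributing factor (an observation, not verified): sacctmgr appears to store the cluster name lowercased (ucla-slur+ above), while the not_if guard greps for the mixed-case name, so the guard never matches. A case-insensitive check would look roughly like:

sacctmgr -n show cluster | grep -i ucla-slurm-docker-grafana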
Hello,
There are a bunch of files listed in the [blobs] files list in project.ini.
I understand that GitHub is not an object store and hence the files are not in the repository. Could someone point me to the files?
Hi, it seems that autoscaling no longer works with CentOS 8.
Tested with :
Cyclecloud Version: 8.1.0-1275
Cyclecloud-Slurm 2.4.2
Results:
CentOS 8 + Slurm - 20.11.0-1 = no autoscaling
CentOS 7 + Slurm - 20.11.0-1 = autoscaling works
[root@ip-0A781804 slurmctld]# cat slurmctld.log
[2020-12-18T00:15:27.016] debug: Log file re-opened
[2020-12-18T00:15:27.020] debug: creating clustername file: /var/spool/slurmd/clustername
[2020-12-18T00:15:27.021] error: Configured MailProg is invalid
[2020-12-18T00:15:27.021] slurmctld version 20.11.0 started on cluster asdasd
[2020-12-18T00:15:27.021] cred/munge: init: Munge credential signature plugin loaded
[2020-12-18T00:15:27.021] debug: auth/munge: init: Munge authentication plugin loaded
[2020-12-18T00:15:27.021] select/cons_res: common_init: select/cons_res loaded
[2020-12-18T00:15:27.021] select/cons_tres: common_init: select/cons_tres loaded
[2020-12-18T00:15:27.021] select/cray_aries: init: Cray/Aries node selection plugin loaded
[2020-12-18T00:15:27.021] select/linear: init: Linear node selection plugin loaded with argument 20
[2020-12-18T00:15:27.021] preempt/none: init: preempt/none loaded
[2020-12-18T00:15:27.021] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2020-12-18T00:15:27.021] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2020-12-18T00:15:27.022] debug: jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
[2020-12-18T00:15:27.022] ext_sensors/none: init: ExtSensors NONE plugin loaded
[2020-12-18T00:15:27.022] debug: switch/none: init: switch NONE plugin loaded
[2020-12-18T00:15:27.022] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:15:27.022] accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
[2020-12-18T00:15:27.022] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/assoc_usage`, No such file or directory
[2020-12-18T00:15:27.022] debug: Reading slurm.conf file: /etc/slurm/slurm.conf
[2020-12-18T00:15:27.023] debug: NodeNames=hpc-pg0-[1-4] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug: NodeNames=htc-[1-5] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2020-12-18T00:15:27.023] topology/tree: init: topology tree plugin loaded
[2020-12-18T00:15:27.023] debug: No DownNodes
[2020-12-18T00:15:27.023] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/last_config_lite`, No such file or directory
[2020-12-18T00:15:27.140] debug: Log file re-opened
[2020-12-18T00:15:27.141] sched: Backfill scheduler plugin loaded
[2020-12-18T00:15:27.141] debug: topology/tree: _read_topo_file: Reading the topology.conf file
[2020-12-18T00:15:27.141] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2020-12-18T00:15:27.141] debug: topology/tree: _log_switches: Switch level:0 name:hpc-Standard_HB60rs-pg0 nodes:hpc-pg0-[1-4] switches:(null)
[2020-12-18T00:15:27.141] debug: topology/tree: _log_switches: Switch level:0 name:htc nodes:htc-[1-5] switches:(null)
[2020-12-18T00:15:27.141] route/default: init: route default plugin loaded
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open node state file /var/spool/slurmd/node_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Information may be lost!
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state.old`, No such file or directory
[2020-12-18T00:15:27.141] No node state file (/var/spool/slurmd/node_state.old) to recover
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:15:27.141] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:15:27.142] No job state file (/var/spool/slurmd/job_state.old) to recover
[2020-12-18T00:15:27.142] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gpu/generic: init: init: GPU Generic plugin loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug: Updating partition uid access list
[2020-12-18T00:15:27.142] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open reservation state file /var/spool/slurmd/resv_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Reservations may be lost
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No reservation state file (/var/spool/slurmd/resv_state.old) to recover
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open trigger state file /var/spool/slurmd/trigger_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-12-18T00:15:27.143] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No trigger state file (/var/spool/slurmd/trigger_state.old) to recover
[2020-12-18T00:15:27.143] read_slurm_conf: backup_controller not specified
[2020-12-18T00:15:27.143] Reinitializing job accounting state
[2020-12-18T00:15:27.143] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-18T00:15:27.143] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.143] Running as primary controller
[2020-12-18T00:15:27.143] debug: No backup controllers, not launching heartbeat.
[2020-12-18T00:15:27.143] debug: priority/basic: init: Priority BASIC plugin loaded
[2020-12-18T00:15:27.143] No parameter for mcs plugin, default values set
[2020-12-18T00:15:27.143] mcs: MCSParameters = (null). ondemand set.
[2020-12-18T00:15:27.143] debug: mcs/none: init: mcs none plugin loaded
[2020-12-18T00:15:57.143] debug: sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:15:57.143] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:16:27.212] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-18T00:16:27.212] debug: sched: Running job scheduler
[2020-12-18T00:17:27.284] debug: sched: Running job scheduler
[2020-12-18T00:17:27.285] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:18:27.362] debug: sched: Running job scheduler
[2020-12-18T00:19:27.438] debug: sched: Running job scheduler
[2020-12-18T00:19:27.438] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:20:27.512] debug: sched: Running job scheduler
[2020-12-18T00:20:27.513] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:20:27.513] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:20:27.513] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:20:27.513] debug: create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:20:27.513] No job state file (/var/spool/slurmd/job_state.old) found
[2020-12-18T00:21:27.676] debug: sched: Running job scheduler
[2020-12-18T00:21:27.676] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:22:27.749] debug: sched: Running job scheduler
[2020-12-18T00:23:27.821] debug: sched: Running job scheduler
[2020-12-18T00:23:27.821] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:24:27.892] debug: sched: Running job scheduler
[2020-12-18T00:25:27.966] debug: Updating partition uid access list
[2020-12-18T00:25:27.966] debug: sched: Running job scheduler
[2020-12-18T00:25:28.067] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:26:27.139] debug: sched: Running job scheduler
[2020-12-18T00:27:27.212] debug: sched: Running job scheduler
[2020-12-18T00:27:28.214] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:28:27.286] debug: sched: Running job scheduler
[2020-12-18T00:29:27.361] debug: sched: Running job scheduler
[2020-12-18T00:29:28.362] debug: shutting down backup controllers (my index: 0)
[2020-12-18T00:30:27.437] debug: sched: Running job scheduler
[2020-12-18T00:30:55.711] req_switch=-2 network='(null)'
[2020-12-18T00:30:55.711] Setting reqswitch to 1.
[2020-12-18T00:30:55.711] returning.
[2020-12-18T00:30:55.712] sched: _slurm_rpc_allocate_resources JobId=2 NodeList=htc-1 usec=1268
[2020-12-18T00:30:56.261] debug: sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:30:56.261] debug: sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:30:57.263] error: power_save: program exit status of 1
[2020-12-18T00:31:27.588] debug: sched: Running job scheduler
[2020-12-18T00:31:28.589] debug: shutting down backup controllers (my index: 0)
[root@ip-0A781804 slurmctld]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 htc hostname andreim CF 1:37 1 htc-1
[root@ip-0A781804 slurmctld]# sinfo -V
slurm 20.11.0
[root@ip-0A781804 slurmctld]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: active (running) since Fri 2020-12-18 00:15:26 UTC; 17min ago
Main PID: 3980 (slurmctld)
Tasks: 8
Memory: 5.5M
CGroup: /system.slice/slurmctld.service
└─3980 /usr/sbin/slurmctld -D
Dec 18 00:15:26 ip-0A781804 systemd[1]: Started Slurm controller daemon.
[root@ip-0A781804 slurm]# cat topology.conf
SwitchName=hpc-Standard_HB60rs-pg0 Nodes=hpc-pg0-[1-4]
SwitchName=htc Nodes=htc-[1-5]
[root@ip-0A781804 slurm]# cat slurm.conf
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
PropagateResourceLimits=ALL
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser="slurm"
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
GresTypes=gpu
SelectTypeParameters=CR_Core_Memory
ClusterName="ASDASD"
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmctldParameters=idle_on_node_suspend
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd/slurmd.log
TopologyPlugin=topology/tree
JobSubmitPlugins=job_submit/cyclecloud
PrivateData=cloud
TreeWidth=65533
ResumeTimeout=1800
SuspendTimeout=600
SuspendTime=300
ResumeProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh
ResumeFailProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_fail_program.sh
SuspendProgram=/opt/cycle/jetpack/system/bootstrap/slurm/suspend_program.sh
SchedulerParameters=max_switch_wait=24:00:00
AccountingStorageType=accounting_storage/none
Include cyclecloud.conf
SlurmctldHost=ip-0A781804
[root@ip-0A781804 slurm]# cat cyclecloud.conf
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=hpc Nodes=hpc-pg0-[1-4] Default=YES DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=hpc-pg0-[1-4] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=htc Nodes=htc-[1-5] Default=NO DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=htc-[1-5] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
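The "power_save: program exit status of 1" above suggests the resume program itself is failing. A hedged debugging step (the path is taken from the slurm.conf shown above; Slurm passes the node list as the argument) is to run it by hand against one node and then inspect its output and the CycleCloud logs under /opt/cycle/jetpack/logs:

sudo /opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh htc-1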
When specifying -N <num_nodes> on a job, CycleCloud doesn't spin up the correct number of nodes and the job never runs. This is needed for MPI jobs wanting to use partial nodes.
Hello,
I have configured a partition with nodetype Standard_F32s_v2 (or different CPUs).
Cyclecloud-slurm.sh (v2.5.1) creates an "invalid configuration" compared to the output of "slurmd -C" (v20.11.8).
cyclecloud:
Feature=cloud STATE=CLOUD CPUs=16 ThreadsPerCore=2 RealMemory=62914
slurmd -C
CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=64401
Hi,
how do I migrate my template based deployment from version 2.0.4 to 2.1.0?
Thanks
Execute nodes should check for any idle jobs they're eligible for before shutting down.
Hello,
OpenMPI was recently upgraded from version 4.0.5 to 4.1.0 on CycleCloud. Since the upgrade I'm having issues using non-blocking communication with Slurm on CycleCloud.
First, I have to use -mca coll ^hcoll to avoid warnings regarding InfiniBand, which F2s_v2 nodes are not equipped with. I had this issue also with version 4.0.5.
Second, since the recent upgrade to OpenMPI v4.1.0, non-blocking communication has stopped working for me. The code I have worked for OpenMPI v4.0.5.
This is the error I'm getting. I've confirmed that the problem occurs when I call MPI_Isend. I'm attaching a small example to reproduce this problem below.
Process 1 started
Initiating communication on worker 1
[1622528555.546527] [ip-0A000007:9268 :0] sock.c:259 UCX ERROR connect(fd=30, dest_addr=127.0.0.1:58173) failed: Connection refused
[ip-0A000007:09268] pml_ucx.c:383 Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[ip-0A000007:09268] pml_ucx.c:453 Error: Failed to resolve UCX endpoint for rank 0
[ip-0A000007:09268] *** An error occurred in MPI_Isend
[ip-0A000007:09268] *** reported by process [3673554945,1]
[ip-0A000007:09268] *** on communicator MPI_COMM_WORLD
[ip-0A000007:09268] *** MPI_ERR_OTHER: known error not in list
[ip-0A000007:09268] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-0A000007:09268] *** and potentially your MPI job)
Program code (mpi_isend.c):
#include <mpi.h>
#include <math.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
    MPI_Comm comm;
    MPI_Request request;
    MPI_Status status;
    int myid, master, tag, proc, my_int, p, ierr;

    comm = MPI_COMM_WORLD;
    ierr = MPI_Init(&argc, &argv);          /* starts MPI */
    MPI_Comm_rank(comm, &myid);             /* get current process id */
    MPI_Comm_size(comm, &p);                /* get number of processes */

    master = 0;
    tag = 123;                              /* set the tag to identify this particular job */
    my_int = myid;                          /* payload to send */
    printf("Process %d started\n", myid);

    if (myid == master) {
        for (proc = 1; proc < p; proc++) {
            MPI_Recv(&my_int, 1, MPI_INT,   /* triplet of buffer, size, data type */
                     MPI_ANY_SOURCE,        /* message source */
                     MPI_ANY_TAG,           /* message tag */
                     comm, &status);        /* status identifies source, tag */
            printf("Received from a worker\n");
        }
        printf("Master finished\n");
    } else {
        printf("Initiating communication on worker %d\n", myid);
        MPI_Isend(&my_int, 1, MPI_INT,      /* non-blocking send: buffer, size, data type */
                  master, tag, comm,
                  &request);                /* send my_int to master */
        MPI_Wait(&request, &status);        /* block until Isend is done */
        printf("Worker %d finished\n", myid);
    }

    MPI_Finalize();                         /* let MPI finish up */
    return 0;
}
Jobfile (mpi_isend.job):
#!/bin/sh -l
#SBATCH --job-name=pool
#SBATCH --output=pool.out
#SBATCH --nodes=2
#SBATCH --time=600:00
#SBATCH --tasks-per-node=1
#SBATCH --partition=hpc
mpirun -mca coll ^hcoll ./mpi_isend
Steps to reproduce:
mpicc mpi_isend.c -o mpi_isend
sbatch mpi_isend.job
My current versions are:
CycleCloud: 8.2.0-1616
Slurm: 20.11.7-1
CycleCloud-Slurm: 2.4.7
OS: CentOS Linux release 7.8.2003 (Core)
While upgrading CycleCloud-Slurm to version 2.4.8 I encountered the following error on the scheduler node while starting my cluster. I know this is a prerelease, but just wanted to make you aware of it. Hope this helps :-)
Chef::Mixin::Template::TemplateError: undefined method `[]' for nil:NilClass
Software Configuration
Review local log files on the VM at /opt/cycle/jetpack/logs
Get more help on this issue
Detail:
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:163:in `rescue in _render_template'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:159:in `_render_template'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:147:in `render_template'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/template/content.rb:76:in `file_for_provider'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/file_content_management/content_base.rb:40:in `tempfile'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:450:in `tempfile'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:327:in `do_generate_content'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:140:in `action_create'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:152:in `action_create_if_missing'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider.rb:171:in `run_action'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource.rb:592:in `run_action'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:70:in `run_action'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in `block (2 levels) in converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in `each'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in `block in converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/resource_list.rb:94:in `block in execute_each_resource'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:114:in `call_iterator_block'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:85:in `step'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:103:in `iterate'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:55:in `each_with_index'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/resource_list.rb:92:in `execute_each_resource'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:97:in `converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:718:in `block in converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:713:in `catch'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:713:in `converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:752:in `converge_and_save'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:286:in `run'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:292:in `block in fork_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:280:in `fork'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:280:in `fork_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:245:in `block in run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/local_mode.rb:44:in `with_server_connectivity'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:233:in `run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:470:in `sleep_then_run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:459:in `block in interval_run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:458:in `loop'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:458:in `interval_run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:442:in `run_application'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:59:in `run'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/solo.rb:225:in `run'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/bin/chef-solo:25:in `<top (required)>'
/opt/cycle/jetpack/system/embedded/bin/chef-solo:23:in `load'
/opt/cycle/jetpack/system/embedded/bin/chef-solo:23:in `
1 node with this status
When attempting to build clusters with one resource group per cluster, the build will fail if there are any policies enforcing the creation of certain tags on a resource group. Please could you include an option to include arbitrary tags on the resource groups/resources created when deploying a cluster?
The current cyclecloud_slurm does not support multiple MachineType values per nodearray, nor multiple nodearrays assigned to the same Slurm partition. If multiple values for either are supplied, the Python code will take only the first value in the list. Remarks in the partition class definition say that a one-to-one mapping of partition names to nodearrays is required.
Cyclecloud cluster templates themselves support multiple machine type values per nodearray and Slurm supports multiple machine types per partition. The current limitation of one machine type per partition is a function of the Cyclecloud implementation. Users of a cluster would benefit from being able to ask for a number of cores in a single partition and having the scheduler determine which size VM to create.
Hi all,
I am trying to install openmpi with slurm support on cyclecloud:
spack install hpl^openmpi+pmi schedulers=slurm
But I get this error:
'/configure' '--prefix=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/openmpi-3.1.5-3uku7u7irjzvpxy5uwajd34ksg5vyyim' '--enable-shared' '--with-wrapper-ldflags=' '--with-pmi=/usr' '--enable-static' '--enable-mpi-cxx' '--with-zlib=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/zlib-1.2.11-7zv5elkj6r5xcrw4mifho5mfhi6wuwpq' '--without-psm' '--without-libfabric' '--without-ucx' '--without-mxm' '--without-verbs' '--without-psm2' '--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--without-tm' '--without-loadleveler' '--disable-memchecker' '--with-hwloc=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/hwloc-1.11.11-7tt3mdset5tqvje4uaonjdqm2b3koihp' '--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'
1 error found in build log:
1125 configure: WARNING: /usr/slurm
1126 configure: WARNING: Specified path: /usr
1127 configure: WARNING: OR neither libpmi nor libpmi2 were found under:
1128 configure: WARNING: /lib
1129 configure: WARNING: /lib64
1130 configure: WARNING: Specified path:
>> 1131 configure: error: Aborting
It is a very similar problem to: aws/aws-parallelcluster#1008
PMI has been split due to this bug: https://bugs.schedmd.com/show_bug.cgi?id=4511
May I ask to add the package slurm-libpmi?
Thanks
Autoscaling after submitting a Slurm array-type job results in the cluster spinning up very slowly.
This appears to be because, even with an arbitrarily large number of array tasks queued, only a small number of nodes is requested at a time.
Due to the ~5-10 minutes to spin up an instance, this severely limits autoscaling.
A workaround is to submit a dummy non-array job requesting a suitable number of nodes.
Even if this job is stuck behind the Array job in the queue, it still results in the cluster spinning up many nodes at once, which the array jobs are then dispatched to.
e.g. to request 100 CPUs worth of nodes
echo '#!/bin/sh' | sbatch -n 100
When we generate or regenerate the slurm.conf, we should add SuspendExcNodes= with a node list of all nodes where KeepAlive is true in CycleCloud.
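For instance, if htc-1 and hpc-pg0-1 were marked KeepAlive (hypothetical node names), the generated slurm.conf would gain a line along these lines:

SuspendExcNodes=htc-1,hpc-pg0-1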
For jobs that can run on or have requested the same node, there's a race condition where the second job to request the node may fail with a communication error.
Working on a workaround to get the job to block until the node is completely up and registered with Slurm. It's not clear yet if this is a bug in Slurm or an issue with the Chef recipe order of operations.
Hi,
as the new RHEL clones take over from CentOS (which no longer provides bug-compatible builds), I'd consider AlmaLinux and Rocky Linux the major successor platforms.
Currently this repo only supports centos and rhel as platform strings, but it should support others too, especially since those are bug-compatible builds.
I understand, though, that current CycleCloud support is limited to ubuntu, centos, and rhel.
Thanks
The sacct command doesn't work by default on the head node. Accounting is useful for debugging issues. The option to turn it on should be exposed via the cluster template.
I see there is a comment in the template but I was wondering if this is something that is being actively worked on?
This is my first stab at setting up Azure HPC using CycleCloud and Slurm, so forgive me for stupid mistakes...
Also, if this is the wrong place for this - please point me to where I should post support questions...
I have built a simple (default) Slurm cluster using CycleCloud and the nodes start/stop OK, but when I run a simple (hostname) job it just hangs at "Completing".
Debug logging seems to suggest that the Slurm Scheduler node cannot communicate with the nodes:
[azccadmin@SlurmCluster-1-scheduler data]$ srun -N2 -n2 -t00:15:00 -Jgrma-hostname hostname.sh
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-1"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-1, check slurm.conf
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-2"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-2, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-2: Can't find an address, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-1: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted
I cannot ping the node from the Scheduler node:
[azccadmin@SlurmCluster-1-scheduler ~]$ ping slurmcluster-1-hpc-pg0-1
ping: slurmcluster-1-hpc-pg0-1: Name or service not known
Slurm is trying to talk to the node as "slurmcluster-1-hpc-pg0-1", but the hostname of the node is actually "slurmcluster-1-slurmcluster-1-hpc-pg0-1"
And I can ping it with this name from the Scheduler node:
[azccadmin@SlurmCluster-1-scheduler data]$ ping slurmcluster-1-slurmcluster-1-hpc-pg0-1
PING slurmcluster-1-slurmcluster-1-hpc-pg0-1.yr5ran05dk5uzlz13kqz0cq4xe.ax.internal.cloudapp.net (192.168.140.6) 56(84) bytes of data.
64 bytes from slurmcluster-1-slurmcluster-1-hpc-pg0-1.internal.cloudapp.net (192.168.140.6): icmp_seq=1 ttl=64 time=1.77 ms
I am using CycleCloud 8.3 - which I note has a fix for Slurm NodeName / Azure Hostname (but this seems to still be an issue)?
Thanks
Gary
cyclecloud-slurm 2.4.8 compiles Slurm with PMIx support so that MPI libraries will interface correctly with srun.
It seems that the 2.4.8 build expects the PMIx libs to be installed on the compute nodes at /opt/pmix/v3, but these do not exist and need to be built (e.g. via a cluster-init project):
#!/bin/bash
# Build and install PMIx 3.x into /opt/pmix/v3 (e.g. from a cluster-init script)
cd ~/
mkdir -p /opt/pmix/v3
apt install -y libevent-dev
tar xvf $CYCLECLOUD_SPEC_PATH/files/openpmix-3.1.6.tar.gz
cd openpmix-3.1.6
./autogen.sh
./configure --prefix=/opt/pmix/v3
make -j install >/dev/null
Can the PMIx libs be included with the versions built with PMIx support (2.4.8+)?
When the cluster is scaled (i.e. cyclecloud_slurm.sh scale), the topology.conf file is overwritten. That also overwrites the on-prem switch definitions. Is it possible to prevent this, or to separate the on-prem topology from the Cycle topology?
Before a node calls jetpack --shutdown, it should mark itself as draining to prevent a race condition where new jobs get placed during the shutdown sequence.
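A minimal sketch of that step, assuming the Slurm node name matches the short hostname (which, per other reports here, is not always the case):

# run on the node before it shuts itself down
scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="CycleCloud shutdown"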
docker run -v $(pwd)/specs/default/cluster-init/files:/source -v $(pwd)/blobs:/root/rpmbuild/RPMS/x86_64 -ti centos:7 /bin/bash /source/00-build-slurm.sh
triggers the error:
/bin/bash: /source/00-build-slurm.sh: Permission denied
because inside the container the permissions of the source dir are the following:
[root@11d3f968ca8c /]# ll
total 12
-rw-r--r--. 1 root root 12082 Mar 5 17:36 anaconda-post.log
drwxrwxr-x. 2 1002 1002 49 Apr 11 12:55 source
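If the Docker host is running SELinux in enforcing mode (common on CentOS/RHEL), a likely cause is missing volume relabeling rather than the ownership shown above; a hedged variant of the same command adds :z to the bind mounts:

docker run -v $(pwd)/specs/default/cluster-init/files:/source:z -v $(pwd)/blobs:/root/rpmbuild/RPMS/x86_64:z -ti centos:7 /bin/bash /source/00-build-slurm.sh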
Per discussion with @anhoward, there is a need for a new role for the scenario with HA schedulers and Slurm accounting. When the primary scheduler fails, slurmdbd should be able to run on the HA scheduler.
I've uploaded version 2.7.1 to the cluster (after downloading the blobs from GitHub). I've modified the template to include a GPU cluster along with parameter definitions corresponding to that cluster. I've selected NC6 as the GPU machine type and started the cluster. Everything starts fine and I'm able to allocate the F32s_v2 nodes that correspond to the hpc partition. However, when allocating the GPU node, the node starts without any error reported in the console; in the scheduler, though, Slurm does not appear to recognize this fact and is still stuck on:
[hpcadmin@ip-0A030006 ~]$ salloc -p gpu -n 1
salloc: Granted job allocation 3
salloc: Waiting for resource configuration
This is a problem with both the CentOS 7 and Almalinux 8 operating systems. I have the following cloud-init scripts to install singularity in each case. These don't appear to be a problem as any error in the script typically gets reported as an error in the web console.
CentOS 7:
#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el7.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el7.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el7.x86_64.rpm
Almalinux 8
#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el8.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el8.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el8.x86_64.rpm
Hello.
I think there is a small bug in the 2.1.0 release.
When I do the following:
git clone
git checkout 2.1.0
git pull origin 2.1.0 (to be sure :P)
cd cyclecloud-slurm
./docker-rpmbuild.sh
the container starts and builds the *.rpm and *.deb files into /blobs.
When I try to upload or build the project (cyclecloud project upload / cyclecloud project build), I get a file-not-found error looking for files in ./blobs that have a mismatched version:
project.ini shows the following defined in blobs section (note the versions):
[blobs]
Files = cyclecloud-api-7.9.2.tar.gz, job_submit_cyclecloud_centos_18.08.9-1.so, slurm-18.08.9-1.el7.x86_64.rpm, slurm-contribs-18.08.9-1.el7.x86_64.rpm, etc, etc, etc
However, after running docker-rpmbuild.sh and looking in ./blobs, here is the file listing:
[root@cyclecloud blobs]# ls -tlr
total 77336
-rw-r--r--. 1 root root 16362 Mar 5 19:23 cyclecloud-api-7.9.2.tar.gz
-rw-r--r--. 1 root root 13426172 Mar 15 09:24 slurm-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 796360 Mar 15 09:24 slurm-perlapi-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 78540 Mar 15 09:24 slurm-devel-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 4996 Mar 15 09:24 slurm-example-configs-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 1157424 Mar 15 09:24 slurm-slurmctld-18.08.8-1.el7.x86_64.rpm
(As well, the files cyclecloud-api-7.9.2.tar.gz, job_submit_cyclecloud_ubuntu_19.05.5-1.so, and job_submit_cyclecloud_centos_19.05.5-1.so do not seem to be built by the container with docker-rpmbuild.sh; I had to fetch them from the GitHub releases page with wget.)
When I changed the VERSION in the 00-build-slurm.sh script to 18.08.9 and re-ran the docker-rpmbuild.sh script, I can see the files are built with the version matching what project.ini is looking for.
I'm not 100% sure whether this is a bug (if so, I can do a PR if needed; it seems simple) or if I'm missing something in the steps to build/deploy this project to the locker.
Thanks,
Daniel
[root@cyclecloud blobs]# ls -ltr
-rw-r--r--. 1 root root 13424068 Mar 15 09:47 slurm-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 795780 Mar 15 09:47 slurm-perlapi-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 78540 Mar 15 09:47 slurm-devel-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 5000 Mar 15 09:47 slurm-example-configs-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 1157280 Mar 15 09:47 slurm-slurmctld-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 626380 Mar 15 09:47 slurm-slurmd-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 670392 Mar 15 09:47 slurm-slurmdbd-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 140940 Mar 15 09:47 slurm-libpmi-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 116664 Mar 15 09:47 slurm-torque-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 8476 Mar 15 09:47 slurm-openlava-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 16612 Mar 15 09:47 slurm-contribs-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 147612 Mar 15 09:47 slurm-pam_slurm-18.08.9-1.el7.x86_64.rpm
-rwxr-xr-x. 1 root root 198400 Mar 15 09:51 job_submit_cyclecloud_centos_18.08.9-1.so
-rwxr-xr-x. 1 root root 198400 Mar 15 09:51 job_submit_cyclecloud_ubuntu_18.08.9-1.so
Depending on the version of MPI or ISV code being used, occasionally they try to rely on the Slurm nodenames which aren't actual resolvable hostnames. This causes the jobs to fail.
It would be good if the actual hostnames on the nodes and in Azure DNS matched the nodename used in Slurm.
The new "append to slurm.conf" option (CC portal --> Edit --> Advanced Settings --> Additional slurm conf) does not appear to be working. The slurm.conf is not updated.
CC = 8.4.0
Slurm Cluster-init = 3.0.1
Slurm version = 22.05.8-1
ISSUE
Using a dynamic partition with multiple VM types, including a GPU type, results in a gres.conf entry that covers the entire nodearray, even though the GPU VM type is only a subset of it.
STEPS TO REPRODUCE
Config.Multiselect = true
Use scontrol create to define the nodes for autoscaling in Slurm, for example:
scontrol create NodeName=jm-slurm-multi-low-[1-10] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=3072 Feature=dyn,Standard_F2s_v2 State=CLOUD
scontrol create NodeName=jm-slurm-multi-mid-[1-10] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7168 Feature=dyn,Standard_D2ds_v4 State=CLOUD
scontrol create NodeName=jm-slurm-multi-high-[1-15] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=15360 Feature=dyn,Standard_E2ds_v4 State=CLOUD
scontrol create NodeName=jm-slurm-multi-gpu-[1-5] CPUs=6 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=54476 Gres=gpu:1 Feature=dyn,Standard_NC6_Promo State=CLOUD
Run azslurm scale to build the gres.conf.
/etc/slurm/gres.conf
root@jm-slurm-multi-hn:~# cat /etc/slurm/gres.conf
Nodename=jm-slurm-mutli-dynamic-[1-16] Name=gpu Count=1 File=/dev/nvidia0
WORKAROUND
Manually update the gres.conf for both hostname and quantity.
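For reference, the corrected entry would presumably be restricted to the GPU node names defined above, along the lines of:

Nodename=jm-slurm-multi-gpu-[1-5] Name=gpu Count=1 File=/dev/nvidia0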
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
I noticed that by default the /sched mount in the slurm.txt template has 1 TB of disk. But as far as I understood it only contains some Slurm and CycleCloud configuration files. Why does it need all this space?
CycleCloud - 8.1.0-1275
Slurm - 19.05.8-1
Scenario:
{{{
This node does not match existing scaleset attributes: Configuration.cyclecloud.mounts.additional_nfs.export_path, Configuration.cyclecloud.mounts.additional_nfs.mountpoint
}}}
Can you please update the documentation so that it explains how and where to clone this repo?
It says that you have to change directory to the slurm directory, but where is that?
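For anyone else hitting this, the steps are presumably along these lines (a sketch; the locker name is site-specific and the exact directory to run from should be confirmed against the docs):

git clone https://github.com/Azure/cyclecloud-slurm.git
cd cyclecloud-slurm
cyclecloud project upload <your-locker-name>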
I'm not sure if I'm doing something wrong, but I'm utterly failing to use the default Slurm image to do some basic MPI.
I've stood up a simple cyclecloud-slurm setup with the default image (Cycle CentOS 7) for all node types.
I can run a simple single node MPI job as long as I use Intel MPI:
srun -n2 mpirun ./hello
That works fine.
But as soon as I try to use more than one node, I get all sorts of infiniband related error messages.
If I try to use a machinetype that supports Infiniband I end up with no functioning MPI at all, as it appears to not get installed.
Try even a single node MPI test with OpenMPI and I get errors possibly related to hostname issues:
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[22239,0],0] on node ip-0A000406
Remote daemon: [[22239,0],1] on node hpc-1
Looks like OpenMPI is working with the Slurm node name (hpc-1) rather than that actual hostname (ip-0A000406), and then possibly getting upset as a result?
If I've missed something really obvious, please do point it out :)
Thanks,
John
I'm wondering why each job submission ends up in its own dedicated virtual machine although slots are available on, e.g., the node of the first job.
We need to make this project's use of physical vs virtual cores and how those are represented configurable.
CoresPerSocket does not seem to be calculated correctly. Here is a 32 vCPU, 16 pCPU VM, but it does not use 2 for the CoresPerSocket:
Nodename=htc-[1-4] Feature=cloud STATE=CLOUD CPUs=16 CoresPerSocket=1 RealMemory=124518
The latest release switched from vcpu to pcpu, so hyperthreading is no longer captured in the configuration. Suggest using threads to allow 2 threads per core/cpu on VMs that support it.
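If threads were represented explicitly, the same VM would presumably be described along these lines (a sketch; only the topology fields change, and the memory figure is copied from the current line):

Nodename=htc-[1-4] Feature=cloud STATE=CLOUD CPUs=32 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=124518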
Slurm templates 2.6+ do not have AdditionalClusterInitSpecs for the login nodes
Native Debian (and also Ubuntu) packages use /etc/default/slurmd as the system config path, instead of RHEL's /etc/sysconfig/slurmd.
Therefore, if the user uses the distro's native Slurm packages, the following code may be useless.
cyclecloud-slurm/specs/default/chef/site-cookbooks/slurm/recipes/execute.rb
Lines 112 to 121 in c5de6d6
A fix could be: when node['slurm']['install'] == true and the /etc/default directory exists, write the file to /etc/default/slurmd instead.
Hi,
Is there any option to change the Slurm installation path?
The benefit would be keeping it in sync with on-premise deployments, which makes the transition easier.
Thanks