
cyclecloud-slurm's People

Contributors

aditigaur4, anhoward, atomic-penguin, bwatrous, dougclayton, dpwatrous, edwardsp, jamesongithub, jermth, microsoft-github-policy-service[bot], microsoftopensource, msftgits, ryanhamel, staer, themorey, wolfgang-desalvador


cyclecloud-slurm's Issues

Some modern slurm features not supported

With the currently supported version of Slurm, the slurmrestd service can be built by adding a few dependencies and an rpmbuild flag in the rpmbuild stage:

RUN yum install -y http-parser-devel json-c-devel
RUN rpmbuild  --define "_with_slurmrestd 1" -ta ${SLURM_PKG}

How this would then be added to chef/cloud-init is unfortunately beyond me!
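
For reference, here is a minimal, hypothetical sketch of how the resulting package might be installed on the scheduler node via a cluster-init script; the slurm-slurmrestd package and service names are assumptions based on the standard SchedMD spec file, and slurmrestd would still need its own configuration (authentication, socket) afterwards:

#!/bin/bash
# Hypothetical sketch: install the slurmrestd RPM produced by the rpmbuild above
# and enable its service. Adjust the path to wherever the built RPM is staged.
set -e
yum install -y ./slurm-slurmrestd-*.el7.x86_64.rpm
systemctl enable --now slurmrestd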

There are also several cloud-focused developments (such as burstbuffer/lua for staging files, which is useful with cloud compute resources) and commonly required features (job script capturing, parsable account queries, etc.) that are unavailable because they only exist in Slurm 21.08.

Is there a timeline for when the slurm version will be bumped to the latest, and is there any scope for enabling additional features like slurmrestd?

Autoscaling doesn't work for pending jobs

I am using CycleCloud with a Slurm queueing system in combination with dask (a Python package that manages the bookkeeping and task distribution so that one doesn't have to write job scripts and collect the data).

It's possible to use the adaptive scaling feature of dask, which means that as the load on all my nodes becomes high, it automatically creates new jobs using (in my case) the following job script:

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -e TMPDIR/dask-worker-%J.err
#SBATCH -o TMPDIR/dask-worker-%J.out
#SBATCH -p debug
#SBATCH -A WAL
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=15G
#SBATCH -t 240:00:00
JOB_ID=${SLURM_JOB_ID%;*}

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

/gscratch/home/t-banij/miniconda3/envs/py37_min/bin/python -m distributed.cli.dask_worker tcp://10.75.64.5:43267 --nthreads 1 --memory-limit 16.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60 --local-directory TMPDIR

This results in many PENDING jobs. Unfortunately, no new nodes are started automatically; I still need to go into the web interface and start new nodes by hand, which renders the autoscale feature useless for me.

Seems related to #1, #5, and #9.

ResumeTimeout leaves nodes in down~ state

Nodes that do not start within ResumeTimeout (default is 10 minutes) enter the down~ state and will not recover from it on their own. I recommend enabling return_to_idle.sh on all versions of Slurm, not just <= 18.
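
For illustration, a minimal sketch of the idea behind such a script (my assumption of the approach, not the project's actual return_to_idle.sh), which could be run from cron to return powered-down nodes stuck in down~ back to idle:

#!/bin/bash
# Hedged sketch: find cloud nodes marked down after a failed resume ("down~")
# and return them to idle so Slurm can try to resume them again.
for node in $(sinfo -h -N -o "%N %t" | awk '$2 == "down~" {print $1}' | sort -u); do
    scontrol update NodeName="$node" State=IDLE
done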

support RHEL images

The following files need to be updated to recognize RHEL images (a sketch of the change follows the list):

slurm/recipes/execute.rb
Line 68: change to when 'centos', 'rhel'

slurm/recipes/default.rb
Line 89: change to when 'centos', 'rhel'

slurm/attributes/default.rb
Line 11: change to when 'centos', 'rhel'
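
A hedged illustration of the requested change, applied as a one-off patch (the sed pattern assumes the recipes switch on the platform string exactly as quoted above):

# Hypothetical one-off patch of the three files listed above.
for f in slurm/recipes/execute.rb slurm/recipes/default.rb slurm/attributes/default.rb; do
    sed -i "s/when 'centos'/when 'centos', 'rhel'/" "$f"
done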

"slurmctld restart" stuck after scaling the nodes

CycleCloud Version - 8.1.0-1275
Slurm - 19.05.8-1

Scenario:

  1. Changing the Max core count for the HPC array in the CycleCloud UI
  2. Run the scale command (./cyclecloud_slurm.sh scale) and we see below behavior:

{{{
sinfo doesn't show the newly added node, and it seems slurmctld is stuck in restart:
[root@ip-0A060009 slurm]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
hpc* up infinite 2 alloc hpc-pg0-[1-2]
htc up infinite 2 idle~ htc-[1-2]

[root@ip-0A060009 slurm]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: failed (Result: exit-code) since Thu 2021-02-18 20:42:28 UTC; 3s ago
Process: 11278 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 11280 (code=exited, status=1/FAILURE)

Feb 18 20:42:28 ip-0A060009 systemd[1]: Starting Slurm controller daemon...
Feb 18 20:42:28 ip-0A060009 systemd[1]: Started Slurm controller daemon.
Feb 18 20:42:28 ip-0A060009 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Feb 18 20:42:28 ip-0A060009 systemd[1]: Unit slurmctld.service entered failed state.
Feb 18 20:42:28 ip-0A060009 systemd[1]: slurmctld.service failed.
}}}

Add -b option to slurmd startup

There are occasions where slurmd times out checking in with slurmctld because of high node uptime before slurmd starts. Adding -b to the slurmd startup works around the issue; we should include it by default.
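
As a hedged illustration (assuming the packaged slurmd unit sources /etc/sysconfig/slurmd, the same file the cookbook writes elsewhere on this page), the option could be injected like this until it ships by default:

# Hypothetical workaround: pass -b ("report node rebooted") to slurmd at startup.
echo 'SLURMD_OPTIONS="-b"' >> /etc/sysconfig/slurmd
systemctl restart slurmd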

Slurm Accounting add cluster fails on cluster terminate/restart

Slurm proj: 2.4.4
Slurm ver: 20.11.4-1
CC ver: 8.1.0-1275

A Slurm cluster connected to a MariaDB service in Azure will fail to restart cleanly when terminated/restarted. The following error is seen:

================================================================================
Error executing action `run` on resource 'bash[Add cluster to slurmdbd]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of "bash"  "/tmp/chef-script20210401-1390-h8kq8x" ----
STDOUT: This cluster ucla-slurm-docker-grafana already exists.  Not adding.
STDERR: 
---- End output of "bash"  "/tmp/chef-script20210401-1390-h8kq8x" ----
Ran "bash"  "/tmp/chef-script20210401-1390-h8kq8x" returned 1

Resource Declaration:
---------------------
# In /opt/cycle/jetpack/system/chef/cache/cookbooks/slurm/recipes/accounting.rb

 77: bash 'Add cluster to slurmdbd' do
 78:     code <<-EOH
 79:         sacctmgr -i add cluster #{clustername} && touch /etc/slurmdbd.configured 
 80:         EOH
 81:     not_if { ::File.exist?('/etc/slurmdbd.configured') }
 82:     not_if "sleep 5 && sacctmgr show cluster | grep #{clustername}" 
 83: end
 84: 

Compiled Resource:
------------------
# Declared in /opt/cycle/jetpack/system/chef/cache/cookbooks/slurm/recipes/accounting.rb:77:in `from_file'

bash("Add cluster to slurmdbd") do
  action [:run]
  default_guard_interpreter :default
  command nil
  backup 5
  returns 0
  user nil
  interpreter "bash"
  declared_type :bash
  cookbook_name "slurm"
  recipe_name "accounting"
  code "        sacctmgr -i add cluster UCLA-slurm-docker-grafana && touch /etc/slurmdbd.configured \n"
  domain nil
  not_if { #code block }
  not_if "sleep 5 && sacctmgr show cluster | grep UCLA-slurm-docker-grafana"
end

System Info:
------------
chef_version=13.12.14
platform=centos
platform_version=7.7.1908
ruby=ruby 2.5.7p206 (2019-10-01 revision 67816) [x86_64-linux]
program_name=chef-solo worker: ppid=1385;start=17:48:00;
executable=/opt/cycle/jetpack/system/embedded/bin/chef-solo


Running handlers:
  - CycleCloud::ExceptionHandler
  - CycleCloud::ExceptionHandler
Running handlers complete
Chef Client failed. 9 resources updated in 16 seconds
Error: A problem occurred while running Chef, check chef-client.log for details

On the scheduler node I see the cluster already exists in sacctmgr:

   [cyclecloud@glinc5zuux6 ~]$ sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit        MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
ucla-slur+                            0  9216         1                                                                                           normal           

I suspect accounting.rb line 77 needs a check for whether the cluster already exists before trying to add it.
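
A hedged sketch of such a guard in plain shell (${clustername} stands in for the recipe's cluster name variable; the case-insensitive grep reflects the fact that sacctmgr stores the name lowercased, which is what trips up the current not_if guard):

# Hypothetical idempotent version of the "Add cluster to slurmdbd" step.
if ! sacctmgr -n -p show cluster | grep -qi "^${clustername}|"; then
    sacctmgr -i add cluster "${clustername}"
fi
touch /etc/slurmdbd.configured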

Location of files in the Blobs

Hello,

There are a bunch of files listed in the files list in project.ini.

I understand that GitHub is not an object store and hence the files are not in the repository. Could someone point me to the files?

No autoscaling when using Centos8 base image

Hi, it seems that autoscaling no longer works with CentOS 8.

Tested with :

Cyclecloud Version: 8.1.0-1275
Cyclecloud-Slurm 2.4.2

Results:
Centos8 + Slurm - 20.11.0-1 = No Autoscaling
Centos7 + Slurm - 20.11.0-1 = Autoscale works

[root@ip-0A781804 slurmctld]# cat slurmctld.log
[2020-12-18T00:15:27.016] debug:  Log file re-opened
[2020-12-18T00:15:27.020] debug:  creating clustername file: /var/spool/slurmd/clustername
[2020-12-18T00:15:27.021] error: Configured MailProg is invalid
[2020-12-18T00:15:27.021] slurmctld version 20.11.0 started on cluster asdasd
[2020-12-18T00:15:27.021] cred/munge: init: Munge credential signature plugin loaded
[2020-12-18T00:15:27.021] debug:  auth/munge: init: Munge authentication plugin loaded
[2020-12-18T00:15:27.021] select/cons_res: common_init: select/cons_res loaded
[2020-12-18T00:15:27.021] select/cons_tres: common_init: select/cons_tres loaded
[2020-12-18T00:15:27.021] select/cray_aries: init: Cray/Aries node selection plugin loaded
[2020-12-18T00:15:27.021] select/linear: init: Linear node selection plugin loaded with argument 20
[2020-12-18T00:15:27.021] preempt/none: init: preempt/none loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
[2020-12-18T00:15:27.022] ext_sensors/none: init: ExtSensors NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  switch/none: init: switch NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:15:27.022] accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
[2020-12-18T00:15:27.022] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/assoc_usage`, No such file or directory
[2020-12-18T00:15:27.022] debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
[2020-12-18T00:15:27.023] debug:  NodeNames=hpc-pg0-[1-4] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug:  NodeNames=htc-[1-5] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2020-12-18T00:15:27.023] topology/tree: init: topology tree plugin loaded
[2020-12-18T00:15:27.023] debug:  No DownNodes
[2020-12-18T00:15:27.023] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/last_config_lite`, No such file or directory
[2020-12-18T00:15:27.140] debug:  Log file re-opened
[2020-12-18T00:15:27.141] sched: Backfill scheduler plugin loaded
[2020-12-18T00:15:27.141] debug:  topology/tree: _read_topo_file: Reading the topology.conf file
[2020-12-18T00:15:27.141] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2020-12-18T00:15:27.141] debug:  topology/tree: _log_switches: Switch level:0 name:hpc-Standard_HB60rs-pg0 nodes:hpc-pg0-[1-4] switches:(null)
[2020-12-18T00:15:27.141] debug:  topology/tree: _log_switches: Switch level:0 name:htc nodes:htc-[1-5] switches:(null)
[2020-12-18T00:15:27.141] route/default: init: route default plugin loaded
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open node state file /var/spool/slurmd/node_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Information may be lost!
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state.old`, No such file or directory
[2020-12-18T00:15:27.141] No node state file (/var/spool/slurmd/node_state.old) to recover
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:15:27.142] No job state file (/var/spool/slurmd/job_state.old) to recover
[2020-12-18T00:15:27.142] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  Updating partition uid access list
[2020-12-18T00:15:27.142] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open reservation state file /var/spool/slurmd/resv_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Reservations may be lost
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No reservation state file (/var/spool/slurmd/resv_state.old) to recover
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open trigger state file /var/spool/slurmd/trigger_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No trigger state file (/var/spool/slurmd/trigger_state.old) to recover
[2020-12-18T00:15:27.143] read_slurm_conf: backup_controller not specified
[2020-12-18T00:15:27.143] Reinitializing job accounting state
[2020-12-18T00:15:27.143] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-18T00:15:27.143] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.143] Running as primary controller
[2020-12-18T00:15:27.143] debug:  No backup controllers, not launching heartbeat.
[2020-12-18T00:15:27.143] debug:  priority/basic: init: Priority BASIC plugin loaded
[2020-12-18T00:15:27.143] No parameter for mcs plugin, default values set
[2020-12-18T00:15:27.143] mcs: MCSParameters = (null). ondemand set.
[2020-12-18T00:15:27.143] debug:  mcs/none: init: mcs none plugin loaded
[2020-12-18T00:15:57.143] debug:  sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:15:57.143] debug:  sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:16:27.212] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-18T00:16:27.212] debug:  sched: Running job scheduler
[2020-12-18T00:17:27.284] debug:  sched: Running job scheduler
[2020-12-18T00:17:27.285] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:18:27.362] debug:  sched: Running job scheduler
[2020-12-18T00:19:27.438] debug:  sched: Running job scheduler
[2020-12-18T00:19:27.438] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:20:27.512] debug:  sched: Running job scheduler
[2020-12-18T00:20:27.513] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:20:27.513] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:20:27.513] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:20:27.513] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:20:27.513] No job state file (/var/spool/slurmd/job_state.old) found
[2020-12-18T00:21:27.676] debug:  sched: Running job scheduler
[2020-12-18T00:21:27.676] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:22:27.749] debug:  sched: Running job scheduler
[2020-12-18T00:23:27.821] debug:  sched: Running job scheduler
[2020-12-18T00:23:27.821] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:24:27.892] debug:  sched: Running job scheduler
[2020-12-18T00:25:27.966] debug:  Updating partition uid access list
[2020-12-18T00:25:27.966] debug:  sched: Running job scheduler
[2020-12-18T00:25:28.067] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:26:27.139] debug:  sched: Running job scheduler
[2020-12-18T00:27:27.212] debug:  sched: Running job scheduler
[2020-12-18T00:27:28.214] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:28:27.286] debug:  sched: Running job scheduler
[2020-12-18T00:29:27.361] debug:  sched: Running job scheduler
[2020-12-18T00:29:28.362] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:30:27.437] debug:  sched: Running job scheduler
[2020-12-18T00:30:55.711] req_switch=-2 network='(null)'
[2020-12-18T00:30:55.711] Setting reqswitch to 1.
[2020-12-18T00:30:55.711] returning.
[2020-12-18T00:30:55.712] sched: _slurm_rpc_allocate_resources JobId=2 NodeList=htc-1 usec=1268
[2020-12-18T00:30:56.261] debug:  sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:30:56.261] debug:  sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:30:57.263] error: power_save: program exit status of 1
[2020-12-18T00:31:27.588] debug:  sched: Running job scheduler
[2020-12-18T00:31:28.589] debug:  shutting down backup controllers (my index: 0)
[root@ip-0A781804 slurmctld]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2       htc hostname  andreim CF       1:37      1 htc-1

[root@ip-0A781804 slurmctld]# sinfo -V
slurm 20.11.0

[root@ip-0A781804 slurmctld]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmctld.service.d
           └─override.conf
   Active: active (running) since Fri 2020-12-18 00:15:26 UTC; 17min ago
 Main PID: 3980 (slurmctld)
    Tasks: 8
   Memory: 5.5M
   CGroup: /system.slice/slurmctld.service
           └─3980 /usr/sbin/slurmctld -D

Dec 18 00:15:26 ip-0A781804 systemd[1]: Started Slurm controller daemon.
[root@ip-0A781804 slurm]# cat topology.conf
SwitchName=hpc-Standard_HB60rs-pg0 Nodes=hpc-pg0-[1-4]
SwitchName=htc Nodes=htc-[1-5]

[root@ip-0A781804 slurm]# cat slurm.conf
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
PropagateResourceLimits=ALL
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser="slurm"
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
GresTypes=gpu
SelectTypeParameters=CR_Core_Memory
ClusterName="ASDASD"
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmctldParameters=idle_on_node_suspend
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd/slurmd.log
TopologyPlugin=topology/tree
JobSubmitPlugins=job_submit/cyclecloud
PrivateData=cloud
TreeWidth=65533
ResumeTimeout=1800
SuspendTimeout=600
SuspendTime=300
ResumeProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh
ResumeFailProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_fail_program.sh
SuspendProgram=/opt/cycle/jetpack/system/bootstrap/slurm/suspend_program.sh
SchedulerParameters=max_switch_wait=24:00:00
AccountingStorageType=accounting_storage/none
Include cyclecloud.conf
SlurmctldHost=ip-0A781804

[root@ip-0A781804 slurm]# cat cyclecloud.conf
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=hpc Nodes=hpc-pg0-[1-4] Default=YES DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=hpc-pg0-[1-4] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=htc Nodes=htc-[1-5] Default=NO DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=htc-[1-5] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
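
Given the "power_save: program exit status of 1" error above, one hedged debugging step (paths taken from the slurm.conf shown here; the hostlist argument follows Slurm's ResumeProgram convention) is to run the resume program by hand and inspect its exit status and the jetpack logs:

# Hypothetical manual test of the configured ResumeProgram for one node.
sudo -u slurm /opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh htc-1
echo "exit status: $?"
ls /opt/cycle/jetpack/logs/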

Confusing PCPU vs VCPU behavior - Standard_F32s_v2

Hello,

I have configured a partition with node type Standard_F32s_v2 (the same happens with other CPU types).
cyclecloud_slurm.sh (v2.5.1) creates an "invalid configuration" compared to the output of "slurmd -C" (v20.11.8).

cyclecloud:
Feature=cloud STATE=CLOUD CPUs=16 ThreadsPerCore=2 RealMemory=62914

slurmd -C
CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=64401

Connection refused for `Isend` over OpenMPI for `F2s_v2` nodes

Hello,

OpenMPI was recently upgraded from version 4.0.5 to 4.1.0 on CycleCloud. Since the upgrade I'm having issues using non-blocking communication with Slurm on CycleCloud.

First, I have to use -mca ^hcoll to avoid warnings regarding InfiniBand, which F2s_v2 nodes are not equipped with. I had this issue also with version 4.0.5.

Second, since the recent upgrade to OpenMPI v4.1.0, non-blocking communication has stopped working for me. The same code worked with OpenMPI v4.0.5.

This is the error I'm getting. I've confirmed that the problem occurs when I call MPI_Isend. I'm attaching a small example to reproduce this problem below.

Process 1 started 
Initiating communication on worker 1
[1622528555.546527] [ip-0A000007:9268 :0] sock.c:259 UCX ERROR connect(fd=30, dest_addr=127.0.0.1:58173) failed: Connection refused
[ip-0A000007:09268] pml_ucx.c:383  Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[ip-0A000007:09268] pml_ucx.c:453  Error: Failed to resolve UCX endpoint for rank 0
[ip-0A000007:09268] *** An error occurred in MPI_Isend
[ip-0A000007:09268] *** reported by process [3673554945,1]
[ip-0A000007:09268] *** on communicator MPI_COMM_WORLD
[ip-0A000007:09268] *** MPI_ERR_OTHER: known error not in list
[ip-0A000007:09268] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-0A000007:09268] ***    and potentially your MPI job)

Program code (mpi_isend.c):

#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    MPI_Comm comm;
    MPI_Request request;
    MPI_Status status;
    int myid, master, tag, proc, my_int, p, ierr;

    comm = MPI_COMM_WORLD;
    ierr = MPI_Init(&argc, &argv);       /* starts MPI */
    MPI_Comm_rank(comm, &myid);          /* get current process id */
    MPI_Comm_size(comm, &p);             /* get number of processes */

    master = 0;
    tag = 123;                           /* set the tag to identify this particular job */
    printf("Process %d started", myid);

    if (myid == master) {
        for (proc = 1; proc < p; proc++) {
            MPI_Recv(
                &my_int, 1, MPI_FLOAT,   /* triplet of buffer, size, data type */
                MPI_ANY_SOURCE,          /* message source */
                MPI_ANY_TAG,             /* message tag */
                comm, &status);          /* status identifies source, tag */
            printf("Received from 1 worker");
        }
        printf("Master finished");
    } else {
        printf("Initiating communication on worker %d", myid);
        MPI_Isend(                       /* non-blocking send */
            &my_int, 1, MPI_FLOAT,       /* triplet of buffer, size, data type */
            master,
            tag,
            comm,
            &request);                   /* send my_int to master */
        MPI_Wait(&request, &status);     /* block until Isend is done */
        printf("Worker %d finished", myid);
    }
    MPI_Finalize();                      /* let MPI finish up ... */
}

Jobfile (mpi_isend.job):

#!/bin/sh -l
#SBATCH --job-name=pool
#SBATCH --output=pool.out
#SBATCH --nodes=2
#SBATCH --time=600:00
#SBATCH --tasks-per-node=1
#SBATCH --partition=hpc
mpirun -mca ^hcoll mpi_isend

Steps to reproduce:

mpicc mpi_isend.c -o mpi_isend
sbatch mpi_isend.job

Chef error while upgrading to CycleCloud Slurm 2.4.8

My current versions are:

CycleCloud: 8.2.0-1616
Slurm: 20.11.7-1
CycleCloud-Slurm: 2.4.7
OS: CentOS Linux release 7.8.2003 (Core)

While upgrading CycleCloud-Slurm to version 2.4.8 I encountered the following error on the scheduler node while starting my cluster. I know this is a prerelease, but just wanted to make you aware of it. Hope this helps :-)

Chef::Mixin::Template::TemplateError: undefined method `[]' for nil:NilClass
Software Configuration
Review local log files on the VM at /opt/cycle/jetpack/logs
Get more help on this issue
Detail:

/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:163:in `rescue in _render_template'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:159:in `_render_template'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:147:in `render_template'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/template/content.rb:76:in `file_for_provider'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/file_content_management/content_base.rb:40:in `tempfile'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:450:in `tempfile'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:327:in `do_generate_content'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:140:in `action_create'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:152:in `action_create_if_missing'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider.rb:171:in `run_action'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource.rb:592:in `run_action'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:70:in `run_action'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in `block (2 levels) in converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in `each'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in `block in converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/resource_list.rb:94:in `block in execute_each_resource'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:114:in `call_iterator_block'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:85:in `step'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:103:in `iterate'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:55:in `each_with_index'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/resource_list.rb:92:in `execute_each_resource'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:97:in `converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:718:in `block in converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:713:in `catch'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:713:in `converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:752:in `converge_and_save'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:286:in `run'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:292:in `block in fork_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:280:in `fork'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:280:in `fork_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:245:in `block in run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/local_mode.rb:44:in `with_server_connectivity'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:233:in `run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:470:in `sleep_then_run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:459:in `block in interval_run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:458:in `loop'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:458:in `interval_run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:442:in `run_application'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:59:in `run'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/solo.rb:225:in `run'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/bin/chef-solo:25:in `<top (required)>'
/opt/cycle/jetpack/system/embedded/bin/chef-solo:23:in `load'
/opt/cycle/jetpack/system/embedded/bin/chef-solo:23:in `<main>'

1 node with this status

Incompatibility with tagging policies

When attempting to build clusters with one resource group per cluster, the build will fail if there are any policies enforcing the creation of certain tags on a resource group. Could you please include an option to apply arbitrary tags to the resource groups/resources created when deploying a cluster?

Support for Multiple VM Sizes per Partition

The current cyclecloud_slurm supports neither multiple MachineType values per nodearray nor multiple nodearrays assigned to the same Slurm partition. If multiple values for either are supplied, the Python code will take only the first value in the list. Remarks in the partition class definition say that a one-to-one mapping of partition names to nodearrays is required.

Cyclecloud cluster templates themselves support multiple machine type values per nodearray and Slurm supports multiple machine types per partition. The current limitation of one machine type per partition is a function of the Cyclecloud implementation. Users of a cluster would benefit from being able to ask for a number of cores in a single partition and having the scheduler determine which size VM to create.

slurm-libpmi is missing

Hi all,
I am trying to install openmpi with slurm support on cyclecloud:

spack install hpl^openmpi+pmi schedulers=slurm

But I get this error:

'/configure' '--prefix=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/openmpi-3.1.5-3uku7u7irjzvpxy5uwajd34ksg5vyyim' '--enable-shared' '--with-wrapper-ldflags=' '--with-pmi=/usr' '--enable-static' '--enable-mpi-cxx' '--with-zlib=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/zlib-1.2.11-7zv5elkj6r5xcrw4mifho5mfhi6wuwpq' '--without-psm' '--without-libfabric' '--without-ucx' '--without-mxm' '--without-verbs' '--without-psm2' '--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--without-tm' '--without-loadleveler' '--disable-memchecker' '--with-hwloc=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/hwloc-1.11.11-7tt3mdset5tqvje4uaonjdqm2b3koihp' '--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'

1 error found in build log:
     1125    configure: WARNING:     /usr/slurm
     1126    configure: WARNING: Specified path: /usr
     1127    configure: WARNING: OR neither libpmi nor libpmi2 were found under
             :
     1128    configure: WARNING:     /lib
     1129    configure: WARNING:     /lib64
     1130    configure: WARNING: Specified path:
  >> 1131    configure: error: Aborting

It is a very similar problem to: aws/aws-parallelcluster#1008
PMI has been split due to this bug: https://bugs.schedmd.com/show_bug.cgi?id=4511

May I ask to add the package slurm-libpmi?

Thanks

Slow autoscale with Array jobs

Autoscaling after submitting a Slurm array job results in very slow spin-up of the cluster.
This appears to be because, even with an arbitrarily large number of array tasks specified:

  • only a single extra node is requested,
  • this then spins up,
  • once it is up, the next array task starts, and only then is a new node requested.

Due to the ~5-10 minutes to spin up an instance, this severely limits autoscaling.

A workaround is to submit a dummy non-array job requesting a suitable number of nodes.
Even if this job is stuck behind the Array job in the queue, it still results in the cluster spinning up many nodes at once, which the array jobs are then dispatched to.

e.g. to request 100 CPUs worth of nodes
echo '#!/bin/sh' | sbatch -n 100

Race condition when two jobs request the same node

For jobs that can run on or have requested the same node, there's a race condition where the second job to request the node may fail with a communication error.

Working on a workaround to get the job to block until the node is completely up and registered with Slurm. It's not clear yet if this is a bug in Slurm or an issue with the Chef recipe order of operations.

additional support for RHEL clones (almalinux, rockylinux)

Hi,

As the new RHEL clones take over from CentOS, I'd consider AlmaLinux and Rocky Linux the major successor platforms.
Currently this repo only supports centos and rhel as platform strings, but it should support others too, especially since those are bug-compatible rebuilds.
I understand, though, that current CycleCloud support is limited to ubuntu, centos, and rhel.

Thanks

Slurm nodenames do not match azure hostnames - so head node cannot communicate with nodes

This is my first stab at setting up Azure HPC using CycleCloud and Slurm, so forgive me for stupid mistakes...

Also, if this is the wrong place for this - please point me to where I should post support questions...

I have built a simple (default) Slurm cluster using CycleCloud and the nodes start/stop OK, but when I run a simple (hostname) job it just hangs at "Completing".

Debug logging seems to suggest that the Slurm Scheduler node cannot communicate with the nodes:

[azccadmin@SlurmCluster-1-scheduler data]$ srun -N2 -n2 -t00:15:00 -Jgrma-hostname hostname.sh
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-1"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-1, check slurm.conf
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-2"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-2, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-2: Can't find an address, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-1: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted

I cannot ping the node from the Scheduler node:

[azccadmin@SlurmCluster-1-scheduler ~]$ ping slurmcluster-1-hpc-pg0-1
ping: slurmcluster-1-hpc-pg0-1: Name or service not known

Slurm is trying to talk to the node as "slurmcluster-1-hpc-pg0-1", but the hostname of the node is actually "slurmcluster-1-slurmcluster-1-hpc-pg0-1"

And I can ping it with this name from the Scheduler node:

[azccadmin@SlurmCluster-1-scheduler data]$ ping slurmcluster-1-slurmcluster-1-hpc-pg0-1
PING slurmcluster-1-slurmcluster-1-hpc-pg0-1.yr5ran05dk5uzlz13kqz0cq4xe.ax.internal.cloudapp.net (192.168.140.6) 56(84) bytes of data.
64 bytes from slurmcluster-1-slurmcluster-1-hpc-pg0-1.internal.cloudapp.net (192.168.140.6): icmp_seq=1 ttl=64 time=1.77 ms

I am using CycleCloud 8.3, which I note has a fix for Slurm NodeName / Azure hostname mismatches, but this still seems to be an issue.

Thanks

Gary

SLURM 2.4.8 pmix libs are not included

SLURM 2.4.8 is compiled with pmix support so that MPI libraries will interface correctly with Slurm's srun.
It seems that SLURM 2.4.8 expects the pmix libs to be installed on the compute nodes at /opt/pmix/v3, but these do not exist and need to be built (e.g. via a cluster-init project).

#!/bin/bash

cd ~/
mkdir -p /opt/pmix/v3
apt install -y libevent-dev
tar xvf $CYCLECLOUD_SPEC_PATH/files/openpmix-3.1.6.tar.gz
cd openpmix-3.1.6
#mkdir -p pmix/build/v3 pmix/install/v3
#cd pmix
#git clone https://github.com/openpmix/openpmix.git source
#cd source/
#git branch -a
#git checkout v3.1
#git pull
./autogen.sh
#cd ../build/v3/
./configure --prefix=/opt/pmix/v3
make -j install >/dev/null

Can the pmix libs be included with SLURM versions built with pmix support (SLURM 2.4.8+)?

Slurm headless/burst topology.conf when "scale" cluster

When the cluster is scaled (i.e. cyclecloud_slurm.sh scale), the topology.conf file is overwritten. That also overwrites the on-prem switch definitions. Is it possible to prevent this, or to separate the on-prem topology from the Cycle topology?

Mounting docker volume in docker script causes a problem

docker run -v $(pwd)/specs/default/cluster-init/files:/source -v $(pwd)/blobs:/root/rpmbuild/RPMS/x86_64 -ti centos:7 /bin/bash /source/00-build-slurm.sh triggers the error:
/bin/bash: /source/00-build-slurm.sh: Permission denied

because inside the container the permissions of the source dir are the following:

[root@11d3f968ca8c /]# ll
total 12
-rw-r--r--. 1 root root 12082 Mar 5 17:36 anaconda-post.log
drwxrwxr-x. 2 1002 1002 49 Apr 11 12:55 source
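
One hedged workaround, assuming the dot in the permission string indicates an SELinux label rather than a plain Unix permission problem, is to make the script world-readable and let Docker relabel the bind mounts:

# Hypothetical workaround: relabel the bind mounts for SELinux (:z) and make
# sure the build script is readable inside the container.
chmod -R a+rX specs/default/cluster-init/files
docker run -v "$(pwd)/specs/default/cluster-init/files:/source:z" \
       -v "$(pwd)/blobs:/root/rpmbuild/RPMS/x86_64:z" \
       -ti centos:7 /bin/bash /source/00-build-slurm.sh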

SlurmDBD role for scheduler HA

Per discussion with @anhoward, there is a need for a new role for the scenario with HA schedulers and Slurm Accounting. When the primary scheduler fails, SlurmDBD should be able to run on the HA scheduler.

GPU allocation stuck on "Waiting for resource configuration"

I've uploaded version 2.7.1 to the cluster (after downloading the blobs from GitHub). I've modified the template to include a GPU partition, along with parameter definitions corresponding to that partition. I've selected NC6 as the GPU machine type and started the cluster. Everything starts fine and I'm able to allocate the F32s_v2 nodes that correspond to the hpc partition. However, when allocating the GPU node, the node starts without any error reported in the console, yet the scheduler does not appear to recognize this and stays stuck on:

[hpcadmin@ip-0A030006 ~]$ salloc -p gpu -n 1
salloc: Granted job allocation 3
salloc: Waiting for resource configuration

This is a problem with both the CentOS 7 and AlmaLinux 8 operating systems. I have the following cloud-init scripts to install Singularity in each case. These don't appear to be the problem, as any error in the script typically gets reported as an error in the web console.

CentOS 7:

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el7.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el7.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el7.x86_64.rpm

Almalinux 8

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el8.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el8.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el8.x86_64.rpm

Version mismatch between docker-rpmbuild and 00-build-slurm.sh

Hello.

I think there is a small bug in the 2.1.0 release.

When I do the following:

git clone
git checkout 2.1.0
git pull origin 2.1.0 (to be sure :P)
cd cyclecloud-slurm
./docker-rpmbuild.sh

the container starts and builds the *.rpm and *.deb files into /blobs.
When I try to upload or build the project (cyclecloud project upload / cyclecloud project build), I get a file-not-found error looking for files in ./blobs that have a mismatched version:

project.ini shows the following defined in the [blobs] section (note the versions):

[blobs]
Files = cyclecloud-api-7.9.2.tar.gz, job_submit_cyclecloud_centos_18.08.9-1.so, slurm-18.08.9-1.el7.x86_64.rpm, slurm-contribs-18.08.9-1.el7.x86_64.rpm, etc, etc, etc

However, after running docker-rpmbuild.sh and looking in ./blobs, here is the file listing:

[root@cyclecloud blobs]# ls -tlr
total 77336
-rw-r--r--. 1 root root 16362 Mar 5 19:23 cyclecloud-api-7.9.2.tar.gz
-rw-r--r--. 1 root root 13426172 Mar 15 09:24 slurm-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 796360 Mar 15 09:24 slurm-perlapi-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 78540 Mar 15 09:24 slurm-devel-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 4996 Mar 15 09:24 slurm-example-configs-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 1157424 Mar 15 09:24 slurm-slurmctld-18.08.8-1.el7.x86_64.rpm

As well, the files

cyclecloud-api-7.9.2.tar.gz
job_submit_cyclecloud_ubuntu_19.05.5-1.so
job_submit_cyclecloud_centos_19.05.5-1.so

do not seem to be built by the container with docker-rpmbuild.sh; I had to fetch them from the GitHub releases page (wget).

When I changed the VERSION in the 00-build-slurm.sh script to 18.08.9 and re-ran the docker-rpmbuild.sh script, I can see the files are built with the version matching what project.ini is looking for.

I'm not 100% sure this is a bug (if so, I can do a PR if needed; it seems simple) or if I'm missing something in the steps to build/deploy this project to a locker.

Thanks,
Daniel

[root@cyclecloud blobs]# ls -ltr
-rw-r--r--. 1 root root 13424068 Mar 15 09:47 slurm-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 795780 Mar 15 09:47 slurm-perlapi-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 78540 Mar 15 09:47 slurm-devel-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 5000 Mar 15 09:47 slurm-example-configs-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 1157280 Mar 15 09:47 slurm-slurmctld-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 626380 Mar 15 09:47 slurm-slurmd-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 670392 Mar 15 09:47 slurm-slurmdbd-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 140940 Mar 15 09:47 slurm-libpmi-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 116664 Mar 15 09:47 slurm-torque-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 8476 Mar 15 09:47 slurm-openlava-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 16612 Mar 15 09:47 slurm-contribs-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 147612 Mar 15 09:47 slurm-pam_slurm-18.08.9-1.el7.x86_64.rpm
-rwxr-xr-x. 1 root root 198400 Mar 15 09:51 job_submit_cyclecloud_centos_18.08.9-1.so
-rwxr-xr-x. 1 root root 198400 Mar 15 09:51 job_submit_cyclecloud_ubuntu_18.08.9-1.so
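
A hedged sketch of the local workaround described above (the VERSION variable name follows the issue's wording; the script path is an assumption based on the docker run command quoted in another issue on this page):

# Hypothetical version bump so the built artifacts match project.ini.
sed -i 's/^VERSION=18.08.8/VERSION=18.08.9/' specs/default/cluster-init/files/00-build-slurm.sh
./docker-rpmbuild.sh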

Slurm 3.0.1 dynamic partition gres.conf uses entire core allotment

CC = 8.4.0
Slurm Cluster-init = 3.0.1
Slurm version = 22.05.8-1

ISSUE
Using a dynamic partition with multiple VM types, including a GPU type, results in gres.conf being built against the entire core allotment of the nodearray (all of its node names), even though the GPU nodes are only a subset of the total.

STEPS TO REPRODUCE

  1. create a custom cluster with dynamic nodearray limited to 100 cores and Config.Multiselect = true
  2. select multiple VM types for the dynamic partition and start the cluster
  3. run scontrol create nodename=... to register the nodes for autoscaling in Slurm, for example:
scontrol create NodeName=jm-slurm-multi-low-[1-10] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=3072 Feature=dyn,Standard_F2s_v2 State=CLOUD
scontrol create NodeName=jm-slurm-multi-mid-[1-10] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7168 Feature=dyn,Standard_D2ds_v4 State=CLOUD
scontrol create NodeName=jm-slurm-multi-high-[1-15] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=15360 Feature=dyn,Standard_E2ds_v4 State=CLOUD
scontrol create NodeName=jm-slurm-multi-gpu-[1-5] CPUs=6 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=54476 Gres=gpu:1 Feature=dyn,Standard_NC6_Promo State=CLOUD
  4. run azslurm scale to build the gres.conf
  5. view /etc/slurm/gres.conf
root@jm-slurm-multi-hn:~# cat /etc/slurm/gres.conf

Nodename=jm-slurm-mutli-dynamic-[1-16] Name=gpu Count=1 File=/dev/nvidia0

WORKAROUND
Manually update the gres.conf for both hostname and quantity
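
For illustration, a hedged sketch of that manual workaround (node names and GPU count follow the scontrol commands above; the device path is assumed to match the generated file):

# Hypothetical manual fix: restrict the gres.conf entry to the GPU nodes only,
# then have slurmctld re-read its configuration.
cat > /etc/slurm/gres.conf <<'EOF'
Nodename=jm-slurm-multi-gpu-[1-5] Name=gpu Count=1 File=/dev/nvidia0
EOF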

Question regarding /sched default size

I noticed that by default the /sched mount in the slurm.txt template is 1 TB. But as far as I understand, it only contains some Slurm and CycleCloud configuration. Why does it need all that space?

Node doesn't match existing scaleset when you add a new node to a node array

CycleCloud - 8.1.0-1275
Slurm - 19.05.8-1

Scenario:

  1. Submit a job to the HPC node array.
  2. Increase max core count from UI and run scale (cyclecloud_slurm.sh scale) command.
  3. The node that is in the acquiring state gets stuck with the message below:

{{{
This node does not match existing scaleset attributes: Configuration.cyclecloud.mounts.additional_nfs.export_path, Configuration.cyclecloud.mounts.additional_nfs.mountpoint
}}}

Scaleset mismatch
Node attributes - Good Node
Node attributes - Failed Node

documentation is not clear

Can you please update the documentation so that it explains how and where to clone this repo?
It is said that you have to change directory to the slurm directory, but where is that?

Problem using the default image

I'm not sure if I'm doing something wrong, but I'm utterly failing to use the default Slurm image to do some basic MPI.

I've stood up a simple cyclecloud-slurm setup with the default image (Cycle CentOS 7) for all node types.

I can run a simple single node MPI job as long as I use Intel MPI:

srun -n2 mpirun ./hello

That works fine.

But as soon as I try to use more than one node, I get all sorts of infiniband related error messages.

If I try to use a machinetype that supports Infiniband I end up with no functioning MPI at all, as it appears to not get installed.

Try even a single node MPI test with OpenMPI and I get errors possibly related to hostname issues:

--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[22239,0],0] on node ip-0A000406
  Remote daemon: [[22239,0],1] on node hpc-1

Looks like OpenMPI is working with the Slurm node name (hpc-1) rather than the actual hostname (ip-0A000406), and then possibly getting upset as a result?

If I've missed something really obvious, please do point it out :)

Thanks,

John

each job starts a virtual machine

I'm wondering why each job submission ends up on its own dedicated virtual machine even though slots are available on, e.g., the node of the first job.

Confusing PCPU vs VCPU behavior

  1. We need to make this project's use of physical vs. virtual cores, and how those are represented, configurable.

  2. CoresPerSocket does not seem to be calculated correctly. Here is a 32-vCPU, 16-pCPU VM, but it does not use 2 for CoresPerSocket:
    Nodename=htc-[1-4] Feature=cloud STATE=CLOUD CPUs=16 CoresPerSocket=1 RealMemory=124518

Enable Hyperthreading Support using threads

The latest release switched from vCPUs to pCPUs, so hyperthreading is no longer captured in the configuration. Suggest using threads (ThreadsPerCore) to allow 2 threads per core on VMs that support it.
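
For illustration, a hedged example of what such a node definition might look like, borrowing the Standard_F32s_v2 numbers reported earlier on this page (not output from this project):

Nodename=htc-[1-4] Feature=cloud STATE=CLOUD CPUs=32 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=62914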

`slurmd_sysconfig` may not work for Debian users when `slurm.install` is false

Native Debian (and also Ubuntu) uses /etc/default/slurmd as the system config path, instead of RHEL's /etc/sysconfig/slurmd.
Therefore, if the user uses native Slurm packages, the following code may be useless.

directory '/etc/sysconfig' do
  action :create
end

file '/etc/sysconfig/slurmd' do
  content slurmd_sysconfig
  mode '0700'
  owner 'slurm'
  group 'slurm'
end

A fix could be: when node['slurm']['install'] == true and the /etc/default/slurmd directory exists, write the file into that directory instead.

change slurm installation path

Hi,
is there any option to change the SLURM installation path?
The benefit would be syncing it with onpremise deployments which makes transition easier.

Thanks
