
cyclecloud-slurm's Introduction

Slurm

This project sets up an auto-scaling Slurm cluster. Slurm is a highly configurable open source workload manager. See the Slurm project site for an overview.

Slurm Clusters in CycleCloud versions < 8.4.0

See Transitioning from 2.7 to 3.0 for more information.

Making Cluster Changes

The Slurm cluster deployed in CycleCloud contains a CLI called azslurm that facilitates making cluster changes. After making any changes to the cluster, run the following command as root on the Slurm scheduler node to rebuild azure.conf and update the nodes in the cluster:

      $ sudo -i
      # azslurm scale

This creates the partitions with the correct number of nodes and the proper gres.conf, and restarts slurmctld.
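
A quick way to sanity-check the result afterwards (a minimal sketch; azure.conf is the file azslurm writes, as noted in the troubleshooting section below):

      # sinfo -N                      # the updated node definitions should appear here
      # cat /etc/slurm/azure.conf     # partition/node definitions generated by azslurm scale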

No longer pre-creating execute nodes

As of 3.0.0, we no longer pre-create the nodes in CycleCloud. Nodes are created when azslurm resume is invoked, or by manually creating them in CycleCloud (via the CLI, for example).
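
If you want Slurm itself to trigger creation of specific nodes rather than waiting for a job, you can use the standard power-saving interface. This is a sketch assuming the default htc node names; it simply causes Slurm to invoke the configured ResumeProgram (this project's resume_program.sh, which is equivalent to azslurm resume):

      # Power up (and therefore create in CycleCloud) two htc nodes
      scontrol update NodeName=htc-[1-2] State=POWER_UP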

Creating additional partitions

The default template that ships with Azure CycleCloud has three partitions (hpc, htc and dynamic), and you can define custom nodearrays that map directly to Slurm partitions. For example, to create a GPU partition, add the following section to your cluster template:

   [[nodearray gpu]]
   MachineType = $GPUMachineType
   ImageName = $GPUImageName
   MaxCoreCount = $MaxGPUExecuteCoreCount
   Interruptible = $GPUUseLowPrio
   AdditionalClusterInitSpecs = $ExecuteClusterInitSpecs

      [[[configuration]]]
      slurm.autoscale = true
      # Set to true if nodes are used for tightly-coupled multi-node jobs
      slurm.hpc = false

      # Optionally over-ride the Device File locations for gres.conf 
      # (The example here shows the default for an NVidia sku with 8 GPUs)
      # slurm.gpu_device_config = /dev/nvidia[0-7]

      [[[cluster-init cyclecloud/slurm:execute:3.0.8]]]
      [[[network-interface eth0]]]
      AssociatePublicIpAddress = $ExecuteNodesPublic
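
After editing the template, the cluster has to pick up the new nodearray before azslurm scale can generate the partition. A sketch, assuming you manage the cluster with the CycleCloud CLI; the cluster and template file names here are placeholders:

      # Re-import the modified template over the existing cluster
      cyclecloud import_cluster MyCluster -c Slurm -f slurm-template.txt --force

      # Then, on the scheduler node, regenerate the Slurm configuration
      sudo -i
      azslurm scale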

Dynamic Partitions

In cyclecloud-slurm projects >= 3.0.1, we support dynamic partitions. You can make a nodearray map to a dynamic partition by adding the following. Note that mydyn could be any valid feature; it could also be more than one feature, separated by commas.

      [[[configuration]]]
      slurm.autoscale = true
      # Set to true if nodes are used for tightly-coupled multi-node jobs
      slurm.hpc = false
      # This is the minimum, but see slurmd --help and [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) for more information.
      slurm.dynamic_config := "-Z --conf \"Feature=mydyn\""

This will generate a dynamic partition like the following:

# Creating dynamic nodeset and partition using slurm.dynamic_config=-Z --conf "Feature=mydyn"
Nodeset=mydynamicns Feature=mydyn
PartitionName=mydynamic Nodes=mydynamicns

Using Dynamic Partitions to Autoscale

By default, we define no nodes in the dynamic partition.

You can pre-create node records like so, which allows Slurm to autoscale them up.

scontrol create nodename=f4-[1-10] Feature=mydyn,Standard_F2s_V2 cpus=2 State=CLOUD

One other advantage of dynamic partitions is that you can support multiple VM sizes in the same partition. Simply add the VM Size name as a feature, and then azslurm can distinguish which VM size you want to use.

Note: the VM size is added implicitly; you do not need to add it to slurm.dynamic_config.

scontrol create nodename=f4-[1-10] Feature=mydyn,Standard_F4 State=CLOUD
scontrol create nodename=f8-[1-10] Feature=mydyn,Standard_F8 State=CLOUD

Either way, once you have created these nodes in State=CLOUD, they are available to autoscale like other nodes.
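
For example, a job can then target a specific VM size within the dynamic partition by requesting its feature as a constraint. A sketch using the partition and feature names from the examples above:

      # Run on two of the Standard_F4 nodes created above
      sbatch --partition=mydynamic --constraint=Standard_F4 -N2 --wrap="hostname"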

Multiple VM sizes are supported by default for dynamic partitions; this is configured via the Config.Multiselect field in the Slurm template, as shown here:

        [[[parameter DynamicMachineType]]]
        Label = Dyn VM Type
        Description = The VM type for Dynamic nodes
        ParameterType = Cloud.MachineType
        DefaultValue = Standard_F2s_v2
        Config.Multiselect = true

Note for Slurm 23.11.7 users:

Dynamic partition behaviour has changed in Slurm 23.11.7. When adding dynamic nodes containing GRES such as GPUs, the /etc/slurm/gres.conf file needs to be modified before running the scontrol create nodename command. If this is not done, Slurm will report an invalid nodename, as shown here:

root@s3072-scheduler:~# scontrol create NodeName=e1 CPUs=24 Gres=gpu:4 Feature=dyn,nv24 State=cloud
scontrol: error: Invalid argument (e1)
Error creating node(s): Invalid argument

Simply add the node e1 in /etc/slurm/gres.conf and then the command will work.
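
A minimal gres.conf entry for the example above might look like this (the device-file path is an assumption for a 4-GPU NVIDIA node; adjust it to your VM size):

      # /etc/slurm/gres.conf
      Nodename=e1 Name=gpu Count=4 File=/dev/nvidia[0-3]

After adding the line, re-run the scontrol create command.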

Dynamic Scaledown

By default, all nodes in the dynamic partition will scale down just like the other partitions. To disable this, see SuspendExcParts.

Manual scaling

If cyclecloud_slurm detects that autoscale is disabled (SuspendTime=-1), it will use the FUTURE state to denote nodes that are powered down, instead of relying on the power state in Slurm. That is, when autoscale is enabled, powered-off nodes are denoted as idle~ in sinfo; when autoscale is disabled, powered-off nodes will not appear in sinfo at all. You can still see their definition with scontrol show nodes --future.

To start new nodes, run /opt/azurehpc/slurm/resume_program.sh node_list (e.g. htc-[1-10]).

To shutdown nodes, run /opt/azurehpc/slurm/suspend_program.sh node_list (e.g. htc-[1-10]).

To start a cluster in this mode, simply add SuspendTime=-1 to the additional slurm config in the template.

To switch a cluster to this mode, add SuspendTime=-1 to the slurm.conf and run scontrol reconfigure. Then run azslurm remove_nodes && azslurm scale.
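
Putting the manual-scaling steps together, a sketch of switching an existing cluster over (paths and node names as above):

      sudo -i
      # 1. Disable autoscale: edit /etc/slurm/slurm.conf, set SuspendTime=-1, then
      scontrol reconfigure
      # 2. Remove the node records azslurm was managing and regenerate the config
      azslurm remove_nodes && azslurm scale
      # 3. From now on, power nodes on and off manually
      /opt/azurehpc/slurm/resume_program.sh htc-[1-10]
      /opt/azurehpc/slurm/suspend_program.sh htc-[1-10]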

Accounting

To enable accounting in Slurm, MariaDB can now be started via cloud-init on the scheduler node, and slurmdbd can be configured to connect to the database without a password string. In the absence of a database URL and password, the slurmdbd configuration defaults to localhost. One way of doing this is to add the following lines to cluster-init:

#!/bin/bash
yum install -y mariadb-server
systemctl enable mariadb.service
systemctl start mariadb.service
mysql --connect-timeout=120 -u root -e "ALTER USER root@localhost IDENTIFIED VIA mysql_native_password ; FLUSH privileges;"
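
Once slurmdbd is up and pointing at the database, accounting can be verified with the standard Slurm tools (a sketch; output depends on your cluster):

      sacctmgr show cluster     # the cluster should be registered with slurmdbd
      sacct -a                  # job records should start to accumulate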

AzureCA.pem and existing MariaDB/MySQL instances

In previous versions, we shipped with an embedded certificate to connect to Azure MariaDB and Azure MySQL instances. This is no longer required. However, if you wish to restore this behavior, select the 'AzureCA.pem' option from the dropdown for the 'Accounting Certificate URL' parameter in your cluster settings.

Cost Reporting

azslurm in the Slurm 3.0 project now comes with a new experimental feature, azslurm cost, to display the cost of Slurm jobs. This requires CycleCloud 8.4 or newer, as well as Slurm accounting enabled.

usage: azslurm cost [-h] [--config CONFIG] [-s START] [-e END] -o OUT [-f FMT]

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG, -c CONFIG
  -s START, --start START
                        Start time period (yyyy-mm-dd), defaults to current
                        day.
  -e END, --end END     End time period (yyyy-mm-dd), defaults to current day.
  -o OUT, --out OUT     Directory name for output CSV
  -f FMT, --fmt FMT     Comma separated list of SLURM formatting options.
                        Otherwise defaults are applied

Cost reporting currently only works with retail Azure pricing, and hence may not reflect actual customer invoices.

To generate cost reports for a given time period:

 azslurm cost -s 2023-03-01 -e 2023-03-31 -o march-2023

This will create a directory march-2023 and generate CSV files containing costs for jobs and partitions.

[root@slurm301-2-scheduler ~]# ls march-2023/
jobs.csv  partition.csv  partition_hourly.csv
  1. jobs.csv: contains costs per job, based on job runtime. Currently running jobs are included.
  2. partition.csv: contains costs per partition, based on total usage in each partition. For partitions where multiple VM sizes can be included, such as dynamic partitions, it includes a row for each VM size.
  3. partition_hourly.csv: contains a CSV report for each partition on an hourly basis.

Some basic formatting support is available for customizing the fields in the jobs report that are appended from sacct data. Cost reporting fields such as sku_name,region,spot,meter,meterid,metercat,rate,currency,cost are always appended, but the Slurm fields from sacct can be customized. Any field available in sacct -e is valid. To customize formatting:

azslurm cost -s 2023-03-01 -e 2023-03-31 -o march-2023 -f account,cluster,jobid,jobname,reqtres,start,end,state,qos,priority,container,constraints,user

This will append the supplied formatting options to the cost reporting fields, and produce the jobs CSV file with the following columns:

account,cluster,jobid,jobname,reqtres,start,end,state,qos,priority,container,constraints,user,sku_name,region,spot,meter,meterid,metercat,rate,currency,cost

Formatting is only available for jobs and not for partition and partition_hourly data.

Note: azslurm cost relies on Slurm's AdminComment feature to associate VM size and meter information with jobs.
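
To see that metadata for a given job, the AdminComment field can be inspected directly with sacct (a sketch; the job ID is illustrative, and AdminComment must be among the fields listed by sacct -e on your Slurm version):

      sacct -j 123 --format=JobID,JobName,State,AdminComment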

Troubleshooting

UID conflicts for Slurm and Munge users

By default, this project uses a UID and GID of 11100 for the Slurm user and 11101 for the Munge user. If this causes a conflict with another user or group, these defaults may be overridden.

To override the UID and GID, click the edit button for the scheduler node and for each nodearray (for example, the htc array), and add the following attributes at the end of the Configuration section of each.
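
As a sketch, the Configuration section for the scheduler node and each nodearray might then contain something like the following. The attribute names here are an assumption; consult the cluster UI for the exact names in your version:

      [[[configuration]]]
      # Assumed attribute names - override the Slurm and Munge UID/GID defaults
      slurm.user.uid = 11100
      slurm.user.gid = 11100
      munge.user.uid = 11101
      munge.user.gid = 11101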

Incorrect number of GPUs

For some regions and VM sizes, some subscriptions may report an incorrect number of GPUs. This value is controlled in /opt/azurehpc/slurm/autoscale.json.

The default definition looks like the following:

  "default_resources": [
    {
      "select": {},
      "name": "slurm_gpus",
      "value": "node.gpu_count"
    }
  ],

Note that here it is saying "For all VM sizes in all nodearrays, create a resource called slurm_gpus with the value of the gpu_count CycleCloud is reporting".

A common solution is to add a specific override for that VM size (in this case, 8 GPUs). Note that the ordering here is critical: the blank select statement matches all possible VM sizes, so any definition listed after it is ignored, which means the specific override must come first. For more information on how default_resources work in ScaleLib, the underlying library used in all CycleCloud autoscalers, see the ScaleLib documentation.

  "default_resources": [
    {
      "select": {"node.vm_size": "Standard_XYZ"},
      "name": "slurm_gpus",
      "value": 8
    },
    {
      "select": {},
      "name": "slurm_gpus",
      "value": "node.gpu_count"
    }
  ],

Simply run azslurm scale again for the changes to take effect. Note that if you need to iterate on this, you may also run azslurm partitions, which will write the partition definition out to stdout. This output will match what is in /etc/slurm/azure.conf after azslurm scale is run.
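
A typical iteration loop when tuning default_resources looks roughly like this (a sketch; use whichever editor you prefer):

      vi /opt/azurehpc/slurm/autoscale.json    # adjust the default_resources overrides
      azslurm partitions                       # preview the generated partition/node definitions on stdout
      azslurm scale                            # apply: rewrites /etc/slurm/azure.conf and restarts slurmctld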

Dampening Memory

Slurm requires that the memory you define for a node accounts for what the OS and applications consume. If a node reports less memory at runtime than the value defined for it, Slurm will reject the node. To overcome this, by default we dampen the memory by 5% or 1 GB, whichever is larger.

To change this dampening, there are two options.

  1. You can define slurm.dampen_memory=X, where X is an integer percentage (5 == 5%); see the template snippet below.
  2. Create a default_resource definition in the /opt/azurehpc/slurm/autoscale.json file:

  "default_resources": [
    {
      "select": {},
      "name": "slurm_memory",
      "value": "node.memory"
    }
  ],

Default resources are a powerful tool provided by ScaleLib, the underlying library. See the ScaleLib documentation for more information.

Note: slurm.dampen_memory takes precedence, so the default_resource slurm_memory will be ignored if slurm.dampen_memory is defined.
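
For example, to dampen memory by 10% instead of the default, the nodearray's Configuration section could contain the following (a sketch using the same template syntax as the earlier examples):

      [[[configuration]]]
      slurm.autoscale = true
      # Reserve 10% of the VM's memory for OS/applications
      slurm.dampen_memory = 10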

KeepAlive set in CycleCloud and Zombie nodes

If you choose to set KeepAlive=true in CycleCloud, then Slurm will still change its internal power state to powered_down. At this point, the node becomes a zombie node: a node that exists in CycleCloud but is in a powered_down state in Slurm.

Prior to 3.0.7, Slurm would try and fail to resume zombie nodes over and over again. As of 3.0.7, the zombie node will be left in a down~ (or drained~) state. If you want the zombie node to rejoin the cluster, you must log into it and restart slurmd, typically via systemctl restart slurmd. If you want these nodes to be terminated instead, you can either terminate them manually via the UI or azslurm suspend, or, to do this automatically, add the following to the autoscale.json file found at /opt/azurehpc/slurm/autoscale.json.

This will change the behavior of the azslurm return_to_idle command that is, by default, run as a cronjob every 5 minutes. You can also execute it manually, with the argument --terminate-zombie-nodes.

{
  "return-to-idle": {
    "terminate-zombie-nodes": true
  }
}

Transitioning from 2.7 to 3.0

  1. The installation folder changed /opt/cycle/slurm -> /opt/azurehpc/slurm

  2. Logs are now in /opt/azurehpc/slurm/logs instead of /var/log/slurmctld. Note that slurmctld.log itself will still be in /var/log/slurmctld.

  3. cyclecloud_slurm.sh no longer exists. Instead, there is the azslurm CLI, which can be run as root. azslurm supports autocomplete.

    [root@scheduler ~]# azslurm
    usage: 
    accounting_info      - 
    buckets              - Prints out autoscale bucket information, like limits etc
    config               - Writes the effective autoscale config, after any preprocessing, to stdout
    connect              - Tests connection to CycleCloud
    cost                 - Cost analysis and reporting tool that maps Azure costs to SLURM Job Accounting data. This is an experimental feature.
    default_output_columns - Output what are the default output columns for an optional command.
    generate_topology    - Generates topology plugin configuration
    initconfig           - Creates an initial autoscale config. Writes to stdout
    keep_alive           - Add, remeove or set which nodes should be prevented from being shutdown.
    limits               - 
    nodes                - Query nodes
    partitions           - Generates partition configuration
    refresh_autocomplete - Refreshes local autocomplete information for cluster specific resources and nodes.
    remove_nodes         - Removes the node from the scheduler without terminating the actual instance.
    resume               - Equivalent to ResumeProgram, starts and waits for a set of nodes.
    resume_fail          - Equivalent to SuspendFailProgram, shutsdown nodes
    retry_failed_nodes   - Retries all nodes in a failed state.
    scale                - 
    shell                - Interactive python shell with relevant objects in local scope. Use --script to run python scripts
    suspend              - Equivalent to SuspendProgram, shutsdown nodes
    wait_for_resume      - Wait for a set of nodes to converge.
  4. Nodes are no longer pre-populated in CycleCloud. They are only created when needed.

  5. All Slurm binaries are inside the azure-slurm-install-pkg*.tar.gz file, under slurm-pkgs. They are pulled from a specific binary release. The current binary release is 2023-08-07.

  6. For MPI jobs, the only network boundary that exists by default is the partition. Unlike 2.x, there are not multiple "placement groups" per partition, so you only have one colocated VMSS per partition. The topology plugin is also no longer used, which means the job submission plugin it required is no longer needed either. Instead, submitting to multiple partitions is now the recommended option for use cases that require submitting jobs to multiple placement groups.

Ubuntu 22 or greater and DNS hostname resolution

Due to an issue with the underlying DNS registration scheme that is used across Azure, our Slurm scripts use a mitigation that involves restarting systemd-networkd when changing the hostname of VMs deployed in a VMSS. This mitigation can be disabled by adding the following to your Configuration section.

      [[[configuration]]]
      slurm.ubuntu22_waagent_fix = false

In future releases, this mitigation will be disabled by default when the issue is resolved in waagent.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


cyclecloud-slurm's Issues

Confusing PCPU vs VCPU behavior - Standard_F32s_v2

Hello,

I have configured a partition with nodetype Standard_F32s_v2 (or different CPUs).
Cyclecloud-slurm.sh (v2.5.1) creates an "invalid configuration" compared to the output of "slurmd -C" (v20.11.8).

cyclecloud:
Feature=cloud STATE=CLOUD CPUs=16 ThreadsPerCore=2 RealMemory=62914

slurmd -C
CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=64401

documentation is not clear

can you please update the documentation so that it explains how and where to clone this repo ?
it is said that you have to change directory to the slurm directory, but where is that one ?

ResumeTimeout leaves nodes in down~ state

Nodes that do not start within ResumeTimeout (default is 10 minutes) enter the down~ state and will not come out of this. I recommend enabling return_to_idle.sh on all versions of Slurm, not just <= 18.

Slurm nodenames do not match azure hostnames - so head node cannot communicate with nodes

This is my first stab at setting up Azure HPC using CycleCloud and Slurm, so forgive me for stupid mistakes...

Also, if this is the wrong place for this - please point me to where I should post support questions...

I have built a simple (default) Slurm cluster using CycleCloud and the nodes start/stop OK, but when I run a simple (hostname) job it just hangs at "Completing".

Debug logging seems to suggest that the Slurm Scheduler node cannot communicate with the nodes:

[azccadmin@SlurmCluster-1-scheduler data]$ srun -N2 -n2 -t00:15:00 -Jgrma-hostname hostname.sh
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-1"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-1, check slurm.conf
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-2"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-2, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-2: Can't find an address, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-1: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted

I cannot ping the node from the Scheduler node:

[azccadmin@SlurmCluster-1-scheduler ~]$ ping slurmcluster-1-hpc-pg0-1
ping: slurmcluster-1-hpc-pg0-1: Name or service not known

Slurm is trying to talk to the node as "slurmcluster-1-hpc-pg0-1", but the hostname of the node is actually "slurmcluster-1-slurmcluster-1-hpc-pg0-1"

And I can ping it with this name from the Scheduler node:

[azccadmin@SlurmCluster-1-scheduler data]$ ping slurmcluster-1-slurmcluster-1-hpc-pg0-1
PING slurmcluster-1-slurmcluster-1-hpc-pg0-1.yr5ran05dk5uzlz13kqz0cq4xe.ax.internal.cloudapp.net (192.168.140.6) 56(84) bytes of data.
64 bytes from slurmcluster-1-slurmcluster-1-hpc-pg0-1.internal.cloudapp.net (192.168.140.6): icmp_seq=1 ttl=64 time=1.77 ms

I am using CycleCloud 8.3 - which I note has a fix for Slurm NodeName / Azure Hostname (but this seems to still be an issue)?

Thanks

Gary

Version mismatch between docker-rpmbuild and 00-build-slurm.sh

Hello.

I think there is a small bug in the 2.1.0 release.

when i do the following:

git clone
git checkout 2.1.0
git pull origin 2.1.0 (to be sure :P)
cd cyclecloud-slurm
./docker-rpmbuild.sh

the container starts and builds the *.rpm and *.deb files into /blobs.
When i try to upload or build the project (cyclecloud project upload / cyclecoud project build), i get a file not found error looking for files in ./blobs that have a mismatched version:

project.ini shows the following defined in blobs section (note the versions):

[blobs]
Files = cyclecloud-api-7.9.2.tar.gz, job_submit_cyclecloud_centos_18.08.9-1.so, slurm-18.08.9-1.el7.x86_64.rpm, slurm-contribs-18.08.9-1.el7.x86_64.rpm, etc, etc, etc

however, after running docker-buildrpm.sh, and looking in ./blobs, here is the file listing:

[root@cyclecloud blobs]# ls -tlr
total 77336
-rw-r--r--. 1 root root 16362 Mar 5 19:23 cyclecloud-api-7.9.2.tar.gz
-rw-r--r--. 1 root root 13426172 Mar 15 09:24 slurm-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 796360 Mar 15 09:24 slurm-perlapi-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 78540 Mar 15 09:24 slurm-devel-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 4996 Mar 15 09:24 slurm-example-configs-18.08.8-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 1157424 Mar 15 09:24 slurm-slurmctld-18.08.8-1.el7.x86_64.rpm

As well, the files

cyclecloud-api-7.9.2.tar.gz
job_submit_cyclecloud_ubuntu_19.05.5-1.so
job_submit_cyclecloud_centos_19.05.5-1.so

do not seem to be built by the container with docker-buildrpm.sh - I had to fetch them from the GitHub releases page (wget).

When i changed the VERSION in the 00-build-slurm.sh script to 18.08.9 and re-ran the docker-rpmbuild.sh script, i can see the files are built with the version matching what the project.ini is looking for.

Im not 100% sure this is a bug (if so i can do a PR if needed, seems simple) or if im missing something in the steps to build/deploy this project to locker.

Thanks,
Daniel

root@cyclecloud blobs]# ls -ltr
-rw-r--r--. 1 root root 13424068 Mar 15 09:47 slurm-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 795780 Mar 15 09:47 slurm-perlapi-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 78540 Mar 15 09:47 slurm-devel-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 5000 Mar 15 09:47 slurm-example-configs-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 1157280 Mar 15 09:47 slurm-slurmctld-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 626380 Mar 15 09:47 slurm-slurmd-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 670392 Mar 15 09:47 slurm-slurmdbd-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 140940 Mar 15 09:47 slurm-libpmi-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 116664 Mar 15 09:47 slurm-torque-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 8476 Mar 15 09:47 slurm-openlava-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 16612 Mar 15 09:47 slurm-contribs-18.08.9-1.el7.x86_64.rpm
-rw-r--r--. 1 root root 147612 Mar 15 09:47 slurm-pam_slurm-18.08.9-1.el7.x86_64.rpm
-rwxr-xr-x. 1 root root 198400 Mar 15 09:51 job_submit_cyclecloud_centos_18.08.9-1.so
-rwxr-xr-x. 1 root root 198400 Mar 15 09:51 job_submit_cyclecloud_ubuntu_18.08.9-1.so

Chef error while upgrading to CycleCloud Slurm 2.4.8

My current versions are:

CycleCloud: 8.2.0-1616
Slurm: 20.11.7-1
CycleCloud-Slurm: 2.4.7
OS: CentOS Linux release 7.8.2003 (Core)

While upgrading CycleCloud-Slurm to version 2.4.8 I encountered the following error on the scheduler node while starting my cluster. I know this is a prerelease, but just wanted to make you aware of it. Hope this helps :-)

Chef::Mixin::Template::TemplateError: undefined method `[]' for nil:NilClass
Software Configuration
Review local log files on the VM at /opt/cycle/jetpack/logs
Get more help on this issue
Detail:

/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:163:in rescue in _render_template' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:159:in _render_template'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/mixin/template.rb:147:in render_template' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/template/content.rb:76:in file_for_provider'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/file_content_management/content_base.rb:40:in tempfile' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:450:in tempfile'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:327:in do_generate_content' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:140:in action_create'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider/file.rb:152:in action_create_if_missing' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/provider.rb:171:in run_action'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource.rb:592:in run_action' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:70:in run_action'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in block (2 levels) in converge' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in each'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:98:in block in converge' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/resource_list.rb:94:in block in execute_each_resource'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:114:in call_iterator_block' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:85:in step'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:103:in iterate' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/stepable_iterator.rb:55:in each_with_index'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/resource_collection/resource_list.rb:92:in execute_each_resource' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/runner.rb:97:in converge'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:718:in block in converge' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:713:in catch'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:713:in converge' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:752:in converge_and_save'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/client.rb:286:in run' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:292:in block in fork_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:280:in fork' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:280:in fork_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:245:in block in run_chef_client' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/local_mode.rb:44:in with_server_connectivity'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:233:in run_chef_client' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:470:in sleep_then_run_chef_client'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:459:in block in interval_run_chef_client' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:458:in loop'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:458:in interval_run_chef_client' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/client.rb:442:in run_application'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application.rb:59:in run' /opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/lib/chef/application/solo.rb:225:in run'
/opt/cycle/jetpack/system/embedded/lib/ruby/gems/2.5.0/gems/chef-13.12.14/bin/chef-solo:25:in <top (required)>' /opt/cycle/jetpack/system/embedded/bin/chef-solo:23:in load'
/opt/cycle/jetpack/system/embedded/bin/chef-solo:23:in `

'

1 node with this status

SLURM 2.4.8 pmix libs are not included

SLURM 2.4.8 is compiled with pmix support so MPI libraries will interface correctly with SLURM srun.
It seems that SLURM 2.4.8 expects the pmix libs to be installed on the compute nodes at /opt/pmix/v3, but these do not exist and need to be built (e.g. via a cluster-init project).

#!/bin/bash

cd ~/
mkdir -p /opt/pmix/v3
apt install -y libevent-dev
tar xvf $CYCLECLOUD_SPEC_PATH/files/openpmix-3.1.6.tar.gz
cd openpmix-3.1.6
#mkdir -p pmix/build/v3 pmix/install/v3
#cd pmix
#git clone https://github.com/openpmix/openpmix.git source
#cd source/
#git branch -a
#git checkout v3.1
#git pull
./autogen.sh
#cd ../build/v3/
./configure --prefix=/opt/pmix/v3
make -j install >/dev/null

Can the pmix libs be included with SLURM versions built with pmix support (SLURM 2.4.8+)?

Autoscaling doesn't work for pending jobs

I am using cyclecloud with a SLURM queueing system in combination with dask (a Python package that manages the bookkeeping and task distribution such that one doesn't have to write jobscripts and collect the data.)

It's possible to use that adaptive scaling feature of dask which means, that as the load on all my nodes becomes high, it automatically creates new jobs using (in my case) the following job script:

#!/bin/bash

#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -e TMPDIR/dask-worker-%J.err
#SBATCH -o TMPDIR/dask-worker-%J.out
#SBATCH -p debug
#SBATCH -A WAL
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=15G
#SBATCH -t 240:00:00
JOB_ID=${SLURM_JOB_ID%;*}

export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

/gscratch/home/t-banij/miniconda3/envs/py37_min/bin/python -m distributed.cli.dask_worker tcp://10.75.64.5:43267 --nthreads 1 --memory-limit 16.00GB --name dask-worker--${JOB_ID}-- --death-timeout 60 --local-directory TMPDIR

This results in many PENDING jobs. Unfortunately, no new nodes are started automatically, I still need to go into the web interface and start new nodes by hand, rendering the autoscale feature I am using useless.

Seems related to #1, #5, and #9.

slurm-libpmi is missing

Hi all,
I am trying to install openmpi with slurm support on cyclecloud:

spack install hpl^openmpi+pmi schedulers=slurm

But I get this error:

'/configure' '--prefix=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/openmpi-3.1.5-3uku7u7irjzvpxy5uwajd34ksg5vyyim' '--enable-shared' '--with-wrapper-ldflags=' '--with-pmi=/usr' '--enable-static' '--enable-mpi-cxx' '--with-zlib=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/zlib-1.2.11-7zv5elkj6r5xcrw4mifho5mfhi6wuwpq' '--without-psm' '--without-libfabric' '--without-ucx' '--without-mxm' '--without-verbs' '--without-psm2' '--without-alps' '--without-lsf' '--without-sge' '--with-slurm' '--without-tm' '--without-loadleveler' '--disable-memchecker' '--with-hwloc=/mnt/exports/shared/bin/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.1/hwloc-1.11.11-7tt3mdset5tqvje4uaonjdqm2b3koihp' '--disable-java' '--disable-mpi-java' '--without-cuda' '--enable-cxx-exceptions'

1 error found in build log:
     1125    configure: WARNING:     /usr/slurm
     1126    configure: WARNING: Specified path: /usr
     1127    configure: WARNING: OR neither libpmi nor libpmi2 were found under
             :
     1128    configure: WARNING:     /lib
     1129    configure: WARNING:     /lib64
     1130    configure: WARNING: Specified path:
  >> 1131    configure: error: Aborting

It is a very similar problem to: aws/aws-parallelcluster#1008
PMI has been split due to this bug: https://bugs.schedmd.com/show_bug.cgi?id=4511

May I ask to add the package slurm-libpmi?

Thanks

each job starts a virtual machine

I'm wondering why each job submission ends up in its own dedicated virtual machine, although slots are available on, e.g., the node of the first job.

Some modern slurm features not supported

With the currently supported version of Slurm, the slurmrestd service can be built by the addition of a few dependencies in the rpmbuild stage and the addition of an rpmbuild flag:

RUN yum install -y http-parser-devel json-c-devel
RUN rpmbuild  --define "_with_slurmrestd 1" -ta ${SLURM_PKG}

How this would then be added to chef/cloud-init is unfortunately beyond me!

There are also several cloud focused developments (such as burstbuffer/lua for staging of files which is useful when using cloud compute resources) and commonly required features (job script capturing, parsable account queries, etc) which are unavailable due to them only being available in Slurm 21.08.

Is there a timeline for when the slurm version will be bumped to the latest, and is there any scope for enabling additional features like slurmrestd?

change slurm installation path

Hi,
is there any option to change the SLURM installation path?
The benefit would be syncing it with onpremise deployments which makes transition easier.

Thanks

Race condition when two jobs request the same node

For jobs that can run on or have requested the same node, there's a race condition where the second job to request the node may fail with a communication error.

Working on a workaround to get the job to block until the node is completely up and registered with Slurm. It's not clear yet if this is a bug in Slurm or an issue with the Chef recipe order of operations.

Problem using the default image

I'm not sure if I'm doing something wrong, but I'm utterly failing to use the default Slurm image to do some basic MPI.

I've stood up a simple cyclecloud-slurm setup with the default image (Cycle CentOS 7) for all node types.

I can run a simple single node MPI job as long as I use Intel MPI:

srun -n2 mpirun ./hello

That works fine.

But as soon as I try to use more than one node, I get all sorts of infiniband related error messages.

If I try to use a machinetype that supports Infiniband I end up with no functioning MPI at all, as it appears to not get installed.

Try even a single node MPI test with OpenMPI and I get errors possibly related to hostname issues:

--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[22239,0],0] on node ip-0A000406
  Remote daemon: [[22239,0],1] on node hpc-1

Looks like OpenMPI is working with the Slurm node name (hpc-1) rather than that actual hostname (ip-0A000406), and then possibly getting upset as a result?

If I've missed something really obvious, please do point it out :)

Thanks,

John

Support for Multiple VM Sizes per Partition

The current cyclecloud_slurm supports neither multiple MachineType values per nodearray nor multiple nodearrays assigned to the same Slurm partition. If multiple values for either are supplied, the python code will take only the first value in the list. Remarks in the partition class definition say that a one-to-one mapping of partition names to nodearrays is required.

Cyclecloud cluster templates themselves support multiple machine type values per nodearray and Slurm supports multiple machine types per partition. The current limitation of one machine type per partition is a function of the Cyclecloud implementation. Users of a cluster would benefit from being able to ask for a number of cores in a single partition and having the scheduler determine which size VM to create.

Location of files in the Blobs

Hello,

There are a bunch of files listed in the files list presented in the project.ini.

I understand that Github is not an object storage and hence the files are not in the repository. Could someone point me to the files ?

No autoscaling when using Centos8 base image

Hi, it seems that autoscaling no longer works with Centos8

Tested with :

Cyclecloud Version: 8.1.0-1275
Cyclecloud-Slurm 2.4.2

Results:
Centos8 + Slurm - 20.11.0-1 = No Autoscaling
Centos7 + Slurm - 20.11.0-1 = Autoscale works

[root@ip-0A781804 slurmctld]# cat slurmctld.log
[2020-12-18T00:15:27.016] debug:  Log file re-opened
[2020-12-18T00:15:27.020] debug:  creating clustername file: /var/spool/slurmd/clustername
[2020-12-18T00:15:27.021] error: Configured MailProg is invalid
[2020-12-18T00:15:27.021] slurmctld version 20.11.0 started on cluster asdasd
[2020-12-18T00:15:27.021] cred/munge: init: Munge credential signature plugin loaded
[2020-12-18T00:15:27.021] debug:  auth/munge: init: Munge authentication plugin loaded
[2020-12-18T00:15:27.021] select/cons_res: common_init: select/cons_res loaded
[2020-12-18T00:15:27.021] select/cons_tres: common_init: select/cons_tres loaded
[2020-12-18T00:15:27.021] select/cray_aries: init: Cray/Aries node selection plugin loaded
[2020-12-18T00:15:27.021] select/linear: init: Linear node selection plugin loaded with argument 20
[2020-12-18T00:15:27.021] preempt/none: init: preempt/none loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2020-12-18T00:15:27.021] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  jobacct_gather/none: init: Job accounting gather NOT_INVOKED plugin loaded
[2020-12-18T00:15:27.022] ext_sensors/none: init: ExtSensors NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  switch/none: init: switch NONE plugin loaded
[2020-12-18T00:15:27.022] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:15:27.022] accounting_storage/none: init: Accounting storage NOT INVOKED plugin loaded
[2020-12-18T00:15:27.022] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/assoc_usage`, No such file or directory
[2020-12-18T00:15:27.022] debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
[2020-12-18T00:15:27.023] debug:  NodeNames=hpc-pg0-[1-4] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug:  NodeNames=htc-[1-5] setting Sockets=60 based on CPUs(60)/(CoresPerSocket(1)/ThreadsPerCore(1))
[2020-12-18T00:15:27.023] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2020-12-18T00:15:27.023] topology/tree: init: topology tree plugin loaded
[2020-12-18T00:15:27.023] debug:  No DownNodes
[2020-12-18T00:15:27.023] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/last_config_lite`, No such file or directory
[2020-12-18T00:15:27.140] debug:  Log file re-opened
[2020-12-18T00:15:27.141] sched: Backfill scheduler plugin loaded
[2020-12-18T00:15:27.141] debug:  topology/tree: _read_topo_file: Reading the topology.conf file
[2020-12-18T00:15:27.141] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2020-12-18T00:15:27.141] debug:  topology/tree: _log_switches: Switch level:0 name:hpc-Standard_HB60rs-pg0 nodes:hpc-pg0-[1-4] switches:(null)
[2020-12-18T00:15:27.141] debug:  topology/tree: _log_switches: Switch level:0 name:htc nodes:htc-[1-5] switches:(null)
[2020-12-18T00:15:27.141] route/default: init: route default plugin loaded
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open node state file /var/spool/slurmd/node_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Information may be lost!
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/node_state.old`, No such file or directory
[2020-12-18T00:15:27.141] No node state file (/var/spool/slurmd/node_state.old) to recover
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:15:27.141] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:15:27.141] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:15:27.141] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:15:27.142] No job state file (/var/spool/slurmd/job_state.old) to recover
[2020-12-18T00:15:27.142] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  gres/gpu: init: loaded
[2020-12-18T00:15:27.142] debug:  Updating partition uid access list
[2020-12-18T00:15:27.142] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open reservation state file /var/spool/slurmd/resv_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Reservations may be lost
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/resv_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No reservation state file (/var/spool/slurmd/resv_state.old) to recover
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state`, No such file or directory
[2020-12-18T00:15:27.143] error: Could not open trigger state file /var/spool/slurmd/trigger_state: No such file or directory
[2020-12-18T00:15:27.143] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-12-18T00:15:27.143] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/trigger_state.old`, No such file or directory
[2020-12-18T00:15:27.143] No trigger state file (/var/spool/slurmd/trigger_state.old) to recover
[2020-12-18T00:15:27.143] read_slurm_conf: backup_controller not specified
[2020-12-18T00:15:27.143] Reinitializing job accounting state
[2020-12-18T00:15:27.143] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-18T00:15:27.143] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-18T00:15:27.143] Running as primary controller
[2020-12-18T00:15:27.143] debug:  No backup controllers, not launching heartbeat.
[2020-12-18T00:15:27.143] debug:  priority/basic: init: Priority BASIC plugin loaded
[2020-12-18T00:15:27.143] No parameter for mcs plugin, default values set
[2020-12-18T00:15:27.143] mcs: MCSParameters = (null). ondemand set.
[2020-12-18T00:15:27.143] debug:  mcs/none: init: mcs none plugin loaded
[2020-12-18T00:15:57.143] debug:  sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:15:57.143] debug:  sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:16:27.212] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-18T00:16:27.212] debug:  sched: Running job scheduler
[2020-12-18T00:17:27.284] debug:  sched: Running job scheduler
[2020-12-18T00:17:27.285] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:18:27.362] debug:  sched: Running job scheduler
[2020-12-18T00:19:27.438] debug:  sched: Running job scheduler
[2020-12-18T00:19:27.438] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:20:27.512] debug:  sched: Running job scheduler
[2020-12-18T00:20:27.513] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state`, No such file or directory
[2020-12-18T00:20:27.513] error: Could not open job state file /var/spool/slurmd/job_state: No such file or directory
[2020-12-18T00:20:27.513] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-12-18T00:20:27.513] debug:  create_mmap_buf: Failed to open file `/var/spool/slurmd/job_state.old`, No such file or directory
[2020-12-18T00:20:27.513] No job state file (/var/spool/slurmd/job_state.old) found
[2020-12-18T00:21:27.676] debug:  sched: Running job scheduler
[2020-12-18T00:21:27.676] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:22:27.749] debug:  sched: Running job scheduler
[2020-12-18T00:23:27.821] debug:  sched: Running job scheduler
[2020-12-18T00:23:27.821] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:24:27.892] debug:  sched: Running job scheduler
[2020-12-18T00:25:27.966] debug:  Updating partition uid access list
[2020-12-18T00:25:27.966] debug:  sched: Running job scheduler
[2020-12-18T00:25:28.067] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:26:27.139] debug:  sched: Running job scheduler
[2020-12-18T00:27:27.212] debug:  sched: Running job scheduler
[2020-12-18T00:27:28.214] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:28:27.286] debug:  sched: Running job scheduler
[2020-12-18T00:29:27.361] debug:  sched: Running job scheduler
[2020-12-18T00:29:28.362] debug:  shutting down backup controllers (my index: 0)
[2020-12-18T00:30:27.437] debug:  sched: Running job scheduler
[2020-12-18T00:30:55.711] req_switch=-2 network='(null)'
[2020-12-18T00:30:55.711] Setting reqswitch to 1.
[2020-12-18T00:30:55.711] returning.
[2020-12-18T00:30:55.712] sched: _slurm_rpc_allocate_resources JobId=2 NodeList=htc-1 usec=1268
[2020-12-18T00:30:56.261] debug:  sched/backfill: _attempt_backfill: beginning
[2020-12-18T00:30:56.261] debug:  sched/backfill: _attempt_backfill: no jobs to backfill
[2020-12-18T00:30:57.263] error: power_save: program exit status of 1
[2020-12-18T00:31:27.588] debug:  sched: Running job scheduler
[2020-12-18T00:31:28.589] debug:  shutting down backup controllers (my index: 0)
[root@ip-0A781804 slurmctld]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2       htc hostname  andreim CF       1:37      1 htc-1

[root@ip-0A781804 slurmctld]# sinfo -V
slurm 20.11.0

[root@ip-0A781804 slurmctld]# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmctld.service.d
           └─override.conf
   Active: active (running) since Fri 2020-12-18 00:15:26 UTC; 17min ago
 Main PID: 3980 (slurmctld)
    Tasks: 8
   Memory: 5.5M
   CGroup: /system.slice/slurmctld.service
           └─3980 /usr/sbin/slurmctld -D

Dec 18 00:15:26 ip-0A781804 systemd[1]: Started Slurm controller daemon.
[root@ip-0A781804 slurm]# cat topology.conf
SwitchName=hpc-Standard_HB60rs-pg0 Nodes=hpc-pg0-[1-4]
SwitchName=htc Nodes=htc-[1-5]

[root@ip-0A781804 slurm]# cat slurm.conf
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
PropagateResourceLimits=ALL
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser="slurm"
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
GresTypes=gpu
SelectTypeParameters=CR_Core_Memory
ClusterName="ASDASD"
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurmctld/slurmctld.log
SlurmctldParameters=idle_on_node_suspend
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurmd/slurmd.log
TopologyPlugin=topology/tree
JobSubmitPlugins=job_submit/cyclecloud
PrivateData=cloud
TreeWidth=65533
ResumeTimeout=1800
SuspendTimeout=600
SuspendTime=300
ResumeProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_program.sh
ResumeFailProgram=/opt/cycle/jetpack/system/bootstrap/slurm/resume_fail_program.sh
SuspendProgram=/opt/cycle/jetpack/system/bootstrap/slurm/suspend_program.sh
SchedulerParameters=max_switch_wait=24:00:00
AccountingStorageType=accounting_storage/none
Include cyclecloud.conf
SlurmctldHost=ip-0A781804

[root@ip-0A781804 slurm]# cat cyclecloud.conf
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=hpc Nodes=hpc-pg0-[1-4] Default=YES DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=hpc-pg0-[1-4] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440
# Note: CycleCloud reported a RealMemory of 228884 but we reduced it by 11444 (i.e. max(1gb, 5%)) to account for OS/VM overhead which
# would result in the nodes being rejected by Slurm if they report a number less than defined here.
# To pick a different percentage to dampen, set slurm.dampen_memory=X in the nodearray's Configuration where X is percentage (5 = 5%).
PartitionName=htc Nodes=htc-[1-5] Default=NO DefMemPerCPU=3624 MaxTime=INFINITE State=UP
Nodename=htc-[1-5] Feature=cloud STATE=CLOUD CPUs=60 CoresPerSocket=1 RealMemory=217440

Question regarding /sched default size

I noticed that by default /sched mount in the slurm.txt template has 1T of memory. But as far as I understood it only contained some slurm and cyclecloud configurations. Why does it need all this memory then?

`slurmd_sysconfig` may not work for Debian users when `slurm.install` is false

Native Debian (and also Ubuntu) uses /etc/default/slurmd as the system config path, instead of RHEL's /etc/sysconfig/slurmd.
Therefore, if the user runs a native Slurm installation, the following code may be useless.

directory '/etc/sysconfig' do
  action :create
end

file '/etc/sysconfig/slurmd' do
  content slurmd_sysconfig
  mode '0700'
  owner 'slurm'
  group 'slurm'
end

A fix could be: when node['slurm']['install'] == true and the /etc/default/slurmd directory exists, write the file into that directory instead.

Slow autoscale with Array jobs

Autoscaling after submitting an Slurm Array type job results in very slow spinning up of the cluster.
This appears to be because even with an arbitrarily large amount of array jobs specified:

  • only a single extra node is requested,
  • this then spins up,
  • once spun up, the next array job starts, and only then is a new node requested.

Due to the ~5-10 minutes to spin up an instance, this severely limits autoscaling.

A work around is to submit a dummy non-array job requesting a suitable number of nodes.
Even if this job is stuck behind the Array job in the queue, it still results in the cluster spinning up many nodes at once, which the array jobs are then dispatched to.

e.g. to request 100 CPUs worth of nodes
echo '#!/bin/sh' | sbatch -n 100

"slurmctld restart" stuck after scaling the nodes

CycleCloud Version - 8.1.0-1275
Slurm - 19.05.8-1

Scenario:

  1. Changing the Max core count for the HPC array in the CycleCloud UI
  2. Run the scale command (./cyclecloud_slurm.sh scale) and we see below behavior:

{{{
sinfo doesn't show up new added node and it seems slurmctld stuck in restart:
[root@ip-0A060009 slurm]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
hpc* up infinite 2 alloc hpc-pg0-[1-2]
htc up infinite 2 idle~ htc-[1-2]

[root@ip-0A060009 slurm]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/slurmctld.service.d
└─override.conf
Active: failed (Result: exit-code) since Thu 2021-02-18 20:42:28 UTC; 3s ago
Process: 11278 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 11280 (code=exited, status=1/FAILURE)

Feb 18 20:42:28 ip-0A060009 systemd[1]: Starting Slurm controller daemon...
Feb 18 20:42:28 ip-0A060009 systemd[1]: Started Slurm controller daemon.
Feb 18 20:42:28 ip-0A060009 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Feb 18 20:42:28 ip-0A060009 systemd[1]: Unit slurmctld.service entered failed state.
Feb 18 20:42:28 ip-0A060009 systemd[1]: slurmctld.service failed.
}}}

Connection refused for `Isend` over OpenMPI for `F2s_v2` nodes

Hello,

OpenMPI was recently upgraded from version 4.0.5 to 4.1.0 on CycleCloud. Since the upgrade I'm having issues using non-blocking communication with Slurm on CycleCloud.

First, I have to use -mca ^hcoll to avoid warnings regarding InfiniBand, which F2s_v2 nodes are not equipped with. I had this issue also with version 4.0.5.

Second, since the recent upgrade to OpenMPI v4.1.0, non-blocking communication has stopped working for me. The code I have worked for OpenMPI v4.0.5.

This is the error I'm getting. I've confirmed that the problem occurs when I call MPI_Isend. I'm attaching a small example to reproduce this problem below.

Process 1 started 
Initiating communication on worker 1
[1622528555.546527] [ip-0A000007:9268 :0] sock.c:259 UCX ERROR connect(fd=30, dest_addr=127.0.0.1:58173) failed: Connection refused
[ip-0A000007:09268] pml_ucx.c:383  Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[ip-0A000007:09268] pml_ucx.c:453  Error: Failed to resolve UCX endpoint for rank 0
[ip-0A000007:09268] *** An error occurred in MPI_Isend
[ip-0A000007:09268] *** reported by process [3673554945,1]
[ip-0A000007:09268] *** on communicator MPI_COMM_WORLD
[ip-0A000007:09268] *** MPI_ERR_OTHER: known error not in list
[ip-0A000007:09268] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-0A000007:09268] ***    and potentially your MPI job)

Program code (mpi_isend.c):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
      MPI_Comm comm;
      MPI_Request request;
      MPI_Status status;
      int myid, master, tag, proc, my_int, p;

      comm = MPI_COMM_WORLD;
      MPI_Init(&argc, &argv);               /* start MPI */
      MPI_Comm_rank(comm, &myid);           /* get current process id */
      MPI_Comm_size(comm, &p);              /* get number of processes */

      master = 0;
      tag = 123;                            /* tag to identify this particular job */
      my_int = myid;                        /* value each worker sends to the master */
      printf("Process %d started\n", myid);

      if (myid == master) {
            for (proc = 1; proc < p; proc++) {
                  MPI_Recv(
                        &my_int, 1, MPI_INT,   /* buffer, count, data type */
                        MPI_ANY_SOURCE,        /* message source */
                        MPI_ANY_TAG,           /* message tag */
                        comm, &status);        /* status identifies source, tag */
                  printf("Received from a worker\n");
            }
            printf("Master finished\n");
      } else {
            printf("Initiating communication on worker %d\n", myid);
            MPI_Isend(                         /* non-blocking send */
                  &my_int, 1, MPI_INT,         /* buffer, count, data type */
                  master,
                  tag,
                  comm,
                  &request);                   /* send my_int to the master */
            MPI_Wait(&request, &status);       /* block until the Isend is done */
            printf("Worker %d finished\n", myid);
      }
      MPI_Finalize();                          /* let MPI finish up */
      return 0;
}

Jobfile (mpi_isend.job):

#!/bin/sh -l
#SBATCH --job-name=pool
#SBATCH --output=pool.out
#SBATCH --nodes=2
#SBATCH --time=600:00
#SBATCH --tasks-per-node=1
#SBATCH --partition=hpc
mpirun -mca coll ^hcoll ./mpi_isend

Steps to reproduce:

mpicc mpi_isend.c -o mpi_isend
sbatch mpi_isend.job

SlurmDBD role for scheduler HA

Per discussion with @anhoward, there is a need for a new role for the scenario with HA schedulers and Slurm Accounting. When the primary scheduler fails, slurmdbd should be able to run on the HA scheduler.

Slurm 3.0.1 dynamic partition gres.conf uses entire core allotment

CC = 8.4.0
Slurm Cluster-init = 3.0.1
Slurm version = 22.05.8-1

ISSUE
Using a dynamic partition with multiple VM types, including a GPU type, results in a gres.conf entry that covers the nodearray's entire core allotment, even though the defined GPU nodes are only a subset of the total.

STEPS TO REPRODUCE

  1. create a custom cluster with dynamic nodearray limited to 100 cores and Config.Multiselect = true
  2. select multiple VM types for the dynamic partition and start the cluster
  3. run scontrol create NodeName=... to register the nodes for autoscaling in Slurm, for example:
scontrol create NodeName=jm-slurm-multi-low-[1-10] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=3072 Feature=dyn,Standard_F2s_v2 State=CLOUD
scontrol create NodeName=jm-slurm-multi-mid-[1-10] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=7168 Feature=dyn,Standard_D2ds_v4 State=CLOUD
scontrol create NodeName=jm-slurm-multi-high-[1-15] CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=15360 Feature=dyn,Standard_E2ds_v4 State=CLOUD
scontrol create NodeName=jm-slurm-multi-gpu-[1-5] CPUs=6 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=54476 Gres=gpu:1 Feature=dyn,Standard_NC6_Promo State=CLOUD
  4. run azslurm scale to build the gres.conf
  5. view /etc/slurm/gres.conf
root@jm-slurm-multi-hn:~# cat /etc/slurm/gres.conf

Nodename=jm-slurm-mutli-dynamic-[1-16] Name=gpu Count=1 File=/dev/nvidia0

WORKAROUND
Manually update gres.conf with the correct hostnames and GPU counts.
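
For this example, a manually corrected gres.conf might look like the following (a sketch, assuming the GPU node names created in step 3 and a single NVIDIA device per node):

# hypothetical corrected gres.conf: restrict the GPU definition to the GPU nodes only
Nodename=jm-slurm-multi-gpu-[1-5] Name=gpu Count=1 File=/dev/nvidia0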

Slurm Accounting add cluster fails on cluster terminate/restart

Slurm proj: 2.4.4
Slurm ver: 20.11.4-1
CC ver: 8.1.0-1275

A Slurm cluster connected to a MariaDB service in Azure will fail to restart cleanly when terminated/restarted. The following error is seen:

================================================================================
Error executing action `run` on resource 'bash[Add cluster to slurmdbd]'
================================================================================

Mixlib::ShellOut::ShellCommandFailed
------------------------------------
Expected process to exit with [0], but received '1'
---- Begin output of "bash"  "/tmp/chef-script20210401-1390-h8kq8x" ----
STDOUT: This cluster ucla-slurm-docker-grafana already exists.  Not adding.
STDERR: 
---- End output of "bash"  "/tmp/chef-script20210401-1390-h8kq8x" ----
Ran "bash"  "/tmp/chef-script20210401-1390-h8kq8x" returned 1

Resource Declaration:
---------------------
# In /opt/cycle/jetpack/system/chef/cache/cookbooks/slurm/recipes/accounting.rb

 77: bash 'Add cluster to slurmdbd' do
 78:     code <<-EOH
 79:         sacctmgr -i add cluster #{clustername} && touch /etc/slurmdbd.configured 
 80:         EOH
 81:     not_if { ::File.exist?('/etc/slurmdbd.configured') }
 82:     not_if "sleep 5 && sacctmgr show cluster | grep #{clustername}" 
 83: end
 84: 

Compiled Resource:
------------------
# Declared in /opt/cycle/jetpack/system/chef/cache/cookbooks/slurm/recipes/accounting.rb:77:in `from_file'

bash("Add cluster to slurmdbd") do
  action [:run]
  default_guard_interpreter :default
  command nil
  backup 5
  returns 0
  user nil
  interpreter "bash"
  declared_type :bash
  cookbook_name "slurm"
  recipe_name "accounting"
  code "        sacctmgr -i add cluster UCLA-slurm-docker-grafana && touch /etc/slurmdbd.configured \n"
  domain nil
  not_if { #code block }
  not_if "sleep 5 && sacctmgr show cluster | grep UCLA-slurm-docker-grafana"
end

System Info:
------------
chef_version=13.12.14
platform=centos
platform_version=7.7.1908
ruby=ruby 2.5.7p206 (2019-10-01 revision 67816) [x86_64-linux]
program_name=chef-solo worker: ppid=1385;start=17:48:00;
executable=/opt/cycle/jetpack/system/embedded/bin/chef-solo


Running handlers:
  - CycleCloud::ExceptionHandler
  - CycleCloud::ExceptionHandler
Running handlers complete
Chef Client failed. 9 resources updated in 16 seconds
Error: A problem occurred while running Chef, check chef-client.log for details

On the scheduler node I see the cluster already exists in sacctmgr:

   [cyclecloud@glinc5zuux6 ~]$ sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit        MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
ucla-slur+                            0  9216         1                                                                                           normal           

I suspect accounting.rb line 77 needs to check whether the cluster already exists before trying to add it.
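
Note that the recipe already has a not_if guard, but it greps for the mixed-case cluster name while sacctmgr reports the name lowercased (see the STDOUT message and the sacctmgr list cluster output above), so the guard never matches. A minimal sketch of a more robust existence check, assuming the name only differs by case:

# hypothetical guard command: skip the add if the cluster already exists, ignoring case
sleep 5 && sacctmgr --noheader --parsable2 show cluster | cut -d'|' -f1 | grep -qix "ucla-slurm-docker-grafana"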

support RHEL images

the following files need to be updated to recognize RHEL images:

slurm/recipes/execute.rb
Line 68: change to when 'centos', 'rhel'

slurm/recipes/default.rb
Line 89: change to when 'centos', 'rhel'

slurm/attributes/default.rb
Line 11: change to when 'centos', 'rhel'

Incompatibility with tagging policies

When attempting to build clusters with one resource group per cluster, the build fails if any policy enforces the creation of certain tags on a resource group. Could you please include an option to set arbitrary tags on the resource groups/resources created when deploying a cluster?

GPU allocation stuck on "Waiting for resource configuration"

I've uploaded version 2.7.1 to the cluster (after downloading the blobs from GitHub). I've modified the template to include a GPU cluster along with the corresponding parameter definitions. I selected NC6 as the GPU machine type and started the cluster. Everything starts fine and I'm able to allocate the F32_vs nodes that correspond to the hpc partition. However, when allocating the GPU node, the node starts without any error reported in the console, but Slurm does not appear to recognize this and the allocation stays stuck on:

[hpcadmin@ip-0A030006 ~]$ salloc -p gpu -n 1
salloc: Granted job allocation 3
salloc: Waiting for resource configuration

This is a problem with both the CentOS 7 and AlmaLinux 8 operating systems. I have the following cloud-init scripts to install Singularity in each case. These don't appear to be the problem, as any error in the script is typically reported as an error in the web console.

CentOS 7:

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el7.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el7.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el7.x86_64.rpm

Almalinux 8

#!/usr/bin/bash
wget https://github.com/sylabs/singularity/releases/download/v3.9.9/singularity-ce-3.9.9-1.el8.x86_64.rpm
sudo yum localinstall -y ./singularity-ce-3.9.9-1.el8.x86_64.rpm
rm ./singularity-ce-3.9.9-1.el8.x86_64.rpm

Slurm headless/burst topology.conf when "scale" cluster

When the cluster is scaled (i.e. cyclecloud_slurm.sh scale), the topology.conf file is overwritten, which also overwrites the on-prem switch definitions. Is it possible to prevent this or to separate the on-prem topology from the CycleCloud topology?

Add -b option to slurmd startup

There are occasions where slurmd times out checking in with slurmctld due to the node's high uptime before slurmd starts. Adding -b to the slurmd startup options works around the issue; we should include it by default.
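
A minimal sketch of the workaround, assuming the packaged slurmd unit picks up $SLURMD_OPTIONS from /etc/sysconfig/slurmd (the exact file and variable name depend on the build/packaging):

# hypothetical: pass -b ("report node rebooted when daemon restarted") to slurmd at startup
echo 'SLURMD_OPTIONS="-b"' >> /etc/sysconfig/slurmd
systemctl restart slurmd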

Confusing PCPU vs VCPU behavior

  1. We need to make this project's use of physical vs virtual cores and how those are represented configurable.

  2. CoresPerSocket does not seem to be calculated correctly. Here is a 32-vCPU, 16-pCPU VM, but it does not use 2 for CoresPerSocket:
    Nodename=htc-[1-4] Feature=cloud STATE=CLOUD CPUs=16 CoresPerSocket=1 RealMemory=124518

Node doesn't match existing scaleset when you add a new node to a node array

CycleCloud - 8.1.0-1275
Slurm - 19.05.8-1

Scenario:

  1. Submit a job to the HPC node array.
  2. Increase the max core count from the UI and run the scale command (cyclecloud_slurm.sh scale).
  3. The node in the acquiring state gets stuck with the message below:

{{{
This node does not match existing scaleset attributes: Configuration.cyclecloud.mounts.additional_nfs.export_path, Configuration.cyclecloud.mounts.additional_nfs.mountpoint
}}}

Attached screenshots: scaleset mismatch; node attributes (good node); node attributes (failed node).

Enable Hyperthreading Support using threads

The latest release switched from vCPUs to pCPUs, so hyperthreading is no longer captured in the configuration. Suggest using threads to allow 2 threads per core on VMs that support it.
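
A sketch of what the generated node definition could look like with threads exposed (hypothetical values, reusing the 32-vCPU/16-pCPU example from the PCPU vs VCPU issue above):

Nodename=htc-[1-4] Feature=cloud STATE=CLOUD CPUs=32 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=124518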

Mounting docker volume in docker script causes a problem

docker run -v $(pwd)/specs/default/cluster-init/files:/source -v $(pwd)/blobs:/root/rpmbuild/RPMS/x86_64 -ti centos:7 /bin/bash /source/00-build-slurm.sh triggers the error:
/bin/bash: /source/00-build-slurm.sh: Permission denied

because inside the container the permissions of the source dir are the following:

[root@11d3f968ca8c /]# ll
total 12
-rw-r--r--. 1 root root 12082 Mar 5 17:36 anaconda-post.log
drwxrwxr-x. 2 1002 1002 49 Apr 11 12:55 source

additional support for RHEL clones (almalinux, rockylinux)

Hi,

As the new RHEL clones take over from CentOS (which no longer provides bug-compatible builds), I'd consider AlmaLinux and Rocky Linux the major successor platforms.
Currently this repo only supports centos and rhel as platform strings, but it should support others too, especially as those are bug-compatible builds.
I understand, though, that current CycleCloud support is limited to ubuntu, centos, and rhel.

Thanks
