
cyclecloud-pbspro's Introduction

Azure CycleCloud OpenPBS project

OpenPBS is a highly configurable open source workload manager. See the OpenPBS project site for an overview and the PBSpro documentation for more information on using, configuring, and troubleshooting OpenPBS in general.

Versions

OpenPBS (formerly PBS Professional OSS) is released as part of version 20.0.0. PBSPro OSS is still available in CycleCloud by specifying the PBSPro OSS version in the cluster configuration:

   [[[configuration]]]
   pbspro.version = 18.1.4-0

Installing Manually

Note: When using the OpenPBS cluster type that ships with CycleCloud, the autoscaler and default queues are already installed.

First, download the installer package from GitHub. For example, the commands below download the 2.0.21 release.

# Prerequisite: python3, 3.6 or newer, must be installed and in the PATH
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.21/cyclecloud-pbspro-pkg-2.0.21.tar.gz
tar xzf cyclecloud-pbspro-pkg-2.0.21.tar.gz
cd cyclecloud-pbspro
# Optional, but recommended. Adds relevant resources and enables strict placement
./initialize_pbs.sh
# Optional. Sets up workq as a colocated, MPI-focused queue and creates htcq for non-MPI workloads.
./initialize_default_queues.sh

# Creates the azpbs autoscaler
./install.sh  --venv /opt/cycle/pbspro/venv

# If you have jetpack available, you may use the following:
# ./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
#                              --username $(jetpack config cyclecloud.config.username) \
#                              --password $(jetpack config cyclecloud.config.password) \
#                              --url $(jetpack config cyclecloud.config.web_server) \
#                              --cluster-name $(jetpack config cyclecloud.cluster.name)

# Otherwise insert your username, password, url, and cluster name here.
./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
                             --username user \
                             --password password \
                             --url https://fqdn:port \
                             --cluster-name cluster_name

# Lastly, run this to understand any changes that may be required.
# For example, you typically have to add the ungrouped and group_id resources
# to the /var/spool/pbs/sched_priv/sched_config file and restart PBS.
## [root@scheduler cyclecloud-pbspro]# azpbs validate
## ungrouped is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
## group_id is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
azpbs validate
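
If azpbs validate reports missing resources like the messages above, one way to apply the fix is shown below. This is a minimal sketch: it assumes your sched_config contains a standard resources: "..." line and that PBS is managed by systemd, so review the resulting line and the restart method for your installation.

# Append the resources reported by azpbs validate to the scheduler's resources line (run this only once),
# then restart PBS so the scheduler picks up the change.
sed -i 's/^resources: "\(.*\)"/resources: "\1, ungrouped, group_id"/' /var/spool/pbs/sched_priv/sched_config
systemctl restart pbs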

Autoscale and scalesets

To help ensure that the correct VMs are provisioned for different types of jobs, CycleCloud treats the autoscaling of MPI and serial jobs differently in OpenPBS clusters.

For serial jobs, multiple VM scalesets (VMSS) are used in order to scale as quickly as possible. For MPI jobs to use the InfiniBand fabric on VM sizes that support it, all of the nodes allocated to the job have to be deployed in the same VMSS. CycleCloud handles this by assigning a PlacementGroupId that groups nodes with the same id into the same VMSS. By default, workq appends the equivalent of -l place=scatter:group=group_id by using native queue defaults.
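
For example, assuming the default queues created by initialize_default_queues.sh above, an MPI-style job submitted to workq picks up the colocated placement automatically, while a throughput job can target htcq or request ungrouped=true to avoid placement groups. The application names and resource counts below are illustrative:

# MPI job: workq's default place=scatter:group=group_id keeps all chunks in one VMSS
echo ./my_mpi_app | qsub -l select=4:ncpus=2:mpiprocs=2

# Throughput job: htcq (or ungrouped=true) can scale across multiple scalesets
echo ./my_serial_app | qsub -q htcq -l select=1:ncpus=1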

Hooks

Our PBS integration uses three different PBS hooks. autoscale does the bulk of the work required to scale the cluster up and down; all relevant log messages can be seen in /opt/cycle/pbspro/autoscale.log. cycle_sub_hook validates jobs unless they use -l nodes syntax, in which case those jobs are held and later processed by our last hook, cycle_sub_hook_periodic.
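
You can list the installed hooks and their settings with qmgr; the hook names below reflect the three hooks described above, and the exact output will vary by installation.

qmgr -c "list hook"
## Expect to see autoscale, cycle_sub_hook, and cycle_sub_hook_periodic among the listed hooks.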

Autoscale Hook

The most important is the autoscale hook, which runs by default on a 15-second interval. You can adjust this frequency by running:

qmgr -c "set hook autoscale freq=NUM_SECONDS"
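
For example, to run the autoscale hook once a minute instead of the default 15 seconds:

qmgr -c "set hook autoscale freq=60"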

Submission Hooks

cycle_sub_hook will validate that your job has the proper placement restrictions set. If it encounters a problem, it will output a detailed message explaining why the job was rejected and how to resolve the issue. For example:

$> echo sleep 300 | qsub -l select=2 -l place=scatter
Please do one of the following
    1) Ensure this placement is set by adding group=group_id to your -l place= statement
        Note: Queue workq's resource_defaults.place=group=group_id
    2) Add -l skipcyclesubhook=true on this job
        Note: If the resource does not exist, create it -> qmgr -c 'create resource skipcyclesubhook type=boolean'
    3) Disable this hook for this queue via queue defaults -> qmgr -c 'set queue workq resources_default.skipcyclesubhook=true'
    4) Disable this plugin - 'qmgr -c 'set hook cycle_sub_hook enabled=false'
        Note: Disabling this plugin may prevent -l nodes= style submissions from working properly.

One important note: if you are using Torque-style submissions, i.e. those that use -l nodes instead of -l select, PBS will simply convert that submission into an equivalent -l select style submission. However, the default placement defined for the queue is not respected by PBS when converting the job. To get around this, we hold the job, and our last hook, cycle_sub_hook_periodic, periodically updates the job's placement and releases it.
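
For example, a Torque-style request such as the one below is converted by PBS into roughly -l select=2:ncpus=2:mpiprocs=2, but without the queue's default placement; the job is therefore held briefly until cycle_sub_hook_periodic adds place=scatter:group=group_id and releases it.

# converted by PBS to an equivalent select statement, without the queue's default placement
echo sleep 300 | qsub -l nodes=2:ppn=2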

Configuring Resources

The cyclecloud-pbspro application matches PBS resources to Azure cloud resources to provide rich autoscaling and cluster configuration tools. The application will be deployed automatically for clusters created via the CycleCloud UI, or it can be installed on any PBS admin host of an existing cluster. For more information on defining resources in autoscale.json, see ScaleLib's documentation.

The default resources defined in the cluster template we ship are:

{"default_resources": [
   {
      "select": {},
      "name": "ncpus",
      "value": "node.vcpu_count"
   },
   {
      "select": {},
      "name": "group_id",
      "value": "node.placement_group"
   },
   {
      "select": {},
      "name": "host",
      "value": "node.hostname"
   },
   {
      "select": {},
      "name": "mem",
      "value": "node.memory"
   },
   {
      "select": {},
      "name": "vm_size",
      "value": "node.vm_size"
   },
   {
      "select": {},
      "name": "disk",
      "value": "size::20g"
   }]
}

Note that disk is currently hardcoded to size::20g because of platform limitations in determining how much disk a node will have. Here is an example of handling VM size-specific disk sizes:

   {
      "select": {"node.vm_size": "Standard_F2"},
      "name": "disk",
      "value": "size::20g"
   },
   {
      "select": {"node.vm_size": "Standard_H44rs"},
      "name": "disk",
      "value": "size::2t"
   }

azpbs cli

The azpbs CLI is the main interface for all autoscaling behavior. Note that it has fairly powerful autocomplete capabilities. For example, after typing azpbs create_nodes --vm-size you can tab-complete the list of possible VM sizes. Autocomplete information is updated every azpbs autoscale cycle, but it can also be refreshed manually by running azpbs refresh_autocomplete.

Command                  Description
autoscale                End-to-end autoscale process, including creation, deletion and joining of nodes.
buckets                  Prints out autoscale bucket information, such as limits.
config                   Writes the effective autoscale config, after any preprocessing, to stdout.
create_nodes             Creates a set of nodes given various constraints. A CLI version of the NodeManager interface.
default_output_columns   Prints the default output columns for a given command.
delete_nodes             Deletes nodes, including draining and post-delete handling.
demand                   Dry-run version of autoscale.
initconfig               Creates an initial autoscale config and writes it to stdout.
jobs                     Writes out autoscale jobs as JSON. Note: running jobs are excluded.
join_nodes               Adds selected nodes to the scheduler.
limits                   Writes a detailed set of limits for each bucket. Defaults to JSON because of the number of fields.
nodes                    Queries nodes.
refresh_autocomplete     Refreshes local autocomplete information for cluster-specific resources and nodes.
remove_nodes             Removes nodes from the scheduler without terminating the actual instances.
retry_failed_nodes       Retries all nodes in a failed state.
shell                    Interactive Python shell with relevant objects in local scope. Use --script to run Python scripts.
validate                 Runs basic validation of the environment.
validate_constraint      Validates one or more constraints and outputs them as JSON.
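
A couple of typical invocations are shown below. The column names are illustrative; the --output-columns flag mirrors its usage in the buckets and create_nodes examples later in this document, and azpbs default_output_columns will show what a given command prints by default.

# show the current nodes with a reduced set of columns
azpbs nodes --output-columns name,hostname,job_ids

# save the effective (post-preprocessing) autoscale configuration for inspection
azpbs config > /tmp/effective_autoscale.json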

azpbs buckets

Use the azpbs buckets command to see which buckets of compute are available, how many nodes each can provide, and what resources they have.

azpbs buckets --output-columns nodearray,placement_group,vm_size,ncpus,mem,available_count
NODEARRAY PLACEMENT_GROUP     VM_SIZE         NCPUS MEM     AVAILABLE_COUNT
execute                       Standard_F2s_v2 1     4.00g   50             
execute                       Standard_D2_v4  1     8.00g   50             
execute                       Standard_E2s_v4 1     16.00g  50             
execute                       Standard_NC6    6     56.00g  16             
execute                       Standard_A11    16    112.00g 6              
execute   Standard_F2s_v2_pg0 Standard_F2s_v2 1     4.00g   50             
execute   Standard_F2s_v2_pg1 Standard_F2s_v2 1     4.00g   50             
execute   Standard_D2_v4_pg0  Standard_D2_v4  1     8.00g   50             
execute   Standard_D2_v4_pg1  Standard_D2_v4  1     8.00g   50             
execute   Standard_E2s_v4_pg0 Standard_E2s_v4 1     16.00g  50             
execute   Standard_E2s_v4_pg1 Standard_E2s_v4 1     16.00g  50             
execute   Standard_NC6_pg0    Standard_NC6    6     56.00g  16             
execute   Standard_NC6_pg1    Standard_NC6    6     56.00g  16             
execute   Standard_A11_pg0    Standard_A11    16    112.00g 6              
execute   Standard_A11_pg1    Standard_A11    16    112.00g 6

azpbs demand

It is common to want to test out autoscaling without actually allocating anything. azpbs demand is a dry-run version of azpbs autoscale. Here is a simple example where two machines would be allocated for a -l select=2 submission. As you can see, job id 1 is using one ncpus slot on each of two nodes.

azpbs demand
NAME      JOB_IDS NCPUS
execute-1 1       0/1  
execute-2 1       0/1

azpbs create_nodes

Manually creating nodes via azpbs create_nodes is also quite powerful. Note that it has a --dry-run mode as well.

Here is an example of allocating 100 slots of 1gb of memory each (mem=memory::1g), i.e. 1gb partitions. Since our nodes have 4gb each, we expect 25 nodes to be created.

azpbs create_nodes --keep-alive --vm-size Standard_F2s_v2 --slots 100 --constraint-expr mem=memory::1g --dry-run --output-columns name,/mem
NAME       MEM        
execute-1  0.00g/4.00g
...
execute-25 0.00g/4.00g

azpbs delete_nodes / remove_nodes

azpbs supports safely removing a node from PBS. The difference between delete_nodes and remove_nodes is simply that delete_nodes, on top of removing the node from PBS, will also terminate the underlying instance. You may delete by hostname or node name. Pass in * to delete/remove all nodes.
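
For example (the hostnames are illustrative, and the -H flag selects by hostname, matching the remove_nodes usage reported in the issues further down; check azpbs delete_nodes --help for the exact flags in your version):

# remove two hosts from PBS but leave the VMs running
azpbs remove_nodes -H ip-0A010907 -H ip-0A010908

# remove a host from PBS and terminate the underlying instance
azpbs delete_nodes -H ip-0A010907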

azpbs shell

azpbs shell is a more advanced command that can be quite powerful. This command fully constructs the in-memory structures used by azpbs autoscale so that the user can interact with them dynamically. All of the objects are passed into the local scope and can be listed by calling pbsprohelp(). This is a powerful debugging tool.

[root@pbsserver ~] azpbs shell
CycleCloud Autoscale Shell
>>> pbsprohelp()
config               - dict representing autoscale configuration.
cli                  - object representing the CLI commands
pbs_env              - object that contains data structures for queues, resources etc
queues               - dict of queue name -> PBSProQueue object
jobs                 - dict of job id -> Autoscale Job
scheduler_nodes      - dict of hostname -> node objects. These represent purely what the scheduler sees without additional booting nodes / information from CycleCloud
resource_definitions - dict of resource name -> PBSProResourceDefinition objects.
default_scheduler    - PBSProScheduler object representing the default scheduler.
pbs_driver           - PBSProDriver object that interacts directly with PBS and implements PBS specific behavior for scalelib.
demand_calc          - ScaleLib DemandCalculator - pseudo-scheduler that determines what nodes are unnecessary
node_mgr             - ScaleLib NodeManager - interacts with CycleCloud for all node related activities - creation, deletion, limits, buckets etc.
pbsprohelp            - This help function
>>> queues.workq.resources_default
{'place': 'scatter:group=group_id'}
>>> jobs["0"].node_count
2

azpbs shell can also take a --script path/to/python_file.py argument, giving the script full access to the in-memory structures (again passed in through the local scope) so that you can customize the autoscale behavior.

[root@pbsserver ~] cat example.py 
for bucket in node_mgr.get_buckets():
    print(bucket.nodearray, bucket.vm_size, bucket.available_count)

[root@pbsserver ~] azpbs shell -s example.py 
execute Standard_F2s_v2 50
execute Standard_D2_v4 50
execute Standard_E2s_v4 50

Timeouts

By default, idle and boot timeouts are set across all nodes. For example, the boot timeout in seconds:

   "boot_timeout": 3600

You can also set these per nodearray.

   "boot_timeout": {"default": 3600, "nodearray1": 7200, "nodearray2": 900},

Logging

By default, azpbs will use /opt/cycle/pbspro/logging.conf, as defined in /opt/cycle/pbspro/autoscale.json. This will create the following logs.

/opt/cycle/pbspro/autoscale.log

autoscale.log is the main log for all azpbs invocations.

/opt/cycle/pbspro/qcmd.log

qcmd.log records every PBS executable invocation and its response, so you can see exactly what commands are being run.

/opt/cycle/pbspro/demand.log

Every autoscale iteration, azpbs prints out a table of all of the nodes, their resources, their assigned jobs and more. This log contains these values and nothing else.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

cyclecloud-pbspro's People

Contributors

aditigaur4, adriankjohnson, anhoward, atomic-penguin, bwatrous, dpwatrous, hmeiland, jermth, microsoft-github-policy-service[bot], microsoftopensource, msftgits, ryanhamel, staer


cyclecloud-pbspro's Issues

Add HBv3 support

HBv3 resources are not recognized by scalelib.
The workaround is to add them at the beginning of default_resources in autoscale.json:

"default_resources": [
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120rs_v3"
},
"value": 120
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-96rs_v3"
},
"value": 96
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-96rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-64rs_v3"
},
"value": 64
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-64rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-32rs_v3"
},
"value": 32
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-32rs_v3"
},
"value": "memory::448g"
}

Issues with output files and working directory

Hello,
I'm having some trouble with the working directory and job outputs. The jobs run fine but any output file is nowhere to be found. The details of a job follow:

Job Id: 0.ip-0A000204
    Job_Name = HPCG31_4
    Job_Owner = afernandez@ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloud
        app.net
    resources_used.cpupercent = 326
    resources_used.cput = 00:03:19
    resources_used.mem = 1889692kb
    resources_used.ncpus = 4
    resources_used.vmem = 3353476kb
    resources_used.walltime = 00:00:54
    job_state = E
    queue = workq
    server = ip-0A000204
    Checkpoint = u
    ctime = Thu Dec 19 19:56:48 2019
    Error_Path = ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.ne
        t:/home/afernandez/HPCG31_4.e0
    exec_host = ip-0A000205/0*2+ip-0A000206/0*2
    exec_vnode = (ip-0A000205:ncpus=2)+(ip-0A000206:ncpus=2)
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Dec 19 20:02:58 2019
    Output_Path = ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.n
        et:/home/afernandez/hpcg31_4.out
    Priority = 0
    qtime = Thu Dec 19 19:56:48 2019
    Rerunable = True
    Resource_List.mpiprocs = 4
    Resource_List.ncpus = 4
    Resource_List.nodect = 2
    Resource_List.nodes = 2:ppn=2
    Resource_List.place = scatter:group=group_id
    Resource_List.select = 2:ncpus=2:mpiprocs=2
    Resource_List.slot_type = execute
    Resource_List.ungrouped = false
    Resource_List.walltime = 100:30:00
    stime = Thu Dec 19 20:02:04 2019
    session_id = 6412
    jobdir = /home/afernandez
    substate = 51
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/home/afernandez,PBS_O_LOGNAME=afernandez,
        PBS_O_WORKDIR=/home/afernandez,PBS_O_LANG=en_US.UTF-8,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/cycl
        e/jetpack/bin:/opt/pbs/bin:/opt/openmpi/bin:/home/afernandez/.local/bin:/ho
        me/afernandez/bin,PBS_O_MAIL=/var/spool/mail/afernandez,PBS_O_QUEUE=workq,
        PBS_O_HOST=ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp
        .net
    comment = Job run at Thu Dec 19 at 20:02 on (ip-0A000205:ncpus=2)+(ip-0A000
        206:ncpus=2)
    etime = Thu Dec 19 19:56:49 2019
    run_count = 1
    Exit_status = 0
    Submit_arguments = HPCG31_4.sh
    pset = group_id=single
    project = _pbs_project_default

I don't understand why the output, error, and working paths are showing as the IP plus iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.net, or if this is the reason the files are not showing up in the home directory.
Thanks.

Nodes are often not unregistered from PBS

If I run azpbs autoscale, here is the output; these nodes have been deprovisioned by CycleCloud and no longer exist.
[root@scheduler hpcadmin]# azpbs autoscale
2021-04-01 09:11:18,334 ERROR: Could not convert private_ip(None) to hostname using gethostbyaddr() for SchedulerNode(254rq000000, 254rq000000, unknown, None): gethostbyaddr() argument 1 must be str, bytes or bytearray, not None
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK NGPUS GROUP_ID MACHINETYPE MEM NCPUS NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
254rq000000 254rq000000 down running unknown 20.00gb/20.00gb 0/0 s_v2_pg0 456.00gb 120 unknown hb120rs_v2 false 0.0 0.0
[root@scheduler hpcadmin]#

Autoscaler does not handle badly formatted JSON qstat output well

cyclecloud-pbspro version 2.0.10

With OpenPBS 19.1.1, the JSON output of qstat can be badly formatted in the case of complex environment variables.
For example, qstat -f <job_id> -F json | jq '.' will return an error, meaning the JSON is badly formatted.
As a result, the autoscaler stops working, so a single bad job can hang the whole system and no new nodes can be added.

Here is the output of azpbs autoscale:

File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/environment.py", line 58, in from_driver
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/init.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47623-47628: ordinal not in range(128)
Call stack:
  File "/root/bin/azpbs", line 4, in
    main()
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 284, in main
    clilib.main(argv or sys.argv[1:], "pbspro", PBSCLI())
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 1739, in main
    args.func(**kwargs)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 1315, in analyze
    dcalc = self._demand(config, ctx_handler=ctx_handler)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 360, in _demand
    dcalc, jobs = self._demand_calc(config, driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 113, in _demand_calc
    pbs_env = self._pbs_env(pbs_driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 106, in _pbs_env
    self.__pbs_env = environment.from_driver(pbs_driver.config, pbs_driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/environment.py", line 58, in from_driver
    jobs = pbs_driver.parse_jobs(queues, default_scheduler.resources_for_scheduling)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/driver.py", line 414, in parse_jobs
    self.pbscmd, self.resource_definitions, queues, resources_for_scheduling
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/driver.py", line 530, in parse_jobs
    response: Dict = pbscmd.qstat_json("-f", "-t")
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 31, in qstat_json
    response = self.qstat(*args)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 25, in qstat
    return self._check_output(cmd)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 76, in _check_output
    logger.info("Response: %s", ret)
  File "/usr/lib64/python3.6/logging/init.py", line 1308, in info
    self._log(INFO, msg, args, **kwargs)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/hpclogging.py", line 45, in _log
    **stacklevelkw
Message: 'Response: %s'

azpbs remove_nodes doesn't remove them from pbsnodes

To repro, manually add nodes to the PBS cluster with Add Nodes in the UI. They get joined to the cluster. Then remove them with:

azpbs remove_nodes -H ip-0A010907 -H ip-0A010908 --force

Note that --force is required. They seem to temporarily go through the "down state" but recover to free.

Does this command work?

[root@ip-0A010906 ~]# pbsnodes -aS
vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
ip-0A010907     down            --       --       ip-0a010907     --              4gb       1       0       0 --
ip-0A010908     state-unknown   --       --       ip-0a010908     --              4gb       1       0       0 --
[root@ip-0A010906 ~]# pbsnodes -aS
vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
ip-0A010907     free            --       --       ip-0a010907     --              4gb       1       0       0 --
ip-0A010908     initializing    --       --       ip-0a010908     --              4gb       1       0       0 --

Jetpack error while deploying pbspro cluster

Hello,
I'm trying to deploy a pbspro cluster with customized images. However, I keep running into the following error

Check /opt/cycle/jetpack/logs/installation.log for more information
Get more help on this issue
Detail:

Traceback (most recent call last):
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/admin/validate.py", line 27, in execute
    jetpack.util.test_cyclecloud_connection(connection)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 439, in test_cyclecloud_connection
    r = _connect_to_cyclecloud(connection, path)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 482, in _connect_to_cyclecloud
    conn, headers = jetpack.util.get_cyclecloud_connection(config)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 380, in get_cyclecloud_connection
    if jetpack.config.get('cyclecloud.skip_ssl_validation', default_skip_ssl_validation):
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/config.py", line 29, in get
    raise ConfigError(UNKNOWN_ERROR)
ConfigError: An unknown error occurred while processing the configuration data file.

I've boiled down the customized image to an updated CentOS configuration where 'cmake' is installed. I'm unfamiliar with Jetpack, so I cannot figure out the origin of the problem or how to fix it. Thanks.

Modifying stack soft limit

Hi,
I've noticed CycleCloud recently changed the behavior for stack size limits.
Now it adds this:

$ cat /etc/security/limits.conf |grep stack
#        - stack - max stack size (KB)
*               hard    stack           unlimited
*               soft    stack           unlimited

However, I am not sure where this comes from; I can't find it in this repo, and it is not from the CentOS HPC image as far as I can tell (https://github.com/openlogic/AzureBuildCentOS).

In any case, if someone else runs into this: Abaqus, at least, does not accept unlimited as a soft limit.

Greetings
Klaas

Slot_type seems to be ignored when provisioning nodes

I have a cyclecloud cluster generated from a modified version of the PBSpro template. In it I have added different types of nodes for execution, such as memory optimized nodes and HPC nodes for heavy duty numerical simulations.

When scheduling a job that makes use of #PBS -l slot_type=name_slot, I have found that sometimes resources are allocated that do not match the configuration of the slot. The number of nodes and processes per node is respected, but the actual type of node is not, which can result in reduced performance for certain applications.

autoscaler is not adding nodes

Running a non-MPI job using “-l select=1:slot_type=execute:ungrouped=true” as the select statement.
The execute nodearray is not spot, but the autoscaler is not adding a new node.

[xpillons@ondemand ~]$ qstat -fx 1651
Job Id: 1651.scheduler
Job_Name = sys-dashboard-sys-codeserver
Job_Owner = [email protected]
job_state = Q
queue = workq
server = scheduler
Checkpoint = u
ctime = Fri Nov 19 09:56:25 2021
Error_Path = ondemand.internal.cloudapp.net:/anfhome/xpillons/ondemand/data
/sys/dashboard/batch_connect/sys/codeserver/output/c1144623-b9b5-44a2-8
5b1-93fa66a0dc14/sys-dashboard-sys-codeserver.e1651
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Nov 19 09:56:25 2021
Output_Path = ondemand.internal.cloudapp.net:/anfhome/xpillons/ondemand/dat
a/sys/dashboard/batch_connect/sys/codeserver/output/c1144623-b9b5-44a2-
85b1-93fa66a0dc14/output.log
Priority = 0
qtime = Fri Nov 19 09:56:25 2021
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = scatter:excl
Resource_List.select = 1:slot_type=execute:ungrouped=true
Resource_List.slot_type = execute
Resource_List.ungrouped = false
Resource_List.walltime = 03:00:00
Shell_Path_List = /bin/bash
substate = 10
Variable_List = PBS_O_HOME=/anfhome/xpillons,PBS_O_LANG=C,
PBS_O_LOGNAME=xpillons,
PBS_O_PATH=/var/www/ood/apps/sys/dashboard/tmp/node_modules/yarn/bin:/
opt/ood/ondemand/root/usr/share/gems/2.7/bin:/opt/rh/rh-nodejs12/root/u
sr/bin:/opt/rh/rh-ruby27/root/usr/local/bin:/opt/rh/rh-ruby27/root/usr/
bin:/opt/rh/httpd24/root/usr/bin:/opt/rh/httpd24/root/usr/sbin:/opt/ood
/ondemand/root/usr/bin:/opt/ood/ondemand/root/usr/sbin:/sbin:/bin:/usr/
sbin:/usr/bin,PBS_O_MAIL=/var/mail/root,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/anfhome/xpillons/ondemand/data/sys/dashboard/batch_conn
ect/sys/codeserver/output/c1144623-b9b5-44a2-85b1-93fa66a0dc14,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=ondemand.internal.cloudapp.net
etime = Fri Nov 19 09:56:25 2021
Submit_arguments = -N sys-dashboard-sys-codeserver -S /bin/bash -o /anfhome
/xpillons/ondemand/data/sys/dashboard/batch_connect/sys/codeserver/outp
ut/c1144623-b9b5-44a2-85b1-93fa66a0dc14/output.log -j oe -l select=1:sl
ot_type=execute:ungrouped=true -l walltime=03:00:00
project = _pbs_project_default

[root@scheduler ~]# azpbs analyze --job-id 1651
NotInAPlacementGroup : Bucket[array=execute vm_size=Standard_F2s_v2 id=ac4fc82f-3d82-4f20-bd6e-67299dcdd388] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 id=4167f5f8-3a30-46df-92ba-8f77e0b636cd] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 id=aed72457-3b3c-44a0-ab8f-9ebb1a954fee] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb60rs vm_size=Standard_HB60rs id=53bb8fd0-e62e-468f-b278-93da9e6fef3e] is not in a placement group
NotInAPlacementGroup : Bucket[array=hc44rs vm_size=Standard_HC44rs id=be3124ac-3894-496a-8580-ef06aba68273] is not in a placement group
NotInAPlacementGroup : Bucket[array=viz vm_size=Standard_D8s_v3 id=03e27828-b4bd-4678-a8e5-5886599bc6e5] is not in a placement group
NotInAPlacementGroup : Bucket[array=viz3d vm_size=Standard_NV6 id=ab76e8b1-0632-49a9-be16-93d880341855] is not in a placement group
InvalidOption : Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=Standard_F2s_v2 attr=ungrouped]
InvalidOption : Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=Standard_F2s_v2 attr=ungrouped]
InvalidOption : Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=Standard_HB60rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=Standard_HB60rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=Standard_HC44rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=Standard_HC44rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_D8s_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_D8s_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_NV6 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_NV6 attr=slot_type]

NoCandidatesFound : SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=None),reason=Bucket[array=execute vm_size=Standard_F2s_v2 id=ac4fc82f-3d82-4f20-bd6e-67299dcdd388] is not
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=None),reason=Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 id=4167f5f8-3a30-46df-92ba-8f77e0
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=None),reason=Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 id=aed72457-3b3c-44a0-ab8f-9ebb1a
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=None),reason=Bucket[array=hb60rs vm_size=Standard_HB60rs id=53bb8fd0-e62e-468f-b278-93da9e6fef3e] is not i
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=None),reason=Bucket[array=hc44rs vm_size=Standard_HC44rs id=be3124ac-3894-496a-8580-ef06aba68273] is not i
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=None),reason=Bucket[array=viz vm_size=Standard_D8s_v3 id=03e27828-b4bd-4678-a8e5-5886599bc6e5] is not in a pl
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=None),reason=Bucket[array=viz3d vm_size=Standard_NV6 id=ab76e8b1-0632-49a9-be16-93d880341855] is not in a plac
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=Standard_F2s_v2_pg0),reason=Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=St
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=Standard_F2s_v2_pg1),reason=Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=St
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=Standard_HB120rs_v2_pg0),reason=Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=Standard_HB120rs_v2_pg1),reason=Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=Standard_HB120rs_v3_pg0),reason=Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=Standard_HB120rs_v3_pg1),reason=Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=Standard_HB60rs_pg0),reason=Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=Standard_HB60rs_pg1),reason=Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=Standard_HC44rs_pg0),reason=Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=Standard_HC44rs_pg1),reason=Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=Standard_D8s_v3_pg0),reason=Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=Standard_D8s_v3_pg1),reason=Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=Standard_NV6_pg0),reason=Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=Standard_NV6_pg1),reason=Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_

[root@scheduler ~]# azpbs buckets
NODEARRAY PLACEMENT_GROUP VM_SIZE VCPU_COUNT PCPU_COUNT MEMORY AVAILABLE_COUNT NCPUS NGPUS DISK HOST SLOT_TYPE GROUP_ID MEM CCNODEID UNGROUPED
execute Standard_F2s_v2 2 1 4.00g 512 1 0 20.00g execute none 4.00g true
execute Standard_F2s_v2_pg0 Standard_F2s_v2 2 1 4.00g 100 1 0 20.00g execute Standard_F2s_v2_pg0 4.00g false
execute Standard_F2s_v2_pg1 Standard_F2s_v2 2 1 4.00g 100 1 0 20.00g execute Standard_F2s_v2_pg1 4.00g false
hb120rs_v2 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 none 456.00g true
hb120rs_v2 Standard_HB120rs_v2_pg0 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 Standard_HB120rs_v2_pg0 456.00g false
hb120rs_v2 Standard_HB120rs_v2_pg1 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 Standard_HB120rs_v2_pg1 456.00g false
hb120rs_v3 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 none 448.00g true
hb120rs_v3 Standard_HB120rs_v3_pg0 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 Standard_HB120rs_v3_pg0 448.00g false
hb120rs_v3 Standard_HB120rs_v3_pg1 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 Standard_HB120rs_v3_pg1 448.00g false
hb60rs Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs none 228.00g true
hb60rs Standard_HB60rs_pg0 Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs Standard_HB60rs_pg0 228.00g false
hb60rs Standard_HB60rs_pg1 Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs Standard_HB60rs_pg1 228.00g false
hc44rs Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs none 352.00g true
hc44rs Standard_HC44rs_pg0 Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs Standard_HC44rs_pg0 352.00g false
hc44rs Standard_HC44rs_pg1 Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs Standard_HC44rs_pg1 352.00g false
viz Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz none 32.00g true
viz Standard_D8s_v3_pg0 Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz Standard_D8s_v3_pg0 32.00g false
viz Standard_D8s_v3_pg1 Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz Standard_D8s_v3_pg1 32.00g false
viz3d Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d none 56.00g true
viz3d Standard_NV6_pg0 Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d Standard_NV6_pg0 56.00g false
viz3d Standard_NV6_pg1 Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d Standard_NV6_pg1 56.00g false

Request for PBS Pro 19.1.2

I would like to use PBS Pro 19.1.2 to match the version used in other production environments.
Any point release of CentOS 7 is fine as long as it is a stable version.

slot_type is case sensitive

If a slot type is defined as hbv3 then the following job will never start

qsub -l select=1:slot_type=HBv3 -I

hwloc-libs RPM is no longer provided in the EPEL repo for CentOS 8

OpenPBS installers fail with the missing package hwloc-libs. This impacts CentOS 8.

The workaround is to download it from AlmaLinux:
- wget -O /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm https://repo.almalinux.org/almalinux/8.3-beta/BaseOS/x86_64/os/Packages/hwloc-libs-1.11.9-3.el8.x86_64.rpm
- yum install -y /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm
- rm -f /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm

Can we redistribute in release artifacts?
