
cyclecloud-pbspro's Introduction

Azure CycleCloud OpenPBS project

OpenPBS is a highly configurable open source workload manager. See the OpenPBS project site for an overview and the PBSpro documentation for more information on using, configuring, and troubleshooting OpenPBS in general.

Versions

OpenPBS (formerly PBS Professional OSS) is released as part of version 20.0.0. PBSPro OSS is still available in CycleCloud by specifying the PBSPro OSS version in the cluster configuration:

   [[[configuration]]]
   pbspro.version = 18.1.4-0

Installing Manually

Note: When using the OpenPBS cluster type that ships with CycleCloud, the autoscaler and default queues are already installed.

First, download the installer package from GitHub. For example, the commands below download the 2.0.21 release.

# Prerequisite: python3, 3.6 or newer, must be installed and in the PATH
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.21/cyclecloud-pbspro-pkg-2.0.21.tar.gz
tar xzf cyclecloud-pbspro-pkg-2.0.21.tar.gz
cd cyclecloud-pbspro
# Optional, but recommended. Adds relevant resources and enables strict placement
./initialize_pbs.sh
# Optional. Sets up workq as a colocated, MPI-focused queue and creates htcq for non-MPI workloads.
./initialize_default_queues.sh

# Creates the azpbs autoscaler
./install.sh  --venv /opt/cycle/pbspro/venv

# If you have jetpack available, you may use the following:
# ./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
#                              --username $(jetpack config cyclecloud.config.username) \
#                              --password $(jetpack config cyclecloud.config.password) \
#                              --url $(jetpack config cyclecloud.config.web_server) \
#                              --cluster-name $(jetpack config cyclecloud.cluster.name)

# Otherwise insert your username, password, url, and cluster name here.
./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
                             --username user \
                             --password password \
                             --url https://fqdn:port \
                             --cluster-name cluster_name

# Lastly, run this to understand any changes that may be required.
# For example, you typically have to add the ungrouped and group_id resources
# to the /var/spool/pbs/sched_priv/sched_config file and restart PBS.
## [root@scheduler cyclecloud-pbspro]# azpbs validate
## ungrouped is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
## group_id is not defined for line 'resources:' in /var/spool/pbs/sched_priv/sched_config. Please add this and restart PBS
azpbs validate
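
If azpbs validate reports missing resources like the messages above, one way to apply the fix is shown below. This is a minimal sketch: it assumes your sched_config contains a standard resources: "..." line and that PBS is managed by systemd, so review the resulting line and the restart method for your installation.

# Append the resources reported by azpbs validate to the scheduler's resources line (run this only once),
# then restart PBS so the scheduler picks up the change.
sed -i 's/^resources: "\(.*\)"/resources: "\1, ungrouped, group_id"/' /var/spool/pbs/sched_priv/sched_config
systemctl restart pbs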

Autoscale and scalesets

To help ensure that the correct VMs are provisioned for different types of jobs, CycleCloud treats the autoscaling of MPI and serial jobs differently in OpenPBS clusters.

For serial jobs, multiple VM scalesets (VMSS) are used in order to scale as quickly as possible. For MPI jobs to use the InfiniBand fabric on VM sizes that support it, all of the nodes allocated to the job have to be deployed in the same VMSS. CycleCloud handles this by assigning a PlacementGroupId that groups nodes with the same id into the same VMSS. By default, workq appends the equivalent of -l place=scatter:group=group_id by using native queue defaults.
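
For example, assuming the default queues created by initialize_default_queues.sh above, an MPI-style job submitted to workq picks up the colocated placement automatically, while a throughput job can target htcq or request ungrouped=true to avoid placement groups. The application names and resource counts below are illustrative:

# MPI job: workq's default place=scatter:group=group_id keeps all chunks in one VMSS
echo ./my_mpi_app | qsub -l select=4:ncpus=2:mpiprocs=2

# Throughput job: htcq (or ungrouped=true) can scale across multiple scalesets
echo ./my_serial_app | qsub -q htcq -l select=1:ncpus=1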

Hooks

Our PBS integration uses three different PBS hooks. autoscale does the bulk of the work required to scale the cluster up and down; all relevant log messages can be seen in /opt/cycle/pbspro/autoscale.log. cycle_sub_hook validates jobs unless they use -l nodes syntax, in which case those jobs are held and later processed by our last hook, cycle_sub_hook_periodic.
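
You can list the installed hooks and their settings with qmgr; the hook names below reflect the three hooks described above, and the exact output will vary by installation.

qmgr -c "list hook"
## Expect to see autoscale, cycle_sub_hook, and cycle_sub_hook_periodic among the listed hooks.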

Autoscale Hook

The most important is the autoscale hook, which runs by default on a 15-second interval. You can adjust this frequency by running:

qmgr -c "set hook autoscale freq=NUM_SECONDS"
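
For example, to run the autoscale hook once a minute instead of the default 15 seconds:

qmgr -c "set hook autoscale freq=60"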

Submission Hooks

cycle_sub_hook will validate that your job has the proper placement restrictions set. If it encounters a problem, it will output a detailed message explaining why the job was rejected and how to resolve the issue. For example:

$> echo sleep 300 | qsub -l select=2 -l place=scatter
Please do one of the following
    1) Ensure this placement is set by adding group=group_id to your -l place= statement
        Note: Queue workq's resource_defaults.place=group=group_id
    2) Add -l skipcyclesubhook=true on this job
        Note: If the resource does not exist, create it -> qmgr -c 'create resource skipcyclesubhook type=boolean'
    3) Disable this hook for this queue via queue defaults -> qmgr -c 'set queue workq resources_default.skipcyclesubhook=true'
    4) Disable this plugin - 'qmgr -c 'set hook cycle_sub_hook enabled=false'
        Note: Disabling this plugin may prevent -l nodes= style submissions from working properly.

One important note: if you are using Torque-style submissions, i.e. those that use -l nodes instead of -l select, PBS will simply convert that submission into an equivalent -l select style submission. However, the default placement defined for the queue is not respected by PBS when converting the job. To get around this, we hold the job, and our last hook, cycle_sub_hook_periodic, periodically updates the job's placement and releases it.
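
For example, a Torque-style request such as the one below is converted by PBS into roughly -l select=2:ncpus=2:mpiprocs=2, but without the queue's default placement; the job is therefore held briefly until cycle_sub_hook_periodic adds place=scatter:group=group_id and releases it.

# converted by PBS to an equivalent select statement, without the queue's default placement
echo sleep 300 | qsub -l nodes=2:ppn=2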

Configuring Resources

The cyclecloud-pbspro application matches PBS resources to Azure cloud resources to provide rich autoscaling and cluster configuration tools. The application will be deployed automatically for clusters created via the CycleCloud UI, or it can be installed on any PBS admin host of an existing cluster. For more information on defining resources in autoscale.json, see ScaleLib's documentation.

The default resources defined in the cluster template we ship are:

{"default_resources": [
   {
      "select": {},
      "name": "ncpus",
      "value": "node.vcpu_count"
   },
   {
      "select": {},
      "name": "group_id",
      "value": "node.placement_group"
   },
   {
      "select": {},
      "name": "host",
      "value": "node.hostname"
   },
   {
      "select": {},
      "name": "mem",
      "value": "node.memory"
   },
   {
      "select": {},
      "name": "vm_size",
      "value": "node.vm_size"
   },
   {
      "select": {},
      "name": "disk",
      "value": "size::20g"
   }]
}

Note that disk is currently hardcoded to size::20g because of platform limitations in determining how much disk a node will have. Here is an example of handling VM size-specific disk sizes:

   {
      "select": {"node.vm_size": "Standard_F2"},
      "name": "disk",
      "value": "size::20g"
   },
   {
      "select": {"node.vm_size": "Standard_H44rs"},
      "name": "disk",
      "value": "size::2t"
   }

azpbs cli

The azpbs CLI is the main interface for all autoscaling behavior. Note that it has fairly powerful autocomplete capabilities. For example, after typing azpbs create_nodes --vm-size you can tab-complete the list of possible VM sizes. Autocomplete information is updated every azpbs autoscale cycle, but it can also be refreshed manually by running azpbs refresh_autocomplete.

Command                  Description
autoscale                End-to-end autoscale process, including creation, deletion and joining of nodes.
buckets                  Prints out autoscale bucket information, such as limits.
config                   Writes the effective autoscale config, after any preprocessing, to stdout.
create_nodes             Creates a set of nodes given various constraints. A CLI version of the NodeManager interface.
default_output_columns   Prints the default output columns for a given command.
delete_nodes             Deletes nodes, including draining and post-delete handling.
demand                   Dry-run version of autoscale.
initconfig               Creates an initial autoscale config and writes it to stdout.
jobs                     Writes out autoscale jobs as JSON. Note: running jobs are excluded.
join_nodes               Adds selected nodes to the scheduler.
limits                   Writes a detailed set of limits for each bucket. Defaults to JSON because of the number of fields.
nodes                    Queries nodes.
refresh_autocomplete     Refreshes local autocomplete information for cluster-specific resources and nodes.
remove_nodes             Removes nodes from the scheduler without terminating the actual instances.
retry_failed_nodes       Retries all nodes in a failed state.
shell                    Interactive Python shell with relevant objects in local scope. Use --script to run Python scripts.
validate                 Runs basic validation of the environment.
validate_constraint      Validates one or more constraints and outputs them as JSON.
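
A couple of typical invocations are shown below. The column names are illustrative; the --output-columns flag mirrors its usage in the buckets and create_nodes examples later in this document, and azpbs default_output_columns will show what a given command prints by default.

# show the current nodes with a reduced set of columns
azpbs nodes --output-columns name,hostname,job_ids

# save the effective (post-preprocessing) autoscale configuration for inspection
azpbs config > /tmp/effective_autoscale.json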

azpbs buckets

Use the azpbs buckets command to see which buckets of compute are available, how many nodes each can provide, and what resources they have.

azpbs buckets --output-columns nodearray,placement_group,vm_size,ncpus,mem,available_count
NODEARRAY PLACEMENT_GROUP     VM_SIZE         NCPUS MEM     AVAILABLE_COUNT
execute                       Standard_F2s_v2 1     4.00g   50             
execute                       Standard_D2_v4  1     8.00g   50             
execute                       Standard_E2s_v4 1     16.00g  50             
execute                       Standard_NC6    6     56.00g  16             
execute                       Standard_A11    16    112.00g 6              
execute   Standard_F2s_v2_pg0 Standard_F2s_v2 1     4.00g   50             
execute   Standard_F2s_v2_pg1 Standard_F2s_v2 1     4.00g   50             
execute   Standard_D2_v4_pg0  Standard_D2_v4  1     8.00g   50             
execute   Standard_D2_v4_pg1  Standard_D2_v4  1     8.00g   50             
execute   Standard_E2s_v4_pg0 Standard_E2s_v4 1     16.00g  50             
execute   Standard_E2s_v4_pg1 Standard_E2s_v4 1     16.00g  50             
execute   Standard_NC6_pg0    Standard_NC6    6     56.00g  16             
execute   Standard_NC6_pg1    Standard_NC6    6     56.00g  16             
execute   Standard_A11_pg0    Standard_A11    16    112.00g 6              
execute   Standard_A11_pg1    Standard_A11    16    112.00g 6

azpbs demand

It is common to want to test out autoscaling without actually allocating anything. azpbs demand is a dry-run version of azpbs autoscale. Here is a simple example where two machines would be allocated for a -l select=2 submission. As you can see, job id 1 is using one ncpus slot on each of two nodes.

azpbs demand
NAME      JOB_IDS NCPUS
execute-1 1       0/1  
execute-2 1       0/1

azpbs create_nodes

Manually creating nodes via azpbs create_nodes is also quite powerful. Note that it has a --dry-run mode as well.

Here is an example of allocating 100 slots of 1gb of memory each (mem=memory::1g), i.e. 1gb partitions. Since our nodes have 4gb each, we expect 25 nodes to be created.

azpbs create_nodes --keep-alive --vm-size Standard_F2s_v2 --slots 100 --constraint-expr mem=memory::1g --dry-run --output-columns name,/mem
NAME       MEM        
execute-1  0.00g/4.00g
...
execute-25 0.00g/4.00g

azpbs delete_nodes / remove_nodes

azpbs supports safely removing a node from PBS. The difference between delete_nodes and remove_nodes is simply that delete_nodes, on top of removing the node from PBS, will also terminate the underlying instance. You may delete by hostname or node name. Pass in * to delete/remove all nodes.
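
For example (the hostnames are illustrative, and the -H flag selects by hostname, matching the remove_nodes usage reported in the issues further down; check azpbs delete_nodes --help for the exact flags in your version):

# remove two hosts from PBS but leave the VMs running
azpbs remove_nodes -H ip-0A010907 -H ip-0A010908

# remove a host from PBS and terminate the underlying instance
azpbs delete_nodes -H ip-0A010907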

azpbs shell

azpbs shell is a more advanced command that can be quite powerful. This command fully constructs the in-memory structures used by azpbs autoscale so that the user can interact with them dynamically. All of the objects are passed into the local scope and can be listed by calling pbsprohelp(). This is a powerful debugging tool.

[root@pbsserver ~] azpbs shell
CycleCloud Autoscale Shell
>>> pbsprohelp()
config               - dict representing autoscale configuration.
cli                  - object representing the CLI commands
pbs_env              - object that contains data structures for queues, resources etc
queues               - dict of queue name -> PBSProQueue object
jobs                 - dict of job id -> Autoscale Job
scheduler_nodes      - dict of hostname -> node objects. These represent purely what the scheduler sees without additional booting nodes / information from CycleCloud
resource_definitions - dict of resource name -> PBSProResourceDefinition objects.
default_scheduler    - PBSProScheduler object representing the default scheduler.
pbs_driver           - PBSProDriver object that interacts directly with PBS and implements PBS specific behavior for scalelib.
demand_calc          - ScaleLib DemandCalculator - pseudo-scheduler that determines what nodes are unnecessary
node_mgr             - ScaleLib NodeManager - interacts with CycleCloud for all node related activities - creation, deletion, limits, buckets etc.
pbsprohelp            - This help function
>>> queues.workq.resources_default
{'place': 'scatter:group=group_id'}
>>> jobs["0"].node_count
2

azpbs shell can also take a --script path/to/python_file.py argument, giving the script full access to the in-memory structures (again passed in through the local scope) so that you can customize the autoscale behavior.

[root@pbsserver ~] cat example.py 
for bucket in node_mgr.get_buckets():
    print(bucket.nodearray, bucket.vm_size, bucket.available_count)

[root@pbsserver ~] azpbs shell -s example.py 
execute Standard_F2s_v2 50
execute Standard_D2_v4 50
execute Standard_E2s_v4 50

Timeouts

By default, idle and boot timeouts are set across all nodes. For example, the boot timeout in seconds:

   "boot_timeout": 3600

You can also set these per nodearray.

   "boot_timeout": {"default": 3600, "nodearray1": 7200, "nodearray2": 900},

Logging

By default, azpbs will use /opt/cycle/pbspro/logging.conf, as defined in /opt/cycle/pbspro/autoscale.json. This will create the following logs.

/opt/cycle/pbspro/autoscale.log

autoscale.log is the main log for all azpbs invocations.

/opt/cycle/pbspro/qcmd.log

qcmd.log records every PBS executable invocation and its response, so you can see exactly what commands are being run.

/opt/cycle/pbspro/demand.log

Every autoscale iteration, azpbs prints out a table of all of the nodes, their resources, their assigned jobs and more. This log contains these values and nothing else.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

cyclecloud-pbspro's People

Contributors

aditigaur4, adriankjohnson, anhoward, atomic-penguin, bwatrous, dpwatrous, hmeiland, jermth, microsoft-github-policy-service[bot], microsoftopensource, msftgits, ryanhamel, staer


cyclecloud-pbspro's Issues

Add HBv3 support

HBv3 resources are not recognized by scalelib.
The workaround is to add them at the beginning of default_resources in autoscale.json:

"default_resources": [
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120rs_v3"
},
"value": 120
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-96rs_v3"
},
"value": 96
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-96rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-64rs_v3"
},
"value": 64
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-64rs_v3"
},
"value": "memory::448g"
},
{
"name": "ncpus",
"select": {
"node.vm_size": "Standard_HB120-32rs_v3"
},
"value": 32
},
{
"name": "mem",
"select": {
"node.vm_size": "Standard_HB120-32rs_v3"
},
"value": "memory::448g"
}

Issues with output files and working directory

Hello,
I'm having some trouble with the working directory and job outputs. The jobs run fine but any output file is nowhere to be found. The details of a job follow:

Job Id: 0.ip-0A000204
    Job_Name = HPCG31_4
    Job_Owner = afernandez@ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloud
        app.net
    resources_used.cpupercent = 326
    resources_used.cput = 00:03:19
    resources_used.mem = 1889692kb
    resources_used.ncpus = 4
    resources_used.vmem = 3353476kb
    resources_used.walltime = 00:00:54
    job_state = E
    queue = workq
    server = ip-0A000204
    Checkpoint = u
    ctime = Thu Dec 19 19:56:48 2019
    Error_Path = ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.ne
        t:/home/afernandez/HPCG31_4.e0
    exec_host = ip-0A000205/0*2+ip-0A000206/0*2
    exec_vnode = (ip-0A000205:ncpus=2)+(ip-0A000206:ncpus=2)
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Dec 19 20:02:58 2019
    Output_Path = ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.n
        et:/home/afernandez/hpcg31_4.out
    Priority = 0
    qtime = Thu Dec 19 19:56:48 2019
    Rerunable = True
    Resource_List.mpiprocs = 4
    Resource_List.ncpus = 4
    Resource_List.nodect = 2
    Resource_List.nodes = 2:ppn=2
    Resource_List.place = scatter:group=group_id
    Resource_List.select = 2:ncpus=2:mpiprocs=2
    Resource_List.slot_type = execute
    Resource_List.ungrouped = false
    Resource_List.walltime = 100:30:00
    stime = Thu Dec 19 20:02:04 2019
    session_id = 6412
    jobdir = /home/afernandez
    substate = 51
    Variable_List = PBS_O_SYSTEM=Linux,PBS_O_SHELL=/bin/bash,
        PBS_O_HOME=/home/afernandez,PBS_O_LOGNAME=afernandez,
        PBS_O_WORKDIR=/home/afernandez,PBS_O_LANG=en_US.UTF-8,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/cycl
        e/jetpack/bin:/opt/pbs/bin:/opt/openmpi/bin:/home/afernandez/.local/bin:/ho
        me/afernandez/bin,PBS_O_MAIL=/var/spool/mail/afernandez,PBS_O_QUEUE=workq,
        PBS_O_HOST=ip-0a000204.iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp
        .net
    comment = Job run at Thu Dec 19 at 20:02 on (ip-0A000205:ncpus=2)+(ip-0A000
        206:ncpus=2)
    etime = Thu Dec 19 19:56:49 2019
    run_count = 1
    Exit_status = 0
    Submit_arguments = HPCG31_4.sh
    pset = group_id=single
    project = _pbs_project_default

I don't understand why the output, error, and working paths are showing as the IP plus iy4jdcoj0c5exd4bbj432msnqd.xx.internal.cloudapp.net, or if this is the reason the files are not showing up in the home directory.
Thanks.

Nodes are often not unregistered from PBS

If I run azpbs autoscale, here is the output; these nodes have been deprovisioned by CycleCloud and no longer exist.
[root@scheduler hpcadmin]# azpbs autoscale
2021-04-01 09:11:18,334 ERROR: Could not convert private_ip(None) to hostname using gethostbyaddr() for SchedulerNode(254rq000000, 254rq000000, unknown, None): gethostbyaddr() argument 1 must be str, bytes or bytearray, not None
NAME HOSTNAME PBS_STATE JOB_IDS STATE VM_SIZE DISK NGPUS GROUP_ID MACHINETYPE MEM NCPUS NODEARRAY SLOT_TYPE UNGROUPED INSTANCE_ID CTR ITR
254rq000000 254rq000000 down running unknown 20.00gb/20.00gb 0/0 s_v2_pg0 456.00gb 120 unknown hb120rs_v2 false 0.0 0.0
[root@scheduler hpcadmin]#

Autoscaler does not handle badly formatted JSON qstat output well

cyclecloud-pbspro version 2.0.10

With OpenPBS 19.1.1, the JSON output of qstat can be badly formatted in the case of complex environment variables.
For example, qstat -f <job_id> -F json | jq '.' will return an error, meaning the JSON is badly formatted.
As a result, the autoscaler stops working, so a single bad job can hang the whole system and no new nodes can be added.

Here is the output of azpbs autoscale:

File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/environment.py", line 58, in from_driver
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib64/python3.6/logging/init.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 47623-47628: ordinal not in range(128)
Call stack:
  File "/root/bin/azpbs", line 4, in
    main()
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 284, in main
    clilib.main(argv or sys.argv[1:], "pbspro", PBSCLI())
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 1739, in main
    args.func(**kwargs)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 1315, in analyze
    dcalc = self._demand(config, ctx_handler=ctx_handler)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/clilib.py", line 360, in _demand
    dcalc, jobs = self._demand_calc(config, driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 113, in _demand_calc
    pbs_env = self._pbs_env(pbs_driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/cli.py", line 106, in _pbs_env
    self.__pbs_env = environment.from_driver(pbs_driver.config, pbs_driver)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/environment.py", line 58, in from_driver
    jobs = pbs_driver.parse_jobs(queues, default_scheduler.resources_for_scheduling)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/driver.py", line 414, in parse_jobs
    self.pbscmd, self.resource_definitions, queues, resources_for_scheduling
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/driver.py", line 530, in parse_jobs
    response: Dict = pbscmd.qstat_json("-f", "-t")
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 31, in qstat_json
    response = self.qstat(*args)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 25, in qstat
    return self._check_output(cmd)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/pbspro/pbscmd.py", line 76, in _check_output
    logger.info("Response: %s", ret)
  File "/usr/lib64/python3.6/logging/init.py", line 1308, in info
    self._log(INFO, msg, args, **kwargs)
  File "/opt/cycle/pbspro/venv/lib/python3.6/site-packages/hpc/autoscale/hpclogging.py", line 45, in _log
    **stacklevelkw
Message: 'Response: %s'

azpbs remove_nodes doesn't remove them from pbsnodes

To repro, manually add nodes to the PBS cluster with Add Nodes in the UI. They get joined to the cluster. Then remove them with:

azpbs remove_nodes -H ip-0A010907 -H ip-0A010908 --force

Note that --force is required. They seem to temporarily go through the "down state" but recover to free.

Does this command work?

[root@ip-0A010906 ~]# pbsnodes -aS
vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
ip-0A010907     down            --       --       ip-0a010907     --              4gb       1       0       0 --
ip-0A010908     state-unknown   --       --       ip-0a010908     --              4gb       1       0       0 --
[root@ip-0A010906 ~]# pbsnodes -aS
vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
ip-0A010907     free            --       --       ip-0a010907     --              4gb       1       0       0 --
ip-0A010908     initializing    --       --       ip-0a010908     --              4gb       1       0       0 --

Jetpack error while deploying pbspro cluster

Hello,
I'm trying to deploy a pbspro cluster with customized images. However, I keep running into the following error

Check /opt/cycle/jetpack/logs/installation.log for more information
Get more help on this issue
Detail:

Traceback (most recent call last):
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/admin/validate.py", line 27, in execute
    jetpack.util.test_cyclecloud_connection(connection)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 439, in test_cyclecloud_connection
    r = _connect_to_cyclecloud(connection, path)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 482, in _connect_to_cyclecloud
    conn, headers = jetpack.util.get_cyclecloud_connection(config)
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/util.py", line 380, in get_cyclecloud_connection
    if jetpack.config.get('cyclecloud.skip_ssl_validation', default_skip_ssl_validation):
  File "/opt/cycle/jetpack/system/embedded/lib/python2.7/site-packages/jetpack/config.py", line 29, in get
    raise ConfigError(UNKNOWN_ERROR)
ConfigError: An unknown error occurred while processing the configuration data file.

I've boiled down the customized image to an updated CentOS configuration where 'cmake' is installed. I'm unfamiliar with Jetpack, so I cannot figure out the origin of the problem or how to fix it. Thanks.

Modifying stack soft limit

Hi,
I've noticed CycleCloud recently changed the behavior for stack size limits.
Now it adds this:

$ cat /etc/security/limits.conf |grep stack
#        - stack - max stack size (KB)
*               hard    stack           unlimited
*               soft    stack           unlimited

However, I am not sure where this comes from; I can't find it in this repo, and it is not from the CentOS HPC image as far as I can tell (https://github.com/openlogic/AzureBuildCentOS).

In any case, if someone else runs into this: Abaqus, at least, does not accept unlimited as a soft limit.

Greetings
Klaas

Slot_type seems to be ignored when provisioning nodes

I have a cyclecloud cluster generated from a modified version of the PBSpro template. In it I have added different types of nodes for execution, such as memory optimized nodes and HPC nodes for heavy duty numerical simulations.

When scheduling a job that makes use of #PBS -l slot_type=name_slot, I have found that sometimes resources are allocated that do not match the configuration of the slot. The number of nodes and processes per node is respected, but the actual type of node is not, which can result in reduced performance for certain applications.

autoscaler is not adding nodes

Running a non-MPI job using “-l select=1:slot_type=execute:ungrouped=true” as the select statement.
The execute nodearray is not spot, but the autoscaler is not adding a new node.

[xpillons@ondemand ~]$ qstat -fx 1651
Job Id: 1651.scheduler
Job_Name = sys-dashboard-sys-codeserver
Job_Owner = [email protected]
job_state = Q
queue = workq
server = scheduler
Checkpoint = u
ctime = Fri Nov 19 09:56:25 2021
Error_Path = ondemand.internal.cloudapp.net:/anfhome/xpillons/ondemand/data
/sys/dashboard/batch_connect/sys/codeserver/output/c1144623-b9b5-44a2-8
5b1-93fa66a0dc14/sys-dashboard-sys-codeserver.e1651
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Fri Nov 19 09:56:25 2021
Output_Path = ondemand.internal.cloudapp.net:/anfhome/xpillons/ondemand/dat
a/sys/dashboard/batch_connect/sys/codeserver/output/c1144623-b9b5-44a2-
85b1-93fa66a0dc14/output.log
Priority = 0
qtime = Fri Nov 19 09:56:25 2021
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = scatter:excl
Resource_List.select = 1:slot_type=execute:ungrouped=true
Resource_List.slot_type = execute
Resource_List.ungrouped = false
Resource_List.walltime = 03:00:00
Shell_Path_List = /bin/bash
substate = 10
Variable_List = PBS_O_HOME=/anfhome/xpillons,PBS_O_LANG=C,
PBS_O_LOGNAME=xpillons,
PBS_O_PATH=/var/www/ood/apps/sys/dashboard/tmp/node_modules/yarn/bin:/
opt/ood/ondemand/root/usr/share/gems/2.7/bin:/opt/rh/rh-nodejs12/root/u
sr/bin:/opt/rh/rh-ruby27/root/usr/local/bin:/opt/rh/rh-ruby27/root/usr/
bin:/opt/rh/httpd24/root/usr/bin:/opt/rh/httpd24/root/usr/sbin:/opt/ood
/ondemand/root/usr/bin:/opt/ood/ondemand/root/usr/sbin:/sbin:/bin:/usr/
sbin:/usr/bin,PBS_O_MAIL=/var/mail/root,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/anfhome/xpillons/ondemand/data/sys/dashboard/batch_conn
ect/sys/codeserver/output/c1144623-b9b5-44a2-85b1-93fa66a0dc14,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=ondemand.internal.cloudapp.net
etime = Fri Nov 19 09:56:25 2021
Submit_arguments = -N sys-dashboard-sys-codeserver -S /bin/bash -o /anfhome
/xpillons/ondemand/data/sys/dashboard/batch_connect/sys/codeserver/outp
ut/c1144623-b9b5-44a2-85b1-93fa66a0dc14/output.log -j oe -l select=1:sl
ot_type=execute:ungrouped=true -l walltime=03:00:00
project = _pbs_project_default

[root@scheduler ~]# azpbs analyze --job-id 1651
NotInAPlacementGroup : Bucket[array=execute vm_size=Standard_F2s_v2 id=ac4fc82f-3d82-4f20-bd6e-67299dcdd388] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 id=4167f5f8-3a30-46df-92ba-8f77e0b636cd] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 id=aed72457-3b3c-44a0-ab8f-9ebb1a954fee] is not in a placement group
NotInAPlacementGroup : Bucket[array=hb60rs vm_size=Standard_HB60rs id=53bb8fd0-e62e-468f-b278-93da9e6fef3e] is not in a placement group
NotInAPlacementGroup : Bucket[array=hc44rs vm_size=Standard_HC44rs id=be3124ac-3894-496a-8580-ef06aba68273] is not in a placement group
NotInAPlacementGroup : Bucket[array=viz vm_size=Standard_D8s_v3 id=03e27828-b4bd-4678-a8e5-5886599bc6e5] is not in a placement group
NotInAPlacementGroup : Bucket[array=viz3d vm_size=Standard_NV6 id=ab76e8b1-0632-49a9-be16-93d880341855] is not in a placement group
InvalidOption : Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=Standard_F2s_v2 attr=ungrouped]
InvalidOption : Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=Standard_F2s_v2 attr=ungrouped]
InvalidOption : Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=Standard_HB60rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=Standard_HB60rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=Standard_HC44rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=Standard_HC44rs attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_D8s_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_D8s_v3 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_NV6 attr=slot_type]
InvalidOption : Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_NV6 attr=slot_type]

NoCandidatesFound : SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=None),reason=Bucket[array=execute vm_size=Standard_F2s_v2 id=ac4fc82f-3d82-4f20-bd6e-67299dcdd388] is not
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=None),reason=Bucket[array=hb120rs_v2 vm_size=Standard_HB120rs_v2 id=4167f5f8-3a30-46df-92ba-8f77e0
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=None),reason=Bucket[array=hb120rs_v3 vm_size=Standard_HB120rs_v3 id=aed72457-3b3c-44a0-ab8f-9ebb1a
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=None),reason=Bucket[array=hb60rs vm_size=Standard_HB60rs id=53bb8fd0-e62e-468f-b278-93da9e6fef3e] is not i
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=None),reason=Bucket[array=hc44rs vm_size=Standard_HC44rs id=be3124ac-3894-496a-8580-ef06aba68273] is not i
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=None),reason=Bucket[array=viz vm_size=Standard_D8s_v3 id=03e27828-b4bd-4678-a8e5-5886599bc6e5] is not in a pl
SatisfiedResult(status=NotInAPlacementGroup, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=None),reason=Bucket[array=viz3d vm_size=Standard_NV6 id=ab76e8b1-0632-49a9-be16-93d880341855] is not in a plac
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=Standard_F2s_v2_pg0),reason=Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=St
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=execute, vm_size=Standard_F2s_v2, pg=Standard_F2s_v2_pg1),reason=Resource[name=ungrouped value='false'] != 'true' for Bucket[array=execute vm_size=St
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=Standard_HB120rs_v2_pg0),reason=Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v2, vm_size=Standard_HB120rs_v2, pg=Standard_HB120rs_v2_pg1),reason=Resource[name=slot_type value='hb120rs_v2'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=Standard_HB120rs_v3_pg0),reason=Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb120rs_v3, vm_size=Standard_HB120rs_v3, pg=Standard_HB120rs_v3_pg1),reason=Resource[name=slot_type value='hb120rs_v3'] != 'execute' for Bucket[array
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=Standard_HB60rs_pg0),reason=Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hb60rs, vm_size=Standard_HB60rs, pg=Standard_HB60rs_pg1),reason=Resource[name=slot_type value='hb60rs'] != 'execute' for Bucket[array=hb60rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=Standard_HC44rs_pg0),reason=Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=hc44rs, vm_size=Standard_HC44rs, pg=Standard_HC44rs_pg1),reason=Resource[name=slot_type value='hc44rs'] != 'execute' for Bucket[array=hc44rs vm_size=
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=Standard_D8s_v3_pg0),reason=Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz, vm_size=Standard_D8s_v3, pg=Standard_D8s_v3_pg1),reason=Resource[name=slot_type value='viz'] != 'execute' for Bucket[array=viz vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=Standard_NV6_pg0),reason=Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_
SatisfiedResult(status=InvalidOption, node=NodeBucket(nodearray=viz3d, vm_size=Standard_NV6, pg=Standard_NV6_pg1),reason=Resource[name=slot_type value='viz3d'] != 'execute' for Bucket[array=viz3d vm_size=Standard_

[root@scheduler ~]# azpbs buckets
NODEARRAY PLACEMENT_GROUP VM_SIZE VCPU_COUNT PCPU_COUNT MEMORY AVAILABLE_COUNT NCPUS NGPUS DISK HOST SLOT_TYPE GROUP_ID MEM CCNODEID UNGROUPED
execute Standard_F2s_v2 2 1 4.00g 512 1 0 20.00g execute none 4.00g true
execute Standard_F2s_v2_pg0 Standard_F2s_v2 2 1 4.00g 100 1 0 20.00g execute Standard_F2s_v2_pg0 4.00g false
execute Standard_F2s_v2_pg1 Standard_F2s_v2 2 1 4.00g 100 1 0 20.00g execute Standard_F2s_v2_pg1 4.00g false
hb120rs_v2 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 none 456.00g true
hb120rs_v2 Standard_HB120rs_v2_pg0 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 Standard_HB120rs_v2_pg0 456.00g false
hb120rs_v2 Standard_HB120rs_v2_pg1 Standard_HB120rs_v2 120 120 456.00g 72 120 0 20.00g hb120rs_v2 Standard_HB120rs_v2_pg1 456.00g false
hb120rs_v3 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 none 448.00g true
hb120rs_v3 Standard_HB120rs_v3_pg0 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 Standard_HB120rs_v3_pg0 448.00g false
hb120rs_v3 Standard_HB120rs_v3_pg1 Standard_HB120rs_v3 120 120 448.00g 10 120 0 20.00g hb120rs_v3 Standard_HB120rs_v3_pg1 448.00g false
hb60rs Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs none 228.00g true
hb60rs Standard_HB60rs_pg0 Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs Standard_HB60rs_pg0 228.00g false
hb60rs Standard_HB60rs_pg1 Standard_HB60rs 60 60 228.00g 40 60 0 20.00g hb60rs Standard_HB60rs_pg1 228.00g false
hc44rs Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs none 352.00g true
hc44rs Standard_HC44rs_pg0 Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs Standard_HC44rs_pg0 352.00g false
hc44rs Standard_HC44rs_pg1 Standard_HC44rs 44 44 352.00g 40 44 0 20.00g hc44rs Standard_HC44rs_pg1 352.00g false
viz Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz none 32.00g true
viz Standard_D8s_v3_pg0 Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz Standard_D8s_v3_pg0 32.00g false
viz Standard_D8s_v3_pg1 Standard_D8s_v3 8 4 32.00g 50 4 0 20.00g viz Standard_D8s_v3_pg1 32.00g false
viz3d Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d none 56.00g true
viz3d Standard_NV6_pg0 Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d Standard_NV6_pg0 56.00g false
viz3d Standard_NV6_pg1 Standard_NV6 6 6 56.00g 10 6 1 20.00g viz3d Standard_NV6_pg1 56.00g false

Request for PBS Pro 19.1.2

I would like to use PBS Pro 19.1.2 to match the version used in other production environments.
Any point release of CentOS 7 is fine as long as it is a stable version.

slot_type is case sensitive

If a slot type is defined as hbv3 then the following job will never start

qsub -l select=1:slot_type=HBv3 -I

hwloc-libs RPM is no longer provided in the EPEL repo for CentOS 8

OpenPBS installers fail with the missing package hwloc-libs. This impacts CentOS 8.

The workaround is to download it from AlmaLinux:
- wget -O /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm https://repo.almalinux.org/almalinux/8.3-beta/BaseOS/x86_64/os/Packages/hwloc-libs-1.11.9-3.el8.x86_64.rpm
- yum install -y /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm
- rm -f /tmp/hwloc-libs-1.11.9-3.el8.x86_64.rpm

Can we redistribute in release artifacts?
