ansible_collection_slurm_openstack_tools's Introduction

Ansible Collection - stackhpc.slurm_tools

Tools to add functionality to a Slurm-based OpenHPC cluster on OpenStack created by stackhpc.openhpc.

Roles

  • stackhpc.slurm_openstack_tools.test: Test MPI functionality - README.
  • stackhpc.slurm_openstack_tools.rebuild: Add Slurm-controlled rebuild/reimage capability - README.
  • stackhpc.slurm_openstack_tools.pytools: Add Python utilities used by other roles - README.
  • stackhpc.slurm_openstack_tools.slurm-stats: Configures a tool to transform sacct output for import into Elasticsearch/Loki - README.
  • stackhpc.slurm_openstack_tools.autoscale: Add Slurm autoscaling functionality - README.

ansible_collection_slurm_openstack_tools's People

Contributors

jovial, sjpb

ansible_collection_slurm_openstack_tools's Issues

rebuild role fails

    - import_role:
        name: stackhpc.slurm_openstack_tools.rebuild
MSG:

non-zero return code
fatal: [hpc-1]: FAILED! => {
    "changed": true,
    "cmd": "source /opt/slurm-tools/bin/activate\npip install pip\npip install git+https://github.com/stackhpc/slurm-openstack-tools.git\n",
    "delta": "0:00:34.267451",
    "end": "2021-03-10 13:31:59.216334",
    "rc": 1,
    "start": "2021-03-10 13:31:24.948883"
}

STDOUT:
<snip>
Collecting cryptography>=2.7 (from openstacksdk>=0.48.0->slurm-openstack-tools==1.0.1.dev7)
  Downloading https://files.pythonhosted.org/packages/fa/2d/2154d8cb773064570f48ec0b60258a4522490fcb115a6c7c9423482ca993/cryptography-3.4.6.tar.gz (546kB)
    Complete output from command python setup.py egg_info:
    
            =============================DEBUG ASSISTANCE==========================
            If you are seeing an error here please try the following to
            successfully install cryptography:
    
            Upgrade to the latest pip and try again. This will fix errors for most
            users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
            =============================DEBUG ASSISTANCE==========================
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-wybs3vmy/cryptography/setup.py", line 14, in <module>
        from setuptools_rust import RustExtension
    ModuleNotFoundError: No module named 'setuptools_rust'

tests - unmount /opt not happening

Failures are ignored on this step because the Ansible mount task with state: absent fails when /opt is not empty. However, it appears that the unmount itself is not actually happening.

pytools virtualenv install fails: No module named 'setuptools_rust'

Collecting cryptography>=2.7 (from openstacksdk>=0.48.0->slurm-openstack-tools==1.0.1.dev7)
  Downloading https://files.pythonhosted.org/packages/27/5a/007acee0243186123a55423d49cbb5c15cb02d76dd1b6a27659a894b13a2/cryptography-3.4.4.tar.gz (545kB)
    100% |████████████████████████████████| 552kB 3.2MB/s 
    Complete output from command python setup.py egg_info:
    
            =============================DEBUG ASSISTANCE==========================
            If you are seeing an error here please try the following to
            successfully install cryptography:
    
            Upgrade to the latest pip and try again. This will fix errors for most
            users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
            =============================DEBUG ASSISTANCE==========================
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-w6y77qjc/cryptography/setup.py", line 14, in <module>
        from setuptools_rust import RustExtension
    ModuleNotFoundError: No module named 'setuptools_rust'
   
Upgrading to the latest pip seems to fix this issue. Patch incoming...
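
A minimal sketch of that workaround, assuming the install is driven from Python. Note that the failing command in the "rebuild role fails" report runs `pip install pip`, which leaves an already-installed pip untouched; the `--upgrade` flag is what asks for a newer version. The helper name `pip_install_commands` is hypothetical, and it only builds the commands rather than running them.

```python
import os


def pip_install_commands(venv_dir, package):
    """Return the argv lists to run, in order, inside an existing virtualenv.

    Hypothetical helper for illustration: it builds commands, it does not
    execute them.
    """
    pip = os.path.join(venv_dir, "bin", "pip")
    return [
        # Upgrade pip first, as the cryptography debug message suggests.
        [pip, "install", "--upgrade", "pip"],
        # Then install the package with the upgraded pip.
        [pip, "install", package],
    ]
```

Each returned list could be passed to `subprocess.run(argv, check=True)`, using the virtualenv path and package URL from the failing command.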

Rebuild: move reboot to controller

Use slurm.conf parameter SlurmctldParameters=reboot_from_controller (available in ohpc v2 at least):

Run the RebootProgram from the controller instead of on the slurmds. The RebootProgram will be passed a comma-separated list of nodes to reboot.

This avoids copying the clouds.yaml file out to the compute nodes.
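
As a rough sketch of what a controller-side RebootProgram entry point might look like, the following splits the comma-separated node list it is passed. The function names are hypothetical, the per-node action is a placeholder, and it assumes node names arrive as plain names rather than bracketed hostlist expressions.

```python
import sys


def parse_node_list(arg):
    """Split a comma-separated node list into individual node names."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def main(argv):
    """Hypothetical entry point: argv[1] is the node list from slurmctld."""
    if len(argv) != 2:
        print("usage: reboot-program NODE[,NODE...]", file=sys.stderr)
        return 2
    for node in parse_node_list(argv[1]):
        # Placeholder: a real program would reboot or rebuild the
        # OpenStack server backing this node from the controller.
        print(f"would reboot {node}")
    return 0
```

Because this runs only on the controller, the OpenStack credentials would need to exist there alone.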

tests: Autodetect HPL blocksize NB

With the default NB of 192 on a Cascade Lake system, where NB should be 384, HPL lost ~10% performance. There is no clean or robust way to autodetect the CPU architecture, though. Detection would also need to handle cases such as VMs, where /proc/cpuinfo contains something generic like "Broadwell", and the tests should at least still run on non-Intel or unknown CPUs.

Start with something like cat /proc/cpuinfo | grep "model name" | uniq
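
A minimal sketch of that starting point in Python, falling back to the default of 192 when nothing matches. Only the 384 value for Cascade Lake comes from this issue; the regular expression used to recognise Cascade Lake model names (second-generation Xeon Scalable part numbers of the form x2xx) is an assumption, and the names `NB_BY_MODEL`, `model_names` and `guess_nb` are hypothetical.

```python
import re

DEFAULT_NB = 192  # the current default

# (pattern, NB) pairs tried in order. The pattern is an assumed way of
# spotting Cascade Lake from the model name; extend the table as needed.
NB_BY_MODEL = [
    (re.compile(r"Xeon\(R\) (?:Platinum|Gold|Silver|Bronze) \d2\d\d"), 384),
]


def model_names(cpuinfo_text):
    """Return the distinct 'model name' values, in first-seen order."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            value = value.strip()
            if value not in names:
                names.append(value)
    return names


def guess_nb(cpuinfo_text, table=NB_BY_MODEL, default=DEFAULT_NB):
    """Return the NB for the first matching model, else the default."""
    for name in model_names(cpuinfo_text):
        for pattern, nb in table:
            if pattern.search(name):
                return nb
    # Generic VM names, non-Intel CPUs and unknown models land here.
    return default
```

In use, the text would come from reading `/proc/cpuinfo`; a generic name such as "Broadwell" simply gets the default, so the tests still run.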

tests - mount /opt without using stackhpc.nfs

This leaves an entry in fstab, which isn't really what we want for a temporary mountpoint. We also know that the source and target (/opt for both) will exist, so we don't really need the machinery that it or the Ansible mount module provides.

rebuild fails quietly if ID is wrong

If a user tries to rebuild with an incorrect image ID (e.g. using the image name instead of the ID), the rebuild fails fairly silently: Slurm shows the node going down and coming back up, but in fact the node doesn't even reboot; it's just Slurm timeouts in action.

It's not a completely silent failure, as the compute node's log does show e.g.:

Sep 17 10:10:49 alaska-compute-0 journal[111230]: rebuilding openstack server
Sep 17 10:10:52 alaska-compute-0 journal[111230]: user requested image:%ohpc-compute-210917-0822.qcow2
Sep 17 10:10:52 alaska-compute-0 journal[111230]: rebuilding server %2b896336-ab81-4498-bdde-75e311966ce9 with image %ohpc-compute-210917-0822.qcow2
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]: Traceback (most recent call last):
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:  File "/opt/slurm-tools/bin/slurm-openstack-rebuild", line 10, in <module>
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:    sys.exit(rebuild_or_reboot())
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:  File "/opt/slurm-tools/lib64/python3.6/site-packages/slurm_openstack_tools/reboot.py", line 141, in rebuild_or_reboot
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:    rebuild_openstack_server(server_uuid, reason)
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:  File "/opt/slurm-tools/lib64/python3.6/site-packages/slurm_openstack_tools/reboot.py", line 97, in rebuild_openstack_server
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:    conn.rebuild_server(server_id, image_uuid)
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:  File "/opt/slurm-tools/lib64/python3.6/site-packages/openstack/cloud/_compute.py", line 1118, in rebuild_server
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:    error_message="Error in rebuilding instance")
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:  File "/opt/slurm-tools/lib64/python3.6/site-packages/openstack/proxy.py", line 647, in _json_response
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:    exceptions.raise_from_response(response, error_message=error_message)
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:  File "/opt/slurm-tools/lib64/python3.6/site-packages/openstack/exceptions.py", line 238, in raise_from_response
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]:    http_status=http_status, request_id=request_id
Sep 17 10:10:52 alaska-compute-0 slurmd[107404]: openstack.exceptions.BadRequestException: Error in rebuilding instance: Client Error for url: https://arcus.openstack.hpc.cam.ac.uk:8774/v2.1/servers/2b896336-ab81-4498-bdde-75e311966ce9/action, Invalid input for field/attribute imageRef. Value: ohpc-compute-210917-0822.qcow2. u'ohpc-compute-210917-0822.qcow2' is not a 'uuid'

but that's not very obvious!

It should fail the reboot and NOT show the node going down in sinfo.
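
One possible first step, sketched here under the assumption that the check happens before any OpenStack call, is to reject a value that is not a UUID up front. The function names are hypothetical and this does not by itself stop Slurm from marking the node down; it only turns the buried traceback into an early, explicit error.

```python
import uuid


def is_uuid(value):
    """Return True only for a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def check_image_id(image):
    """Raise ValueError early rather than sending a non-UUID to the API."""
    if not is_uuid(image):
        raise ValueError(
            f"image {image!r} is not a UUID; pass the image ID, not its name"
        )
    return image
```

With the values from the log above, the server ID passes and the image name `ohpc-compute-210917-0822.qcow2` is rejected.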
