stackhpc / p3-appliances — Templates and playbooks for creating middleware platforms on ALaSKA P3
p3-appliances/config/openhpc.yml (line 23 in 7ac640e): the reboot_and_wait role is run as root, i.e. latest-packages.yml runs the reboot_and_wait role with become. If the user doesn't have sudo rights on the deployment host, the two local_action tasks in p3-appliances/ansible/roles/reboot_and_wait/tasks/main.yml fail.
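One possible fix (a sketch only; the task content here is illustrative, not copied from the role) is to disable privilege escalation on the delegated tasks, so they run as the invoking user on the deployment host regardless of the play-level become:

```yaml
# Sketch: override become on the local_action tasks so they do not
# require sudo on the deployment host. Parameters are illustrative.
- name: Wait for the rebooted host to come back
  local_action:
    module: wait_for
    host: "{{ ansible_host | default(inventory_hostname) }}"
    port: 22
    delay: 30
    timeout: 600
  become: false
```

Since wait_for only polls a TCP port, it needs no root privileges on the machine it runs from.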
Added so far: bison, flex, and gcc-gfortran. blas-devel and lapack-devel aren't required in this case.
[stack@dev-director tests]$ OS_CLOUD=alaska-alt-1 ./test-kubernetes.sh
TASK [stackhpc.os-container-infra : Ensure container cluster is present] ****************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: magnumclient.common.apiclient.exceptions.NotFound: ClusterTemplate k8s-fedora-atomic-29 could not be found (HTTP 404) (Request-ID: req-74d591da-8018-44db-8044-8870b5e46d3c)
fatal: [localhost]: FAILED! => {"changed": false, "module_stderr": "Traceback (most recent call last):\n File \"/home/stack/.ansible/tmp/ansible-tmp-1572626042.4-219406216608257/AnsiballZ_os_container_infra.py\", line 102, in <module>\n _ansiballz_main()\n File \"/home/stack/.ansible/tmp/ansible-tmp-1572626042.4-219406216608257/AnsiballZ_os_container_infra.py\", line 94, in _ansiballz_main\n invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\n File \"/home/stack/.ansible/tmp/ansible-tmp-1572626042.4-219406216608257/AnsiballZ_os_container_infra.py\", line 40, in invoke_module\n runpy.run_module(mod_name='ansible.modules.os_container_infra', init_globals=None, run_name='__main__', alter_sys=False)\n File \"/usr/lib64/python2.7/runpy.py\", line 180, in run_module\n fname, loader, pkg_name)\n File \"/usr/lib64/python2.7/runpy.py\", line 72, in _run_code\n exec code in run_globals\n File \"/tmp/ansible_os_container_infra_payload_Xp4mwz/ansible_os_container_infra_payload.zip/ansible/modules/os_container_infra.py\", line 223, in <module>\n File \"/tmp/ansible_os_container_infra_payload_Xp4mwz/ansible_os_container_infra_payload.zip/ansible/modules/os_container_infra.py\", line 101, in __init__\n File \"/home/stack/will/p3-appliances/venv/lib/python2.7/site-packages/magnumclient/v1/basemodels.py\", line 100, in get\n return self._list(self._path(id))[0]\n File \"/home/stack/will/p3-appliances/venv/lib/python2.7/site-packages/magnumclient/common/base.py\", line 121, in _list\n resp, body = self.api.json_request('GET', url)\n File \"/home/stack/will/p3-appliances/venv/lib/python2.7/site-packages/magnumclient/common/httpclient.py\", line 368, in json_request\n resp = self._http_request(url, method, **kwargs)\n File \"/home/stack/will/p3-appliances/venv/lib/python2.7/site-packages/magnumclient/common/httpclient.py\", line 352, in _http_request\n error_json.get('debuginfo'), method, url)\nmagnumclient.common.apiclient.exceptions.NotFound: ClusterTemplate k8s-fedora-atomic-29 
could not be found (HTTP 404) (Request-ID: req-74d591da-8018-44db-8044-8870b5e46d3c)\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
For example, if a secret is not found (e.g. if it has been misnamed), the Ansible output should describe more clearly what went wrong.
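One way to get a clearer failure (a sketch; the variable name and default are illustrative) is to check for the cluster template up front and fail with a readable message, rather than letting the magnumclient traceback surface:

```yaml
# Sketch: fail early with a human-readable message if the Magnum
# cluster template is missing. Variable names are illustrative.
- name: Look up the cluster template
  command: openstack coe cluster template show "{{ cluster_template }}"
  register: template_lookup
  failed_when: false
  changed_when: false

- name: Fail clearly if the template is missing
  fail:
    msg: >-
      Cluster template '{{ cluster_template }}' was not found in this
      cloud. Check the name in the config, and list available templates
      with 'openstack coe cluster template list'.
  when: template_lookup.rc != 0
```

The same pattern would work for any resource whose absence currently produces an HTTP 404 traceback.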
TASK [configure_ib : Ensure kernel modules persist] *************************************************************************************************************************
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=rdma_ucm)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=rdma_ucm)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=rdma_cm)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=rdma_cm)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=mlx5_core)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=mlx5_core)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=mlx5_ib)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=mlx5_ib)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=ib_core)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=ib_core)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=ib_uverbs)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=ib_uverbs)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=ib_ipoib)
ERROR! The requested handler 'Reset interface' was not found in either the main handlers list nor in the listening handlers list
While running cluster-infra.yml to add a node to the OpenHPC cluster on ALaSKA:
TASK [stackhpc.cluster-infra : Attach interfaces to servers] *********************************************************************************************************
ok: [localhost] => (item=e527c3d4-59d8-4acb-b5fa-df0b9ab2e8e4)
ok: [localhost] => (item=7c61fb21-34cc-4de2-8cbc-b30ba9c5529b)
ok: [localhost] => (item=b869d227-261b-49ac-95d1-9dd137c7b72e)
failed: [localhost] (item=6885deba-1a3d-4a8f-ada4-296a24a6f9dc) => {"ansible_loop_var": "item", "changed": false, "item": "6885deba-1a3d-4a8f-ada4-296a24a6f9dc", "msg": "NotFound()"}
openstack baremetal node list showed that the node it was trying to add was in a cleaning state.
In the OpenHPC playbook, a hard-coded list of IB kernel modules is loaded. This should be replaced with a generic approach to (persistently) managing kernel modules, either via a Galaxy role or simply a local role in the repo. There are examples elsewhere in our code base of better handling of kernel modules.
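A generic version might look something like this (a sketch; the variable name and the modules-load.d filename are illustrative, not taken from the repo):

```yaml
# Sketch: load kernel modules from a variable and make them persist
# across reboots via modules-load.d. Names are illustrative.
- name: Load kernel modules
  modprobe:
    name: "{{ item }}"
    state: present
  loop: "{{ ib_kernel_modules }}"

- name: Persist kernel modules across reboots
  copy:
    dest: /etc/modules-load.d/infiniband.conf
    content: |
      {% for mod in ib_kernel_modules %}
      {{ mod }}
      {% endfor %}
```

Driving both tasks from one list variable avoids the current duplication between loading and persisting the modules.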
I'm trying to use the local container registry on openhpc-login-0 but am seeing an error message: "Get https://localhost:5000/v2/: http: server gave HTTP response to HTTPS client". I remember seeing something like this in December and asked Bharat about it then; he said he had to add this to /etc/docker/daemon.json:
{
"insecure-registries": ["openhpc-login-0:5000"]
}
Not sure why but that file seems to be missing now. Please could someone with root permission on that machine add it back and I'll try again?
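To stop the file going missing again, the setting could be managed from the appliance itself (a sketch; the registry name matches the snippet above, the task names are illustrative):

```yaml
# Sketch: manage /etc/docker/daemon.json from Ansible so the
# insecure-registries setting survives rebuilds.
- name: Configure Docker to trust the local registry
  copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "insecure-registries": ["openhpc-login-0:5000"]
      }
  become: true
  register: docker_daemon_json

- name: Restart Docker if the config changed
  service:
    name: docker
    state: restarted
  become: true
  when: docker_daemon_json.changed
```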
From Fred, via Slack:
While running cluster-infra-configure.yml to add a node to the OpenHPC cluster on ALaSKA:
TASK [stackhpc.os-config : Create OpenStack config dir]
**************************************************************************************************************
fatal: [steveb-compute-1]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: System is booting up. See pam_nologin(8)\nAuthentication failed.", "unreachable": true}
I think the reboot flagged as the problem here was caused by package updates.
Running cluster-infra-configure updates all packages. This is probably undesirable, as it means CentOS 7.6 gets updated to 7.7, which isn't a supported/tested base OS for the current version of OpenHPC.
/If/ an update is really desirable as part of cluster-infra-configure, then per the OpenHPC upgrade docs (Appendix B of the user manual) I think that update "*" in ansible/latest-packages.yml should instead be something like "ohpc-base" and/or "ohpc-base-compute" (to pull in the base-OS packages the OpenHPC packages depend on1), plus "*-ohpc" to update the actual OpenHPC packages.
However, I tried this (with the aim of adding a PR) and, on adding a node to the cluster, cluster-infra-configure would always fail the first time due to unresolvable package dependencies. If I ran cluster-infra-configure with all updates commented out, the node would come up; I could then run cluster-infra-configure again with the above ohpc updates and the playbook would complete, but with no changes to packages on the new node. Maybe something else is doing a yum update (during node reboot?) which resolves the dependencies?
1 I'm not quite clear which packages the login vs. compute nodes use.
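Under those assumptions, the change to ansible/latest-packages.yml might look roughly like this (a sketch; the exact task in the repo may be shaped differently, and the package split between login and compute nodes is the open question above):

```yaml
# Sketch: update only OpenHPC packages and their base-OS dependencies,
# rather than every package on the system ('update "*"').
- name: Update OpenHPC packages only
  yum:
    name:
      - ohpc-base
      - ohpc-base-compute
      - "*-ohpc"
    state: latest
```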
command:
ansible-playbook --vault-password-file monasca-secrets -e @config/steveb.yml -i ansible/inventory-steveb ansible/cluster-infra-configure.yml
error:
...
TASK [configure_ib : Ensure kernel modules persist]
...
ERROR! The requested handler 'Reset interface' was not found in either the main handlers list nor in the listening handlers list
However, p3-appliances/ansible/roles/configure_ib/handlers/main.yml does have a handler with this name, which includes ./reset_interface.yml. Some testing suggested the include was the problem; switching to import_tasks didn't fix it either.