
p3-appliances's People

Contributors

brtkwr, dougszumski, jovial, markgoddard, mserylak, oneswig, piersharding, sjpb, wasaac


p3-appliances's Issues

test-kubernetes.sh fails on alt1 due to missing template

[stack@dev-director tests]$ OS_CLOUD=alaska-alt-1 ./test-kubernetes.sh
TASK [stackhpc.os-container-infra : Ensure container cluster is present] ****************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: magnumclient.common.apiclient.exceptions.NotFound: ClusterTemplate k8s-fedora-atomic-29 could not be found (HTTP 404) (Request-ID: req-74d591da-8018-44db-8044-8870b5e46d3c)
fatal: [localhost]: FAILED! => {"changed": false, "module_stderr": "Traceback (most recent call last):\n  File \"/home/stack/.ansible/tmp/ansible-tmp-1572626042.4-219406216608257/AnsiballZ_os_container_infra.py\", line 102, in <module>\n    _ansiballz_main()\n  File \"/home/stack/.ansible/tmp/ansible-tmp-1572626042.4-219406216608257/AnsiballZ_os_container_infra.py\", line 94, in _ansiballz_main\n    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\n  File \"/home/stack/.ansible/tmp/ansible-tmp-1572626042.4-219406216608257/AnsiballZ_os_container_infra.py\", line 40, in invoke_module\n    runpy.run_module(mod_name='ansible.modules.os_container_infra', init_globals=None, run_name='__main__', alter_sys=False)\n  File \"/usr/lib64/python2.7/runpy.py\", line 180, in run_module\n    fname, loader, pkg_name)\n  File \"/usr/lib64/python2.7/runpy.py\", line 72, in _run_code\n    exec code in run_globals\n  File \"/tmp/ansible_os_container_infra_payload_Xp4mwz/ansible_os_container_infra_payload.zip/ansible/modules/os_container_infra.py\", line 223, in <module>\n  File \"/tmp/ansible_os_container_infra_payload_Xp4mwz/ansible_os_container_infra_payload.zip/ansible/modules/os_container_infra.py\", line 101, in __init__\n  File \"/home/stack/will/p3-appliances/venv/lib/python2.7/site-packages/magnumclient/v1/basemodels.py\", line 100, in get\n    return self._list(self._path(id))[0]\n  File \"/home/stack/will/p3-appliances/venv/lib/python2.7/site-packages/magnumclient/common/base.py\", line 121, in _list\n    resp, body = self.api.json_request('GET', url)\n  File \"/home/stack/will/p3-appliances/venv/lib/python2.7/site-packages/magnumclient/common/httpclient.py\", line 368, in json_request\n    resp = self._http_request(url, method, **kwargs)\n  File \"/home/stack/will/p3-appliances/venv/lib/python2.7/site-packages/magnumclient/common/httpclient.py\", line 352, in _http_request\n    error_json.get('debuginfo'), method, url)\nmagnumclient.common.apiclient.exceptions.NotFound: ClusterTemplate k8s-fedora-atomic-29 could not be found (HTTP 404) (Request-ID: req-74d591da-8018-44db-8044-8870b5e46d3c)\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}

Missing handler when using ./test-kubernetes.sh on production

TASK [configure_ib : Ensure kernel modules persist] *************************************************************************************************************************
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=rdma_ucm)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=rdma_ucm)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=rdma_cm)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=rdma_cm)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=mlx5_core)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=mlx5_core)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=mlx5_ib)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=mlx5_ib)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=ib_core)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=ib_core)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=ib_uverbs)
changed: [test-kubernetes-7zz57s6du6yy-master-0] => (item=ib_uverbs)
changed: [test-kubernetes-7zz57s6du6yy-minion-0] => (item=ib_ipoib)
ERROR! The requested handler 'Reset interface' was not found in either the main handlers list nor in the listening handlers list

cluster-infra fails on adding a node

While running cluster-infra.yml to add a node to the openhpc cluster on alaSKA:

TASK [stackhpc.cluster-infra : Attach interfaces to servers] *********************************************************************************************************
ok: [localhost] => (item=e527c3d4-59d8-4acb-b5fa-df0b9ab2e8e4)
ok: [localhost] => (item=7c61fb21-34cc-4de2-8cbc-b30ba9c5529b)
ok: [localhost] => (item=b869d227-261b-49ac-95d1-9dd137c7b72e)
failed: [localhost] (item=6885deba-1a3d-4a8f-ada4-296a24a6f9dc) => {"ansible_loop_var": "item", "changed": false, "item": "6885deba-1a3d-4a8f-ada4-296a24a6f9dc", "msg": "NotFound()"}

openstack baremetal node list showed the node it was trying to add was in a cleaning state.

More general method required for managing kernel modules

In the OpenHPC playbook, a hard-coded list of IB kernel modules is loaded. This should be replaced with a generic way of (persistently) managing kernel modules, which could make use of a Galaxy role (or simply a local role in this repo).

There are examples elsewhere in our code base of better handling of kernel modules.
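A minimal sketch of what such a generic role's tasks might look like (the variable name kernel_modules and the modules-load.d file name are assumptions for illustration, not anything currently in the repo):

# Load the requested modules now, and persist them via modules-load.d so they
# survive reboots. 'kernel_modules' is a hypothetical role variable.
- name: Ensure kernel modules are loaded
  modprobe:
    name: "{{ item }}"
    state: present
  loop: "{{ kernel_modules }}"
  become: true

- name: Ensure kernel modules are loaded on boot
  copy:
    dest: /etc/modules-load.d/appliance.conf
    content: |
      {% for module in kernel_modules %}
      {{ module }}
      {% endfor %}
  become: true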

docker registry not set up correctly

I'm trying to use the local container registry on openhpc-login-0 but am seeing an error message: Get https://localhost:5000/v2/: http: server gave HTTP response to HTTPS client. I remember seeing something like this in December and asked Bharat about it then; he said he had to add this to /etc/docker/daemon.json:
{
  "insecure-registries": ["openhpc-login-0:5000"]
}
Not sure why but that file seems to be missing now. Please could someone with root permission on that machine add it back and I'll try again?

(From Fred via Slack.)
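A sketch of how this exception could be managed by the appliance rather than by hand, so the file doesn't go missing again (the task placement is an assumption; the registry address is taken from the snippet above, and Docker still needs a restart for the change to take effect):

# Manage /etc/docker/daemon.json so the local registry exception persists.
# The docker service must be restarted afterwards for this to apply.
- name: Allow insecure access to the local container registry
  copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "insecure-registries": ["openhpc-login-0:5000"]
      }
  become: true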

cluster-infra-configure fails if node reboots

While running cluster-infra-configure.yml to add a node to the openhpc cluster on alaSKA:

TASK [stackhpc.os-config : Create OpenStack config dir]
**************************************************************************************************************
fatal: [steveb-compute-1]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: System is booting up. See pam_nologin(8)\nAuthentication failed.", "unreachable": true}

I think the reboot flagged as the problem here was caused by package updates.
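One possible mitigation, sketched below (its placement within cluster-infra-configure.yml is an assumption): wait for the hosts to finish booting before any configuration tasks run.

# Give nodes time to finish rebooting (e.g. after package updates) before
# attempting any further configuration over SSH.
- name: Wait for nodes to become reachable
  wait_for_connection:
    delay: 10
    timeout: 600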

all packages updated by default

Running cluster-infra-configure updates all packages. This is probably undesirable, as it means CentOS 7.6 gets updated to 7.7, which isn't a supported/tested base OS for the current version of OpenHPC.

/If/ an update is really desirable as part of cluster-infra-configure, then based on the OpenHPC upgrade docs (Appendix B of the user manual) I think the update of "*" in ansible/latest-packages.yml should instead be something like "ohpc-base" and/or "ohpc-base-compute" to get the base-OS packages the ohpc packages depend on [1], plus "*-ohpc" to update the actual ohpc packages.

However, I tried this (with the aim of adding a PR) and, on adding a node to the cluster, cluster-infra-configure would always fail the first time due to unresolvable package dependencies. If I ran cluster-infra-configure with all updates commented out, the node would come up; I could then run cluster-infra-configure again with the above ohpc updates and the playbook would complete, but with no changes to packages on the new node. Maybe something else is doing a yum update (during node reboot?) which resolves the dependencies?


[1] I'm not quite clear which packages the login vs. compute nodes use.
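For illustration, a sketch of the narrower update described above, assuming ansible/latest-packages.yml uses the yum module (the exact package set, including whether ohpc-base-compute is needed, still wants confirming per footnote [1]):

# Update only the OpenHPC packages and their base-OS dependencies, rather than
# every package on the system ("*").
- name: Update OpenHPC packages
  yum:
    name:
      - ohpc-base
      - "*-ohpc"
    state: latest
  become: true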

configure_ib task fails - can't find handler

command:

ansible-playbook --vault-password-file monasca-secrets -e @config/steveb.yml -i ansible/inventory-steveb ansible/cluster-infra-configure.yml

error:

...
TASK [configure_ib : Ensure kernel modules persist]
...
ERROR! The requested handler 'Reset interface' was not found in either the main handlers list nor in the listening handlers list

However, p3-appliances/ansible/roles/configure_ib/handlers/main.yml does have a handler with this name, which includes ./reset_interface.yml. Some testing suggested the include itself was the problem; switching to import_tasks didn't fix it either.
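For reference, a reconstruction of the structure described above (a sketch, not the actual file contents):

# ansible/roles/configure_ib/handlers/main.yml (reconstructed sketch; the real
# file may use 'include', 'include_tasks' or 'import_tasks' here).
# Notifying this include-based handler is what appears to fail.
- name: Reset interface
  include_tasks: reset_interface.yml

One workaround worth trying (untested here) is to move the tasks from reset_interface.yml directly into the handler, so the notify targets plain handler tasks rather than an include.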
