
ci-artifacts's Introduction

⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠

This repository has been deprecated in favor of openshift-psap/topsail . All the ci-artifacts work is continuing there.

⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠⚠

This repository contains Ansible roles and playbooks for OpenShift, automating the interactions with the OpenShift operators under the responsibility of the Red Hat PSAP team.

  • Performance & Scale for AI Platforms

To date, this includes:

  • NVIDIA GPU Operator (most of the repository relates to the deployment, testing and interactions with this operator)
  • the Special Resource Operator (deployment and testing currently under development)
  • the Node Feature Discovery
  • the Node Tuning Operator

The OpenShift versions we support are 4.N+1, 4.N, 4.N-1 and 4.N-2, where 4.N is the latest released version. So, as of July 2021, we need to support 4.9 (master), 4.8 (GA), 4.7 and 4.6.

Documentation

See the documentation pages.

Dependencies

Requirements:

  • See requirements.txt for reference
pip3 install -r requirements.txt
dnf install jq
  • OpenShift Client (oc)
wget --quiet https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/latest/openshift-client-linux.tar.gz
tar xf openshift-client-linux.tar.gz oc
  • An OpenShift cluster accessible with $KUBECONFIG properly set
oc version # fails if the cluster is not reachable

Prow CI

The original purpose of this repository was to perform nightly testing of the OpenShift operators under the team's responsibility.

This CI testing is performed by the OpenShift Prow instance. It is controlled by the configuration files located in these directories:

The main configuration is written in the config directory, and jobs are then generated with make ci-operator-config jobs. Secondary configuration options can then be modified in the jobs directory.

The Prow CI jobs run in an OpenShift Pod. The ContainerFile is used to build their base image, and the run prow ... command is used as the entrypoint.

From this entrypoint, we trigger the different high-level tasks of the operator end-to-end testing, eg:

  • run prow gpu-operator test_master_branch
  • run prow gpu-operator test_operatorhub
  • run prow gpu-operator validate_deployment_post_upgrade
  • run prow gpu-operator cleanup_cluster
  • run prow cluster upgrade

These different high-level tasks rely on the toolbox scripts to automate the deployment of the required dependencies (eg, the NFD operator), the deployment of the operator from its published manifest or from its development repository and its non-regression testing.

CI Dashboard

The artifacts generated during the nightly CI testing are reused to plot a "testing dashboard" that gives an overview of the last days of testing. The generation of this page is performed by the ci-dashboard repository.

Currently, only the GPU Operator results are exposed in this dashboard:

GPU Operator CI Dashboard

PSAP Operators Toolbox

The PSAP Operators Toolbox is a set of tools, originally written for CI automation, that turned out to be useful for a broader scope. It automates different operations on OpenShift clusters and operators revolving around PSAP activities: entitlement, scale-up of GPU nodes, deployment of the NFD, SRO and NVIDIA GPU Operators, as well as their configuration and troubleshooting.

The entrypoint for the toolbox is the ./run_toolbox.py at the root of this repository. Run it without any arguments to see the list of available commands.

The functionalities of the toolbox commands are described in the documentation page.
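
For example, a typical end-to-end sequence, using only commands documented in the reference below, might look like this (an illustrative sketch, not a prescribed workflow; the entitlement PEM path is a placeholder):

./run_toolbox.py entitlement deploy /path/to/entitlement.pem   # if driver entitlement is required
./run_toolbox.py cluster set_scale g4dn.xlarge 1               # ensure one GPU node is available
./run_toolbox.py nfd_operator deploy_from_operatorhub          # deploy NFD so the GPU node gets labeled
./run_toolbox.py gpu_operator deploy_from_operatorhub          # deploy the GPU Operator
./run_toolbox.py gpu_operator wait_deployment                  # wait for the operator to be fully deployed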

Available Toolbox Commands

cluster

./run_toolbox.py cluster capture_environment

NAME
    run_toolbox.py cluster capture_environment - Captures the cluster environment

SYNOPSIS
    run_toolbox.py cluster capture_environment -

DESCRIPTION
    Captures the cluster environment

./run_toolbox.py cluster set_scale

NAME
    run_toolbox.py cluster set_scale - Ensures that the cluster has exactly `scale` nodes with instance_type `instance_type`

SYNOPSIS
    run_toolbox.py cluster set_scale INSTANCE_TYPE SCALE <flags>

DESCRIPTION
    If the machinesets of the given instance type already have the required total number of replicas,
    their replica parameters will not be modified.
    Otherwise,
    - If there's only one machineset with the given instance type, its replicas will be set to the value of this parameter.

    - If there are other machinesets with non-zero replicas, the playbook will fail, unless the 'force_scale' parameter is
    set to true. In that case, the number of replicas of the other machinesets will be zeroed before setting the replicas
    of the first machineset to the value of this parameter.

POSITIONAL ARGUMENTS
    INSTANCE_TYPE
        The instance type to use, for example, g4dn.xlarge
    SCALE
        The number of required nodes with given instance type

FLAGS
    --force=FORCE
        Default: False

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py cluster upgrade_to_image

NAME
    run_toolbox.py cluster upgrade_to_image - Upgrades the cluster to the given image

SYNOPSIS
    run_toolbox.py cluster upgrade_to_image IMAGE

DESCRIPTION
    Upgrades the cluster to the given image

POSITIONAL ARGUMENTS
    IMAGE
        The image to upgrade the cluster to

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

entitlement

./run_toolbox.py entitlement deploy

NAME
    run_toolbox.py entitlement deploy - Deploys a cluster-wide entitlement key & RHSM config file (and optionally a YUM repo certificate) with the help of MachineConfig resources.

SYNOPSIS
    run_toolbox.py entitlement deploy PEM <flags>

DESCRIPTION
    Deploys a cluster-wide entitlement key & RHSM config file (and optionally a YUM repo certificate) with the help of MachineConfig resources.

POSITIONAL ARGUMENTS
    PEM
        Entitlement PEM file

FLAGS
    --pem_ca=PEM_CA
        Type: Optional[]
        Default: None
        YUM repo certificate

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py entitlement inspect

NAME
    run_toolbox.py entitlement inspect - Inspects the cluster entitlement

SYNOPSIS
    run_toolbox.py entitlement inspect -

DESCRIPTION
    Inspects the cluster entitlement

./run_toolbox.py entitlement test_cluster

NAME
    run_toolbox.py entitlement test_cluster - Tests the cluster entitlement

SYNOPSIS
    run_toolbox.py entitlement test_cluster <flags>

DESCRIPTION
    Tests the cluster entitlement

FLAGS
    --no_inspect=NO_INSPECT
        Default: False
        Do not inspect on failure

./run_toolbox.py entitlement test_in_cluster

NAME
    run_toolbox.py entitlement test_in_cluster - Tests a given PEM entitlement key on a cluster

SYNOPSIS
    run_toolbox.py entitlement test_in_cluster PEM_KEY

DESCRIPTION
    Tests a given PEM entitlement key on a cluster

POSITIONAL ARGUMENTS
    PEM_KEY
        The PEM entitlement key to test

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py entitlement test_in_podman

NAME
    run_toolbox.py entitlement test_in_podman - Tests a given PEM entitlement key using a podman container

SYNOPSIS
    run_toolbox.py entitlement test_in_podman PEM_KEY

DESCRIPTION
    Tests a given PEM entitlement key using a podman container

POSITIONAL ARGUMENTS
    PEM_KEY
        The PEM entitlement key to test

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py entitlement undeploy

NAME
    run_toolbox.py entitlement undeploy - Undeploys entitlement from cluster

SYNOPSIS
    run_toolbox.py entitlement undeploy -

DESCRIPTION
    Undeploys entitlement from cluster

./run_toolbox.py entitlement wait

NAME
    run_toolbox.py entitlement wait - Waits for entitlement to be deployed

SYNOPSIS
    run_toolbox.py entitlement wait -

DESCRIPTION
    Waits for entitlement to be deployed

gpu_operator

./run_toolbox.py gpu_operator bundle_from_commit

NAME
    run_toolbox.py gpu_operator bundle_from_commit - Builds an image of the GPU Operator from sources (<git repository> <git reference>) and pushes it to quay.io <quay_image_name>:operator_bundle_gpu-operator-<gpu_operator_image_tag_uid> using the <quay_push_secret> credentials.

SYNOPSIS
    run_toolbox.py gpu_operator bundle_from_commit GIT_REPO GIT_REF QUAY_PUSH_SECRET QUAY_IMAGE_NAME <flags>

DESCRIPTION
    Example parameters - https://github.com/NVIDIA/gpu-operator.git master /path/to/quay_secret.yaml quay.io/org/image_name

    See 'oc get imagestreamtags -n gpu-operator-ci -oname' for the tag-uid to reuse.

POSITIONAL ARGUMENTS
    GIT_REPO
        Git repository URL to generate bundle of
    GIT_REF
        Git ref to bundle
    QUAY_PUSH_SECRET
        A Kube Secret YAML file with `.dockerconfigjson` data and type kubernetes.io/dockerconfigjson
    QUAY_IMAGE_NAME

FLAGS
    --tag_uid=TAG_UID
        Type: Optional[]
        Default: None
        The image tag suffix to use.

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py gpu_operator capture_deployment_state

NAME
    run_toolbox.py gpu_operator capture_deployment_state - Captures the GPU operator deployment state

SYNOPSIS
    run_toolbox.py gpu_operator capture_deployment_state -

DESCRIPTION
    Captures the GPU operator deployment state

./run_toolbox.py gpu_operator cleanup_bundle_from_commit

NAME
    run_toolbox.py gpu_operator cleanup_bundle_from_commit - Cleanup resources leftover from building a bundle from a commit

SYNOPSIS
    run_toolbox.py gpu_operator cleanup_bundle_from_commit -

DESCRIPTION
    Cleanup resources leftover from building a bundle from a commit

./run_toolbox.py gpu_operator deploy_cluster_policy

NAME
    run_toolbox.py gpu_operator deploy_cluster_policy - Create the ClusterPolicy from the CSV

SYNOPSIS
    run_toolbox.py gpu_operator deploy_cluster_policy -

DESCRIPTION
    Create the ClusterPolicy from the CSV

./run_toolbox.py gpu_operator deploy_from_bundle

NAME
    run_toolbox.py gpu_operator deploy_from_bundle - Deploys the GPU Operator from a bundle

SYNOPSIS
    run_toolbox.py gpu_operator deploy_from_bundle <flags>

DESCRIPTION
    Deploys the GPU Operator from a bundle

FLAGS
    --bundle=BUNDLE
        Type: Optional[]
        Default: None

./run_toolbox.py gpu_operator deploy_from_commit

NAME
    run_toolbox.py gpu_operator deploy_from_commit - Deploys the GPU operator from the given git commit

SYNOPSIS
    run_toolbox.py gpu_operator deploy_from_commit GIT_REPOSITORY GIT_REFERENCE <flags>

DESCRIPTION
    Deploys the GPU operator from the given git commit

POSITIONAL ARGUMENTS
    GIT_REPOSITORY
        The git repository to deploy from, e.g. https://github.com/NVIDIA/gpu-operator.git
    GIT_REFERENCE
        The git ref to deploy from, e.g. master

FLAGS
    --tag_uid=TAG_UID
        Type: Optional[]
        Default: None
        The GPU operator image tag UID. See 'oc get imagestreamtags -n gpu-operator-ci -oname' for the tag-uid to reuse

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py gpu_operator deploy_from_operatorhub

NAME
    run_toolbox.py gpu_operator deploy_from_operatorhub - Deploys the GPU operator from OperatorHub

SYNOPSIS
    run_toolbox.py gpu_operator deploy_from_operatorhub <flags>

DESCRIPTION
    Deploys the GPU operator from OperatorHub

FLAGS
    --version=VERSION
        Type: Optional[]
        Default: None
        The version to deploy. If unspecified, deploys the latest version available in OperatorHub. Run the toolbox gpu_operator list_version_from_operator_hub subcommand to see the available versions.
    --channel=CHANNEL
        Type: Optional[]
        Default: None
        Optional channel to deploy from.

./run_toolbox.py gpu_operator run_gpu_burn

NAME
    run_toolbox.py gpu_operator run_gpu_burn - Runs the GPU burn on the cluster

SYNOPSIS
    run_toolbox.py gpu_operator run_gpu_burn <flags>

DESCRIPTION
    Runs the GPU burn on the cluster

FLAGS
    --runtime=RUNTIME
        Type: Optional[]
        Default: None
        How long to run the GPU burn for, in seconds

./run_toolbox.py gpu_operator set_repo_config

NAME
    run_toolbox.py gpu_operator set_repo_config - Sets the GPU-operator driver yum repo configuration file

SYNOPSIS
    run_toolbox.py gpu_operator set_repo_config REPO_FILE <flags>

DESCRIPTION
    Sets the GPU-operator driver yum repo configuration file

POSITIONAL ARGUMENTS
    REPO_FILE
        Absolute path to the repo file

FLAGS
    --dest_dir=DEST_DIR
        Type: Optional[]
        Default: None
        The destination dir in the pod to place the repo in

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py gpu_operator undeploy_from_commit

NAME
    run_toolbox.py gpu_operator undeploy_from_commit - Undeploys a GPU-operator that was deployed from a commit

SYNOPSIS
    run_toolbox.py gpu_operator undeploy_from_commit -

DESCRIPTION
    Undeploys a GPU-operator that was deployed from a commit

./run_toolbox.py gpu_operator undeploy_from_operatorhub

NAME
    run_toolbox.py gpu_operator undeploy_from_operatorhub - Undeploys a GPU-operator that was deployed from OperatorHub

SYNOPSIS
    run_toolbox.py gpu_operator undeploy_from_operatorhub -

DESCRIPTION
    Undeploys a GPU-operator that was deployed from OperatorHub

./run_toolbox.py gpu_operator wait_deployment

NAME
    run_toolbox.py gpu_operator wait_deployment - Waits for the GPU operator to deploy

SYNOPSIS
    run_toolbox.py gpu_operator wait_deployment -

DESCRIPTION
    Waits for the GPU operator to deploy

local_ci

./run_toolbox.py local_ci cleanup

NAME
    run_toolbox.py local_ci cleanup - Clean the local CI artifacts

SYNOPSIS
    run_toolbox.py local_ci cleanup -

DESCRIPTION
    Clean the local CI artifacts

./run_toolbox.py local_ci deploy

NAME
    run_toolbox.py local_ci deploy - Runs a given CI command

SYNOPSIS
    run_toolbox.py local_ci deploy CI_COMMAND GIT_REPOSITORY GIT_REFERENCE <flags>

DESCRIPTION
    Runs a given CI command

POSITIONAL ARGUMENTS
    CI_COMMAND
        The CI command to run, for example "run gpu-ci"
    GIT_REPOSITORY
        The git repository to run the command from, e.g. https://github.com/openshift-psap/ci-artifacts.git
    GIT_REFERENCE
        The git ref to run the command from, e.g. master

FLAGS
    --tag_uid=TAG_UID
        Type: Optional[]
        Default: None
        The local CI image tag UID

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

nfd

./run_toolbox.py nfd has_gpu_nodes

NAME
    run_toolbox.py nfd has_gpu_nodes - Checks if the cluster has GPU nodes

SYNOPSIS
    run_toolbox.py nfd has_gpu_nodes -

DESCRIPTION
    Checks if the cluster has GPU nodes

./run_toolbox.py nfd has_labels

NAME
    run_toolbox.py nfd has_labels - Checks if the cluster has NFD labels

SYNOPSIS
    run_toolbox.py nfd has_labels -

DESCRIPTION
    Checks if the cluster has NFD labels

./run_toolbox.py nfd wait_gpu_nodes

NAME
    run_toolbox.py nfd wait_gpu_nodes - Waits until NFD finds GPU nodes

SYNOPSIS
    run_toolbox.py nfd wait_gpu_nodes -

DESCRIPTION
    Waits until NFD finds GPU nodes

./run_toolbox.py nfd wait_labels

NAME
    run_toolbox.py nfd wait_labels - Waits until NFD labels the nodes

SYNOPSIS
    run_toolbox.py nfd wait_labels -

DESCRIPTION
    Waits until NFD labels the nodes

nfd_operator

./run_toolbox.py nfd_operator deploy_from_commit

NAME
    run_toolbox.py nfd_operator deploy_from_commit - Deploys the NFD operator from the given git commit

SYNOPSIS
    run_toolbox.py nfd_operator deploy_from_commit GIT_REPO GIT_REF <flags>

DESCRIPTION
    Deploys the NFD operator from the given git commit

POSITIONAL ARGUMENTS
    GIT_REPO
    GIT_REF
        The git ref to deploy from, e.g. master

FLAGS
    --image_tag=IMAGE_TAG
        Type: Optional[]
        Default: None
        The NFD operator image tag UID.

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py nfd_operator deploy_from_operatorhub

NAME
    run_toolbox.py nfd_operator deploy_from_operatorhub - Deploys the NFD operator from OperatorHub

SYNOPSIS
    run_toolbox.py nfd_operator deploy_from_operatorhub <flags>

DESCRIPTION
    Deploys the NFD operator from OperatorHub

FLAGS
    --channel=CHANNEL
        Type: Optional[]
        Default: None

./run_toolbox.py nfd_operator undeploy_from_operatorhub

NAME
    run_toolbox.py nfd_operator undeploy_from_operatorhub - Undeploys an NFD-operator that was deployed from OperatorHub

SYNOPSIS
    run_toolbox.py nfd_operator undeploy_from_operatorhub -

DESCRIPTION
    Undeploys an NFD-operator that was deployed from OperatorHub

repo

./run_toolbox.py repo validate_role_files

NAME
    run_toolbox.py repo validate_role_files - Ensures that all the Ansible variables defining a filepath (`roles/`) do point to an existing file.

SYNOPSIS
    run_toolbox.py repo validate_role_files -

DESCRIPTION
    Ensures that all the Ansible variables defining a filepath (`roles/`) do point to an existing file.

./run_toolbox.py repo validate_role_vars_used

NAME
    run_toolbox.py repo validate_role_vars_used - Ensure that all the Ansible variables defined are actually used in their role (with an exception for symlinks)

SYNOPSIS
    run_toolbox.py repo validate_role_vars_used -

DESCRIPTION
    Ensure that all the Ansible variables defined are actually used in their role (with an exception for symlinks)

sro

./run_toolbox.py sro capture_deployment_state

NAME
    run_toolbox.py sro capture_deployment_state

SYNOPSIS
    run_toolbox.py sro capture_deployment_state -

./run_toolbox.py sro deploy_from_commit

NAME
    run_toolbox.py sro deploy_from_commit - Deploys the SRO operator from the given git commit

SYNOPSIS
    run_toolbox.py sro deploy_from_commit GIT_REPO GIT_REF <flags>

DESCRIPTION
    Deploys the SRO operator from the given git commit

POSITIONAL ARGUMENTS
    GIT_REPO
        The git repository to deploy from, e.g. https://github.com/openshift-psap/special-resource-operator.git
    GIT_REF
        The git ref to deploy from, e.g. master

FLAGS
    --image_tag=IMAGE_TAG
        Type: Optional[]
        Default: None
        The SRO operator image tag UID.

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py sro run_e2e_test

NAME
    run_toolbox.py sro run_e2e_test - Runs e2e test on the given SRO repo and ref

SYNOPSIS
    run_toolbox.py sro run_e2e_test GIT_REPO GIT_REF

DESCRIPTION
    Runs e2e test on the given SRO repo and ref

POSITIONAL ARGUMENTS
    GIT_REPO
        The git repository to deploy from, e.g. https://github.com/openshift-psap/special-resource-operator.git
    GIT_REF
        The git ref to deploy from, e.g. master

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

./run_toolbox.py sro undeploy_from_commit

NAME
    run_toolbox.py sro undeploy_from_commit - Undeploys an SRO-operator that was deployed from commit

SYNOPSIS
    run_toolbox.py sro undeploy_from_commit GIT_REPO GIT_REF

DESCRIPTION
    Undeploys an SRO-operator that was deployed from commit

POSITIONAL ARGUMENTS
    GIT_REPO
        The git repository to undeploy, e.g. https://github.com/openshift-psap/special-resource-operator.git
    GIT_REF
        The git ref to undeploy, e.g. master

NOTES
    You can also use flags syntax for POSITIONAL ARGUMENTS

ci-artifacts's People

Contributors

arangogutierrez, ashishkamra, azh412, dagrayvid, drewrip, elezar, fcami, jmencak, kdvalin, kpouget, maxusmusti, mkowalski, mresvanis, omertuc, openshift-merge-robot, pacevedom, pamann, sagidayan

ci-artifacts's Issues

Ensure that a GPU node is available before deploying the GPU Operator

Currently, before deploying the GPU Operator on the CI, we do this:

prepare_cluster_for_gpu_operator() {
    entitle
    toolbox/scaleup_cluster.sh
    toolbox/nfd/deploy_from_operatorhub.sh
}

but we never test that NFD correctly labels the nodes and that GPU nodes are indeed available.


So prepare_cluster_for_gpu_operator should be extended with:

toolbox/nfd/wait_for_gpu_nodes.sh

that would wait 5 minutes for a node with one of these labels to show up:

var gpuNodeLabels = map[string]string{
	"feature.node.kubernetes.io/pci-10de.present":      "true",
	"feature.node.kubernetes.io/pci-0302_10de.present": "true",
	"feature.node.kubernetes.io/pci-0300_10de.present": "true",
}
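
A minimal shell sketch of such a wait loop (the label, timeout and polling interval are illustrative; wait_for_gpu_nodes.sh may be implemented differently):

# Poll for up to 5 minutes for a node carrying one of the NVIDIA PCI labels set by NFD
for _ in $(seq 30); do
    if oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true -oname | grep -q .; then
        echo "GPU node found"
        exit 0
    fi
    sleep 10
done
echo "Timed out waiting for an NFD-labeled GPU node" >&2
exit 1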

Include "p3.2xlarge" GPU instance during CI

Include the "p3.2xlarge" GPU instance as well during CI along with the cheaper "g4dn.xlarge"

When we onboard performance tests in the CI, we should run the training workloads on the p3 instance and the inference workloads in the g4dn instance

Use Ansible role "template" files instead of custom sed replacement

We need to have a look at the template built-in feature and see how we could use it instead of using sed:

- name: "Create the OperatorHub subscription for {{ gpu_operator_csv_name }}"
  shell:
    set -o pipefail;
    cat {{ gpu_operator_operatorhub_sub }}
    | sed 's|{{ '{{' }} startingCSV {{ '}}' }}|{{ gpu_operator_csv_name }}|'
    | oc apply -f-
  args:
    warn: false # don't warn about using sed here

This change was "anticipated", as we already use the Jinja2 template style for our template files, eg:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: "{{ startingCSV }}"

Example:

- name: Store the CSV version
  set_fact:
    startingCSV: "{{ gpu_operator_csv_name }}"

- name: "Create the OperatorHub subscription for {{ gpu_operator_csv_name }}"
  template:
    src: "{{ gpu_operator_operatorhub_sub }}"
    dest: "{{ artifact_extra_logs_dir }}/gpu_operator_sub.yml"

TODO: List of the roles with sed transformation removed:

  • nfd_deploy
  • local-ci_deploy
  • gpu_operator_run_gpu-burn
  • gpu_operator_deploy_custom_commit
  • gpu_operator_deploy_from_operatorhub

Fix yaml Lints

After running yamllint, some lint errors were found:

 playbooks/gpu-burst.yml
  5:17      warning  truthy value should be true or false  (truthy)
  9:19      error    trailing spaces  (trailing-spaces)
playbooks/nvidia-gpu-operator-ci.yml
  11:1      error    trailing spaces  (trailing-spaces)
playbooks/openshift-psap-ci.yml
  12:1      error    trailing spaces  (trailing-spaces)
  23:1      error    trailing spaces  (trailing-spaces)
roles/check_deps/tasks/main.yml
  6:18      warning  truthy value should be true or false  (truthy)
  12:18     warning  truthy value should be true or false  (truthy)
  18:18     warning  truthy value should be true or false  (truthy)
  21:8      error    trailing spaces  (trailing-spaces)
  28:44     error    trailing spaces  (trailing-spaces)
  29:1      error    trailing spaces  (trailing-spaces)
  32:14     error    trailing spaces  (trailing-spaces)
roles/nv_gpu/files/001_namespace.yaml
  5:1       error    too many blank lines (1 > 0)  (empty-lines)
roles/nv_gpu/files/003_operator_sub.yaml
  5:36      error    trailing spaces  (trailing-spaces)
  8:31      error    trailing spaces  (trailing-spaces)
  9:30      error    trailing spaces  (trailing-spaces)
  10:41     error    trailing spaces  (trailing-spaces)
roles/nv_gpu/tasks/ci_checks.yml
  5:32      error    trailing spaces  (trailing-spaces)
  6:15      error    trailing spaces  (trailing-spaces)
  13:46     error    trailing spaces  (trailing-spaces)
  21:11     error    trailing spaces  (trailing-spaces)
  27:1      error    too many blank lines (1 > 0)  (empty-lines)
roles/nv_gpu/tasks/install.yml
  4:37      error    trailing spaces  (trailing-spaces)
  11:1      error    trailing spaces  (trailing-spaces)
  14:41     error    trailing spaces  (trailing-spaces)
  21:1      error    trailing spaces  (trailing-spaces)
  24:52     error    trailing spaces  (trailing-spaces)
  31:1      error    trailing spaces  (trailing-spaces)
  34:41     error    trailing spaces  (trailing-spaces)
  41:1      error    trailing spaces  (trailing-spaces)
roles/nv_gpu/tasks/main.yml
  8:1       error    trailing spaces  (trailing-spaces)
  9:26      error    trailing spaces  (trailing-spaces)
roles/openshift_nfd/tasks/ci_checks.yml
  5:32      error    trailing spaces  (trailing-spaces)
  6:15      error    trailing spaces  (trailing-spaces)
  13:24     error    trailing spaces  (trailing-spaces)
  19:15     error    trailing spaces  (trailing-spaces)
roles/openshift_nfd/tasks/main.yml
  17:1      error    trailing spaces  (trailing-spaces)
  50:68     error    trailing spaces  (trailing-spaces)
  53:37     error    trailing spaces  (trailing-spaces)
  57:60     error    trailing spaces  (trailing-spaces)
  62:50     error    trailing spaces  (trailing-spaces)
  80:1      error    trailing spaces  (trailing-spaces)
  81:26     error    trailing spaces  (trailing-spaces)
roles/openshift_nfd/tasks/uninstall_nfd.yml
  14:32     error    trailing spaces  (trailing-spaces)
  18:1      error    too many blank lines (1 > 0)  (empty-lines)
roles/openshift_node/tasks/aws.yml
  80:17     error    trailing spaces  (trailing-spaces)
  106:1     error    trailing spaces  (trailing-spaces)
  112:1     error    trailing spaces  (trailing-spaces)
roles/openshift_node/tasks/main.yml
  2:63      error    trailing spaces  (trailing-spaces)
roles/openshift_node/tasks/scaleup_checks.yml
  19:9      error    trailing spaces  (trailing-spaces)
  26:133    error    trailing spaces  (trailing-spaces)
  33:38     error    trailing spaces  (trailing-spaces)
  34:8      error    trailing spaces  (trailing-spaces)
  35:47     error    trailing spaces  (trailing-spaces)
  36:11     error    too many spaces after colon  (colons)
  37:25     error    trailing spaces  (trailing-spaces)
  41:89     error    trailing spaces  (trailing-spaces)
  48:40     error    trailing spaces  (trailing-spaces)
  49:8      error    trailing spaces  (trailing-spaces)
  50:49     error    trailing spaces  (trailing-spaces)
  51:11     error    too many spaces after colon  (colons)
  52:25     error    trailing spaces  (trailing-spaces)
roles/openshift_odh/tasks/install_required_pkgs.yml
  8:11      warning  truthy value should be true or false  (truthy)
  16:11     warning  truthy value should be true or false  (truthy)
  36:17     warning  truthy value should be true or false  (truthy)
  41:11     warning  truthy value should be true or false  (truthy)
  74:19     warning  truthy value should be true or false  (truthy)
  79:13     warning  truthy value should be true or false  (truthy)
roles/openshift_odh/tasks/main.yml
  21:43     error    trailing spaces  (trailing-spaces)
  27:38     error    trailing spaces  (trailing-spaces)
  45:39     error    trailing spaces  (trailing-spaces)
roles/openshift_odh/tasks/uninstall_odh.yml
  21:80     error    trailing spaces  (trailing-spaces)
  25:32     error    trailing spaces  (trailing-spaces)
roles/openshift_sro/tasks/main.yml
  17:42     error    trailing spaces  (trailing-spaces)
  20:37     error    trailing spaces  (trailing-spaces)
roles/openshift_sro/tasks/uninstall_sro.yml
  14:32     error    trailing spaces  (trailing-spaces) 

Operator install from OperatorHub fails because the package is not found

From time to time, in the CI, NFD or the GPU Operator fail to be installed from OperatorHub, because the PackageManifest is not found:

<command> oc get packagemanifests/gpu-operator-certified -n openshift-marketplace
<stderr> Error from server (NotFound): packagemanifests.packages.operators.coreos.com "gpu-operator-certified" not found
<command> oc get packagemanifests/nfd -n openshift-marketplace
<stderr> Error from server (NotFound): packagemanifests.packages.operators.coreos.com "nfd" not found

ci-artifacts maintenance overview

Some things I have in mind for improving/fixing ci-artifacts:

  • Fix the GPU Operator deploy_from_operatorhub to work with v1.9.0-beta and v1.9.0 when released

    • the single-namespace + entitlement-free new features change the way the GPU Operator is deployed. I addressed it for the master branch testing, but I couldn't do it for the OperatorHub deployment until released by NVIDIA (the beta was released last week)
    • Fixed with #289
  • Update gpu_operator_set_namespace to use ClusterPolicy.status.namespace (see PR)

    • will be simpler than the code I wrote before this PR was merged
    • EDIT: WONT FIX, oc get pod -l app.kubernetes.io/component=gpu-operator -A -ojsonpath={.items[].metadata.namespace} is simple enough
  • Enable testing the GPU Operator v1.9 (when released, ie > 2021-12-03)

  • Call hack/must-gather.sh script instead of custom scripts

  • Turn the image helper BuildConfig into a simple DockerFile + quay.io "build on master-merge"

    • this image is 100% static and never updated; there's no need to rebuild it for every master GPU Operator test
    • this image is duplicated in NFD master test
    • takes 8 minutes to build
2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021  23:32:23 +0000 (0:00:00.697)       0:00:08.016 ******* 
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built
  • Double check the alert-testing of the GPU Operator master branch

    • I'm doubtful about what happens with the driver-cannot-be-built alert wrt entitlement-free deployments
  • Refresh the versions used for the GPU Operator upgrade testing (currently only 4.6 --> 4.7)

  • Confirm the fate of testing the GPU Operator on OCP 4.6 clusters

    • EUS release
  • Enable testing of the GPU Operator on OCP 4.10

  • Improve the GPU Operator and rewrite gpu_operator_get_csv_version

    • I think it's becoming critical that the GPU Operator image exposes the GPU Operator version (v1.x.y) and the git commit used to build it
    • in the CI I already include the commit hash + commit date in the master-branch bundle version (eg, 21.11.25-git.57914a2), but that's not enough as this information isn't part of the operator image (recently the CI was using the same outdated image for a week and we failed to notice it until we had to test custom GPU Operator PRs)
    • once this is done, update this role to avoid fetching the information from the CSV, whenever possible (won't be backported)

Measure the code coverage of the presubmit tests

In parallel with implementing unit testing (#92), it would be interesting to measure the code coverage of the GPU Operator testing + unit testing, to ensure that all the playbook tasks are executed at least once as part of the presubmit testing.

This ticket will track the design and development of this task.

Cannot modify the GPU Operator ClusterPolicy before deploying it

Currently, the GPU Operator ClusterPolicy is fetched from the ClusterServiceVersion alm-example and instantiated right away.

However, in some cases, the default content is not the one we desire. See for instance this unmerged commit, where we need to set the repoConfig stanza when running with OCP 4.8 (using RHEL beta repositories).

    toolbox/gpu-operator/deploy_from_operatorhub.sh 
    [...]

    if oc version | grep -q "Server Version: 4.8"; then
        echo "Running on OCP 4.8, enabling RHEL beta repository"
        ./toolbox/gpu-operator/set_repo-config.sh --rhel-beta
    fi

    toolbox/gpu-operator/wait_deployment.sh

Another example would be when we want to customize the operator or operand image path to use custom ones.

The GPU Operator DaemonSets are never updated once created, so if they are created with the wrong values, the DaemonSets will never be fixed.

The hack above works (hopefully) because the driver container will fail to deploy without the right repoConfig configuration, so it's safe to manually delete it after the update. But in the general case, the driver container should never be deleted once running, as the nvidia driver cannot be removed from the kernel while other processes (workloads or operands) are using it.


We should find a way to allow patching the ClusterPolicy before deploying it. The solution should be generic, so that any kind of modification can be performed during the deployment.
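
One possible direction (a sketch only; the CSV name and the repoConfig fields shown are illustrative assumptions): extract the ClusterPolicy from the CSV alm-examples annotation, patch it, and only then apply it:

# Hypothetical example: patch the repoConfig stanza before creating the ClusterPolicy
CSV_NAME=gpu-operator-certified.v1.8.0   # must match the installed CSV
oc get csv "$CSV_NAME" -n openshift-operators \
    -o jsonpath='{.metadata.annotations.alm-examples}' \
  | jq '.[0] | .spec.driver.repoConfig = {"configMapName": "repo-config", "destinationDir": "/etc/yum.repos.d"}' \
  | oc apply -f-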

Detect when the GPU Operator fails because of a cluster upgrade

Until v1.6.2 (included) of the GPU Operator, OpenShift cluster upgrade is not supported, because the driver DaemonSet receives the RHEL_VERSION as a DaemonSet/Pod environment variable.

This makes the driver Pod unable to build the nvidia driver, because it cannot fetch any package.
Example of lines from the driver logs:

+ echo -e 'Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 4.18.0-240.22.1.el8_3.x86_64\n'

el8_3 means that a RHEL 8.3 kernel is running, but

dnf -q -y --releasever=8.2

this --releasever=8.2 shows that the DaemonSet is configured with RHEL_VERSION=8.2


It would be easy to check whether the OCP_VERSION in the driver DaemonSet matches the ocp_release variable that we already capture in the Ansible playbooks.

This test can be integrated in the diagnose.sh script.
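
A sketch of such a check (the DaemonSet name, namespace and jsonpath are assumptions and would need to be adjusted to the actual deployment):

# Compare the OCP_VERSION baked into the driver DaemonSet with the cluster version
ds_ocp_version=$(oc get daemonset/nvidia-driver-daemonset -n gpu-operator-resources \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="OCP_VERSION")].value}')
cluster_ocp_version=$(oc get clusterversion/version -o jsonpath='{.status.desired.version}' | cut -d. -f1-2)
if [ "$ds_ocp_version" != "$cluster_ocp_version" ]; then
    echo "WARNING: driver DaemonSet OCP_VERSION ($ds_ocp_version) != cluster version ($cluster_ocp_version)"
fi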

Detect when the nightly CI fails because of a cluster shutdown

Every once in a while, the nightly testing fails because the cluster becomes unreachable:

roles/capture_environment/tasks/main.yml:8
TASK: capture_environment : Store OpenShift YAML version
----- FAILED ----
msg: non-zero return code

<command> oc version -oyaml > /logs/artifacts/233800__cluster__capture_environment/ocp_version.yml

<stderr> The connection to the server api.ci-op-75fhpdb3-3c6fc.origin-ci-int-aws.dev.rhcloud.com:6443 was refused - did you specify the right host or port?
----- FAILED ----

This kind of failure is independent of the GPU Operator testing, and it should be made clear in the CI-Dashboard (the Prow infrastructure restarts the testing when this happens). An orange dot could do the job, with a label like "cluster issue detected".

To detect this, the must-gather script could simply create a cluster-down file when oc version doesn't work.
The presence of this file would tell the ci-dashboard to set the orange flag.
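
A minimal sketch of that check (the artifacts directory variable is an assumption):

# In the must-gather script: flag the run when the API server is unreachable
if ! oc version > /dev/null 2>&1; then
    touch "${ARTIFACT_DIR}/cluster-down"
    exit 0
fi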

GPU Operator: run gpu-operator test_operatorhub should not specify the exact operator version

Currently, we are nightly testing the released versions of the GPU Operator with this entrypoint:

run gpu-operator test_operatorhub 1.8.0 v1.8

where 1.8.0 specifies the version of the operator to be installed, and v1.8 specifies the OLM channel.

It would be better to only specify the channel, so that the latest minor version is installed:

run gpu-operator test_operatorhub v1.8

When the gpu_operator_deploy_from_operatorhub role was originally written (before v1.7), the GPU Operator only had a stable channel, so the channel could be omitted, and the full version had to be specified.
Now that NVIDIA switched to a dedicated channel per 1.X release, we could update the role/entrypoint to test "the latest minor available for a given channel".

In addition, the ci-dashboard was recently updated to show the exact version of the GPU Operator being installed & tested (instead of a hard-coded value).

Use the bundle deployment to test a specific commit of the GPU Operator

Currently, when we want to test a specific commit of the GPU Operator, we internally use the GPU Operator helm-chart to configure and deploy the resources.

test_commit() {
    CI_IMAGE_GPU_COMMIT_CI_REPO="${1:-https://github.com/NVIDIA/gpu-operator.git}"
    CI_IMAGE_GPU_COMMIT_CI_REF="${2:-master}"

    CI_IMAGE_GPU_COMMIT_CI_IMAGE_UID="ci-image"

    echo "Using Git repository ${CI_IMAGE_GPU_COMMIT_CI_REPO} with ref ${CI_IMAGE_GPU_COMMIT_CI_REF}"

    prepare_cluster_for_gpu_operator
    toolbox/gpu-operator/deploy_from_commit.sh "${CI_IMAGE_GPU_COMMIT_CI_REPO}" \
                                               "${CI_IMAGE_GPU_COMMIT_CI_REF}" \
                                               "${CI_IMAGE_GPU_COMMIT_CI_IMAGE_UID}"
    validate_gpu_operator_deployment
}

This works properly; however, it would be better to test the bundle resources, as this is the method that will be used to deploy on OpenShift, including in the nightly testing of the master branch.

Add the ability to entitle only GPU nodes

Currently, the entitlement is performed cluster-wide, so all the nodes of the cluster have to be rebooted when the entitlement is deployed.

In order to avoid rebooting nodes that do not require entitlement, we need to update

  1. the MachineConfig resources to target only a specific set of nodes
  2. the MachineSet to apply a label to the node when it gets created (instead of relying on NFD to discover that it has a GPU)
  3. the entitlement test pod, to make sure it lands on an entitled node.

I think it would be good to keep the existing behavior as the default for the toolbox commands, but add a --label ... flag to support this optimization.

entitlement: using the same content for the entitlement.pem and entitlement-key.pem isn't safe

As per this issue openshift-psap/blog-artifacts#6, using the same content for entitlement.pem and entitlement-key.pem isn't safe,

as confirmed by this command:

$ NAME=key
$ podman run --rm -it -v $KEY:/etc/pki/entitlement/$NAME.pem registry.access.redhat.com/ubi8-minimal:8.3-298 bash -x -c "cp /etc/pki/entitlement/$NAME.pem /etc/pki/entitlement/$NAME-key.pem; microdnf install kernel-devel"

+ cp /etc/pki/entitlement/key.pem /etc/pki/entitlement/key-key.pem
+ microdnf install kernel-devel
Downloading metadata...
Downloading metadata...
Downloading metadata...
Downloading metadata...
error: cannot update repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried; Last error: Curl error (58): Problem with the local SSL certificate for https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/repodata/repomd.xml [unable to set private key file: '/etc/pki/entitlement/key-key-key.pem' type PEM]
  • NAME=entite --> doesn't work
  • NAME=entitlement --> works

Delete 'release-4.x' branches

I created this issue to discuss the topic and understand whether the idea makes sense.

Currently, the code of this repository isn't specific to any version of OpenShift, so I would suggest getting rid of the release-4.x branches and using a simpler workflow, with a master branch defining the way to test the GPU Operator on all the OpenShift releases.

See this commit kpouget/release@3790f84 for the patch that should be applied to openshift-release repository.

Typos

./roles/entitlement_test_wait_deployment/defaults/main/config.yml:2: successfull ==> successful
./roles/gpu_operator_run_gpu-burn/tasks/main.yml:53: Instanciate ==> Instantiate
./roles/nfd_test_wait_labels/tasks/main.yml:4: quering ==> querying

Allow scaling a cluster up and down with N nodes

Currently, toolbox/cluster/scaleup.sh [instance-type] allows only adding new MachineSets with a given instance-type.
For testing the GPU Operator support of scale-up and scale-down, we would need to be able to add new GPU nodes to a cluster, and potentially scale it to 0.

So,

  • 1/ toolbox/cluster/scaleup.sh should be extended (or with another command) to be able to set the number of desired nodes of a given instance type, eg:
toolbox/cluster/scaleup.sh <instance-type> # make sure that machines with <instance-type> are available
toolbox/cluster/scaleup.sh <instance-type> N # make sure that N machines with <instance-type> are available
  • 2/ the nightly CI entrypoint should be extended to ensure that this capability works properly in the GPU Operator, with tests like:
Deploy the GPU Operator when 0 GPU nodes are available
Scale-up the cluster to 1 GPU node, make sure that the GPU of the nodes gets available
Scale-up the cluster to 2 GPU nodes, make sure that the 2 GPUs get available
Scale the cluster down to 1 GPU node, make sure that one node disappears (see the sketch below)
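
With a set_scale-style entrypoint (the toolbox now documents ./run_toolbox.py cluster set_scale above), the sequence could be sketched as follows (illustrative only):

./run_toolbox.py cluster set_scale g4dn.xlarge 0   # deploy the GPU Operator with no GPU node
./run_toolbox.py gpu_operator deploy_from_operatorhub
./run_toolbox.py cluster set_scale g4dn.xlarge 1   # scale up to 1 GPU node
./run_toolbox.py nfd wait_gpu_nodes
./run_toolbox.py cluster set_scale g4dn.xlarge 2   # scale up to 2 GPU nodes
./run_toolbox.py cluster set_scale g4dn.xlarge 1   # scale back down to 1 GPU node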

Implement some unit testing for toolbox scripts

The toolbox scripts are used to test the deployment of the GPU Operator, so most of their code (Ansible playbooks and roles) is tested before merging a new PR into the master branch (/test gpu-operator-e2e) and in the nightly testing.

But the GPU Operator testing doesn't cover 100% of the toolbox features, and some flags and code branches might be left untested. This is for instance what happened with toolbox/entitlement/test.sh, which isn't executed in the GPU Operator testing and got broken when no flag was passed (see the fix in cf8a276).

This ticket will track the progress of the design and development of unit tests.

Create a full weekly suite for the PSAP operators suite

We should have a BIG test that runs weekly, with chaos testing and other best-practice test paths.

the idea would be to:

  • run the basic installs, let it run for a couple of minutes
  • run a very small ML perf benchmark
  • run a chaos run, randomly deleting components from the operators and monitoring if the operators recover from it
  • run a scale up and scale down test (GPU and NFD test)
  • run a cluster upgrade

This will test and stress the PSAP operators against common real-world scenarios, so we can be prepared.

Documentation

Update the README, and also write proper documentation about the available playbooks.

Test OpenShift upgrade with GPU workload running

Currently, for the upgrade scenario, we ...

  1. install and test the GPU Operator
  2. trigger the cluster upgrade
  3. test the GPU Operator.

We need to add two steps:

  • 2.5 start a long running GPU workload (gpu burn, but without waiting for completion)
  • 3.5 test what happened to the workload (eg, wait for it to be restarted and running)

oc isn't part of the CI image

Currently, we download oc (and kubectl), helm and operator-sdk as part of the precheck() call of build/root/usr/local/bin/run.

I think it would be better to fetch these binaries when building the image.

GPU Operator: test PROXY configuration

We currently do not have any test validating the GPU Operator connected to the Internet through a Cluster PROXY.

This config appeared to be buggy in the GPU Operator 1.8.0 and 1.8.1, due to non-deterministic ordering of the driver-container env entries, leading to a constant update of the driver DaemonSet and recreation of the Pods.

We should have a test case covering this use case, maybe running once a week, maybe with an in-cluster proxy relay as a first step.

Test the cluster upgrade support of the GPU Operator

The GPU Operator should seamlessly support the upgrade of the OpenShift cluster (in a forthcoming release, at least).

We want to be able to test this upgrade support in the nightly CI.

OpenShift Prow doesn't support our upgrade use case, which is:

  1. install and test the GPU Operator as usual
  2. upgrade the cluster
  3. test the GPU Operator

(operators are usually preinstalled in OpenShift, or straightforward to install via OperatorHub, but the GPU Operator requires the deployment of the entitlement, the scale-up of the cluster with GPU nodes and the deployment of NFD ...).

This ticket will track the development of this feature.

set_scale.sh: cannot specify the source machineset

Currently, ./toolbox/cluster/set_scale.sh cannot be customized to decide which MachineSet will be used to derive the new MachineSet:

- name: Get the names of an existing worker machinesets (of any instance type)
  command:
    oc get machinesets -n openshift-machine-api -o
    jsonpath='{range .items[?(@.spec.template.metadata.labels.machine\.openshift\.io/cluster-api-machine-role=="worker")]}{.metadata.name}{"\n"}{end}'
  register: oc_get_machinesets
  failed_when: not oc_get_machinesets.stdout

it would be nice to have the ability to easily override oc_get_machinesets.stdout to specify which machineset to use as a base.

Quick and dirty example of what I did to work around that:

- name: Get the names of an existing worker machinesets (of any instance type)
  command:
    echo kpouget-20210519-kf6rn-worker-eu-central-1b
  register: oc_get_machinesets
  failed_when: not oc_get_machinesets.stdout

The reason for that is that the instance-type I want isn't available in eu-central-1a, only in 1b

Prow CI: Upgrade config not using predefined steps

Currently, the cluster upgrade testing is performed "manually" in the cluster_upgrade_to_image role.

This simple playbook only waits for the end of the upgrade, but doesn't perform any other kind of test.

The reason for this choice is that

  1. the Prow CI steps for upgrading the cluster do not support running custom repository tests, which is mandatory for the GPU Operator (entitlement, installation of dependencies, initial deployment and validation of the GPU Operator).
  2. we wanted to be able to rapidly validate NVIDIA's implementation of the cluster upgrade support.

In the future, it would be important to move to a proper CI upgrade step.

Generic command for installing operators from OperatorHub

./run_toolbox.py nfd_operator deploy_from_operatorhub

./run_toolbox.py gpu_operator deploy_from_operatorhub
${OPERATOR_CHANNEL:-}
${OPERATOR_VERSION:-}
--namespace ${OPERATOR_NAMESPACE}

The behaviour of these two commands must be very similar;
I think it shouldn't be hard to rewrite ./run_toolbox.py gpu_operator deploy_from_operatorhub into a generic command, something like:

deploy_from_operatorhub
--catalog=certified-operators
--name=gpu-operator
--namespace=... # optional
--channel=v1.9.0 # optional, can use defaultChannel
--csv-name=... # optional, can use defaultCSV
--deploy-default-cr=True

this would allow us to install any operator from the command line.

#300 gpu_operator_run_gpu-burn: make gpu-burn execution easier to reproduce
#301 benchmarking: make the execution easier to reproduce

could be rewritten with these ^^^ two PRs in mind, to make the install easy to reproduce with the execution artifacts.

Learn about Ansible tags and refactor the roles to use them

I am watching an Ansible course as part of Red Hat Day of Learning, and they explain the concept of Ansible tags: execute only the tasks matching a command-line --tags name or --skip-tags name2

This feature could be useful for us; it needs to be investigated.
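
For reference, tags are selected on the ansible-playbook command line; a minimal sketch (the playbook path is taken from this repository, the tag names are hypothetical):

# Run only the tasks tagged 'deploy', skipping those tagged 'entitlement'
ansible-playbook playbooks/nvidia-gpu-operator-ci.yml --tags deploy --skip-tags entitlement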

Refactor the ansible roles

We are currently using some roles (eg nv_gpu) to perform many different tasks, depending on the flags we activate:

  - name: Install NFD-operator from OperatorHub
    include_tasks: roles/nv_gpu/tasks/install_nfd.yml
    when: install_nfd_operator_from_hub == "yes"

  - name: Wait for NFD-labeled GPU nodes to appear
    include_tasks: roles/nv_gpu/tasks/test_nfd_gpu.yml
    when: nfd_test_gpu_nodes == "yes"

  - name: Install GPU-operator from OperatorHub
    include_tasks: roles/nv_gpu/tasks/install_nv.yml
    when: install_gpu_operator_from_hub == "yes"

I don't think this is the way ansible roles are supposed to be used, as it causes many tasks to be shown as "skipped", but still visible in the logs.

This ticket will track the refactoring of these big roles into smaller chunks, doing only one task (=one role per toolbox script, more or less).

Ansible-lint tests only modified files

By default, ansible-lint only tests the files modified by the PR, and hence is never run over the full repository.

$ ansible-lint -v --force-color -c config/ansible-lint.yml playbooks roles
INFO     Discovering files to lint: git ls-files -z

vs

$ ansible-lint -v --force-color -c config/ansible-lint.yml $(find . -name *.yml)
# .ansible-lint
warn_list:  # or 'skip_list' to silence them completely
  - internal-error  # Unexpected internal error
  - syntax-check  # Ansible syntax check failed
Finished with 40 failure(s), 0 warning(s) on 195 files.

we need to have a look at these warnings/errors and fix them.

Create a must-gather image for the GPU Operator

OpenShift allows capturing key information about the cluster with the must-gather command. This command allows passing a custom image, eg:

oc adm must-gather --image=quay.io/kubevirt/must-gather:latest --dest-dir=/tmp/must  

See this document for an explanation about the design, the main script and the secondary scripts.

The requirements for the image are simple (a sketch of a gather script follows the list):

To provide your own must-gather image, it must....

  • Must have a zero-arg, executable file at /usr/bin/gather that does your default gathering
  • Must produce data to be copied back at /must-gather. The data must not contain any sensitive data. We don't string PII information, only secret information.
  • Must produce a text /must-gather/version that indicates the product (first line) and the version (second line, major.minor.micro.qualifier), so that programmatic analysis can be developed.
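
A minimal sketch of a gather script satisfying these requirements (the resources collected and the version string are illustrative assumptions):

#!/bin/bash
# /usr/bin/gather: zero-arg entrypoint of the must-gather image
mkdir -p /must-gather
# product (first line) and version (second line), as required above
printf 'gpu-operator-must-gather\n0.0.1.alpha\n' > /must-gather/version
# collect the GPU Operator namespace and the ClusterPolicy
oc adm inspect --dest-dir=/must-gather ns/gpu-operator-resources
oc get clusterpolicy -oyaml > /must-gather/clusterpolicy.yaml
oc get nodes -owide > /must-gather/nodes.txt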

GPU Operator: Prepare a disconnected driver-container POC

Disconnected / air-gapped environments cannot access the Internet, only a limited set of image registries / package mirrors.

This POC will demonstrate how to build a custom GPU Operator driver image (with Internet access), then build and load the NVIDIA driver module on GPU nodes without Internet access.

documentation: add roles/*/README.md descriptions

To improve the reusability of our roles, and potentially their use in Ansible Galaxy, we need to add README.md files to the different roles we create, describing their input parameters, dependencies, etc.

ci-artifacts as a PSAP toolbox

This issue tracks the progress of the PSAP toolbox:

GPU Operator

  • Deploy from OperatorHub
    • allow deploying an older version #76
toolbox/gpu-operator/deploy_from_operatorhub.sh
toolbox/gpu-operator/undeploy_from_operatorhub.sh
  • Deploy from helm
toolbox/gpu-operator/list_version_from_helm.sh
toolbox/gpu-operator/deploy_with_helm.sh <helm-version>
toolbox/gpu-operator/undeploy_with_helm.sh
  • Deploy from a custom commit.
toolbox/gpu-operator/deploy_from_commit.sh <git repository> <git reference> [gpu_operator_image_tag_uid]
Example: 
toolbox/gpu-operator/deploy_from_commit.sh https://github.com/NVIDIA/gpu-operator.git master
  • Run the GPU Operator deployment validation tests
toolbox/gpu-operator/run_ci_checks.sh
  • Run GPU Burst to validate that the GPUs can run workloads

  • Capture possible GPU Operator issues (entitlement, NFD labelling, operator deployment, state of resources in gpu-operator-resources, ...)

    • already partly done inside the CI, but we should improve the toolbox aspect

NFD

  • Deploy the NFD operator from OperatorHub:
toolbox/nfd/deploy_from_operatorhub.sh
toolbox/nfd/undeploy_from_operatorhub.sh
  • Control the channel to use from the command-line

  • Test the NFD deployment #78

    • test with the NFD if GPU nodes are available
    • wait with the NFD for GPU nodes to become available #78
toolbox/nfd/has_gpu_nodes.sh
toolbox/nfd/wait_gpu_nodes.sh

Cluster

  • Add a GPU node on AWS
./toolbox/scaleup_cluster.sh
  • Specify a machine type in the command-line, and skip scale-up if a node with the given machine-type is already present
./toolbox/scaleup_cluster.sh <machine-type>
  • Entitle the cluster, by passing a PEM file, checking if they should be concatenated or not, etc. And do nothing if the cluster is already entitled
toolbox/entitlement/deploy.sh --pem /path/to/pem
toolbox/entitlement/deploy.sh --machine-configs /path/to/machineconfigs
toolbox/entitlement/undeploy.sh
toolbox/entitlement/test.sh
toolbox/entitlement/wait.sh
  • Capture all the clues required to understand entitlement issues
toolbox/entitlement/inspect.sh
  • Deployment of an entitled cluster
    • already coded, but we need to integrate this repo within the toolbox
    • deploy a cluster with 1 master node

CI

  • Build the image used for the Prow CI testing, and run a given command in the Pod
Usage:   toolbox/local-ci/deploy.sh <ci command> <git repository> <git reference> [gpu_operator_image_tag_uid]
Example: toolbox/local-ci/deploy.sh 'run gpu-ci' https://github.com/openshift-psap/ci-artifacts.git master

toolbox/local-ci/cleanup.sh
