
krkn-hub's Issues

Add demo to the readme

We need to add a short demo of krkn-hub to the GitHub README; this will give a quick overview of the tooling's capabilities without having to go through the docs. https://asciinema.org/ might help with this.

CPU/MEM Hog scenarios - pod generated in default namespace is not cleaned/removed

When we trigger a CPU/memory hog scenario, a pod is scheduled in the default namespace. After the scenario completes, this pod lingers and is never removed.

In my view, this pod should be cleaned up once the scenario test is completed.

To verify, after the scenario is completed run $ oc get pods -n default
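Until the scenario cleans up after itself, a small helper can pick the leftover pod out of the listing. This is a workaround sketch; the "hog" name pattern is an assumption, so check the actual pod name in your cluster first:

```shell
# Reads "oc get pods -o name"-style lines on stdin and prints the ones that
# look like leftover hog pods ("hog" pattern is an assumption).
find_hog_pods() {
  grep 'hog' || true
}

# Usage against a live cluster:
#   oc get pods -n default -o name | find_hog_pods | xargs -r oc delete -n default
```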

[ERROR] scenario: scenarios/pvc_scenario.yaml failed with exception: <class 'UnboundLocalError'> file: /root/kraken/kraken/pvc/pvc_scenario.py line: 141

scenarios/pvc_scenario.yaml failed with exception: <class 'UnboundLocalError'>

podman run -it --rm --name=disk --net=host --env-host=true -v $KUBECONFIG:/root/.kube/config -v $SCENARIO:/root/kraken/scenarios/pvc_scenario.yaml -d krkn-hub:pvc-scenarios

2024-07-17 02:25:10,381 [INFO] Starting kraken
2024-07-17 02:25:10,390 [INFO] Initializing client to talk to the Kubernetes cluster
2024-07-17 02:25:10,390 [INFO] Generated a uuid for the run: 232d86a6-04ad-4d5e-b5de-8187b0f8a239
2024-07-17 02:25:20,834 [INFO] Fetching cluster info
2024-07-17 02:25:22,498 [INFO] Cluster version is 4.12.32
2024-07-17 02:25:22,498 [INFO] Server URL: https://<abc.com>:6443
2024-07-17 02:25:22,499 [INFO] Daemon mode not enabled, will run through 1 iterations

2024-07-17 02:25:22,499 [INFO] Executing scenarios for iteration 0
2024-07-17 02:25:22,499 [INFO] Running PVC scenario
2024-07-17 02:25:22,501 [INFO] Input params:
pvc_name: ''
pod_name: 'virt-launcher-rodan-223249-137'
namespace: 'virtualmachines'
target_fill_percentage: '75%'
duration: '60s'
2024-07-17 02:25:43,240 [INFO] Volume name: os-disk
2024-07-17 02:25:43,241 [INFO] PVC name: rodan-223249-137-os
2024-07-17 02:25:43,241 [ERROR] scenario: scenarios/pvc_scenario.yaml failed with exception: <class 'UnboundLocalError'> file: /root/kraken/kraken/pvc/pvc_scenario.py line: 141

$ cat scenarios/pvc_scenario.yaml
pvc_scenario:
  pvc_name:
  pod_name: virt-launcher-rodan-223249-137
  namespace: virtualmachines
  fill_percentage: 75
  duration: 60
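For reference, this is the classic shape of an UnboundLocalError: a variable assigned only inside a conditional branch is referenced unconditionally later, which blows up when the branch never runs (here, plausibly because pvc_name is empty and the lookup finds nothing). A hypothetical sketch, not the actual pvc_scenario.py code:

```python
# Sketch of the failure mode (illustrative function, not pvc_scenario.py):
# mount_path is only bound when a line matches, so the return statement
# raises UnboundLocalError when no line contains "/dev/".
def resolve_mount_path(df_output: str) -> str:
    for line in df_output.splitlines():
        if "/dev/" in line:
            mount_path = line.split()[-1]
    return mount_path  # UnboundLocalError if the loop never matched


try:
    resolve_mount_path("tmpfs 100 0 100 0% /tmp")
except UnboundLocalError as exc:
    print(f"reproduced: {exc}")
```

The fix is to initialize the variable up front (or fail early with a clear error when the lookup comes back empty) rather than letting the bare name escape the branch.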

power-outage scenario container image exits prematurely

The container exits prematurely because KUBECONFIG is not set properly inside the container. But when I run the container with /bin/bash, I can see the kubeconfig mounted under /root/.kube/config, and I'm not sure why unset KUBECONFIG is run, per the logs below.

$ podman run -it --rm --name=power --net=host --env-host=true -v /tmp/kubeconfig-42:/root/.kube/config:Z quay.io/krkn-chaos/krkn-hub:power-outages

  • source /root/main_env.sh
    ++ export CERBERUS_ENABLED=False
    ++ CERBERUS_ENABLED=False
    ++ export CERBERUS_URL=http://0.0.0.0:8080
    ++ CERBERUS_URL=http://0.0.0.0:8080
    ++ export KRKN_KUBE_CONFIG=/root/.kube/config
    ++ KRKN_KUBE_CONFIG=/root/.kube/config
    ++ export WAIT_DURATION=60
    ++ WAIT_DURATION=60
    ++ export ITERATIONS=1
    ++ ITERATIONS=1
    ++ export DAEMON_MODE=False
    ++ DAEMON_MODE=False
    ++ export RETRY_WAIT=120
    ++ RETRY_WAIT=120
    ++ export PUBLISH_KRAKEN_STATUS=False
    ++ PUBLISH_KRAKEN_STATUS=False
    ++ export SIGNAL_ADDRESS=0.0.0.0
    ++ SIGNAL_ADDRESS=0.0.0.0
    ++ export PORT=8081
    ++ PORT=8081
    ++ export SIGNAL_STATE=RUN
    ++ SIGNAL_STATE=RUN
    ++ export DEPLOY_DASHBOARDS=False
    ++ DEPLOY_DASHBOARDS=False
    ++ export CAPTURE_METRICS=False
    ++ CAPTURE_METRICS=False
    ++ export ENABLE_ALERTS=False
    ++ ENABLE_ALERTS=False
    ++ export ALERTS_PATH=config/alerts
    ++ ALERTS_PATH=config/alerts
    ++ export ES_SERVER=http://0.0.0.0:9200
    ++ ES_SERVER=http://0.0.0.0:9200
    ++ export CHECK_CRITICAL_ALERTS=False
    ++ CHECK_CRITICAL_ALERTS=False
    ++ export KUBE_BURNER_URL=https://github.com/cloud-bulldozer/kube-burner/releases/download/v1.7.0/kube-burner-1.7.0-Linux-x86_64.tar.gz
    ++ KUBE_BURNER_URL=https://github.com/cloud-bulldozer/kube-burner/releases/download/v1.7.0/kube-burner-1.7.0-Linux-x86_64.tar.gz
    ++ export TELEMETRY_ENABLED=False
    ++ TELEMETRY_ENABLED=False
    ++ export TELEMETRY_API_URL=https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production
    ++ TELEMETRY_API_URL=https://ulnmf9xv7j.execute-api.us-west-2.amazonaws.com/production
    ++ export TELEMETRY_USERNAME=redhat-chaos
    ++ TELEMETRY_USERNAME=redhat-chaos
    ++ export TELEMETRY_PASSWORD=
    ++ TELEMETRY_PASSWORD=
    ++ export TELEMETRY_PROMETHEUS_BACKUP=True
    ++ TELEMETRY_PROMETHEUS_BACKUP=True
    ++ export TELEMTRY_FULL_PROMETHEUS_BACKUP=False
    ++ TELEMTRY_FULL_PROMETHEUS_BACKUP=False
    ++ export TELEMETRY_BACKUP_THREADS=5
    ++ TELEMETRY_BACKUP_THREADS=5
    ++ export TELEMETRY_ARCHIVE_PATH=/tmp
    ++ TELEMETRY_ARCHIVE_PATH=/tmp
    ++ export TELEMETRY_MAX_RETRIES=0
    ++ TELEMETRY_MAX_RETRIES=0
    ++ export TELEMETRY_RUN_TAG=chaos
    ++ TELEMETRY_RUN_TAG=chaos
    ++ export TELEMETRY_ARCHIVE_SIZE=1000
    ++ TELEMETRY_ARCHIVE_SIZE=1000
    ++ export TELEMETRY_LOGS_BACKUP=False
    ++ TELEMETRY_LOGS_BACKUP=False
    ++ export 'TELEMETRY_FILTER_PATTERN=["(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+","kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+","(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+"]'
    ++ TELEMETRY_FILTER_PATTERN='["(\w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\.\d+).+","kinit (\d+/\d+/\d+\s\d{2}:\d{2}:\d{2})\s+","(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z).+"]'
    ++ export TELEMETRY_CLI_PATH=
    ++ TELEMETRY_CLI_PATH=
    ++ export TELEMETRY_EVENTS_BACKUP=True
    ++ TELEMETRY_EVENTS_BACKUP=True
    ++ unset KUBECONFIG
  • source /root/env.sh
    ++ export SHUTDOWN_DURATION=1200
    ++ SHUTDOWN_DURATION=1200
    ++ export CLOUD_TYPE=aws
    ++ CLOUD_TYPE=aws
    ++ export TIMEOUT=180
    ++ TIMEOUT=180
    ++ export SCENARIO_TYPE=cluster_shut_down_scenarios
    ++ SCENARIO_TYPE=cluster_shut_down_scenarios
    ++ export 'SCENARIO_FILE=- scenarios/cluster_shut_down_scenario.yml'
    ++ SCENARIO_FILE='- scenarios/cluster_shut_down_scenario.yml'
    ++ export SCENARIO_POST_ACTION=
    ++ SCENARIO_POST_ACTION=
  • source /root/common_run.sh
  • config_setup
  • envsubst
  • checks
  • check_oc
  • log 'Checking if OpenShift client is installed'
    ++ date +%d-%m-%YT%H:%M:%S
  • echo -e '\033[1m10-07-2024T02:45:06 Checking if OpenShift client is installed\033[0m'
    10-07-2024T02:45:06 Checking if OpenShift client is installed
  • which oc
  • alias
  • eval declare -f
    ++ declare -f
  • /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot oc
    /usr/local/bin/oc
  • check_kubectl
  • log 'Checking if kubernetes client is installed'
    ++ date +%d-%m-%YT%H:%M:%S
  • echo -e '\033[1m10-07-2024T02:45:06 Checking if kubernetes client is installed\033[0m'
    10-07-2024T02:45:06 Checking if kubernetes client is installed
  • which kubectl
  • alias
  • eval declare -f
    ++ declare -f
  • /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot kubectl
    /usr/local/bin/kubectl
  • check_cluster_version
  • kubectl version
    WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
    Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"33a7a8bcccdd1c7c0e2f51609d832d31232d2f26", GitTreeState:"clean", BuildDate:"2023-12-13T22:07:37Z", GoVersion:"go1.20.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
    Kustomize Version: v5.0.1
    Unable to connect to the server: EOF
  • log 'Unable to connect to the cluster, please check if it'''s up and make sure the KUBECONFIG is set correctly'
    ++ date +%d-%m-%YT%H:%M:%S
  • echo -e '\033[1m10-07-2024T02:45:17 Unable to connect to the cluster, please check if it'''s up and make sure the KUBECONFIG is set correctly\033[0m'
    10-07-2024T02:45:17 Unable to connect to the cluster, please check if it's up and make sure the KUBECONFIG is set correctly
  • exit 1

simulate a disk failure on the cluster node (full or partial)

Sometimes a physical disk failure, be it full or partial, can bring down the overall I/O performance of the cluster, so is there a way to simulate disk failure in Kraken?

Here, partial failure means a predictive failure or medium errors (a few sectors have gone bad) where the disk is still accessible by the kernel/filesystem/application.
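One possible approach (not currently in Kraken, sketched here under the assumption of root access on the node and a scratch /dev/sdb): the device-mapper "error" target can make a range of sectors fail while the rest of the device stays readable, which approximates a few bad sectors on an otherwise-live disk.

```shell
# Device-mapper table (format: start length target args): sectors 2048-4095
# of the mapped device return I/O errors, the rest pass through to /dev/sdb.
# Device name and sector ranges are illustrative.
table='0 2048 linear /dev/sdb 0
2048 2048 error
4096 204800 linear /dev/sdb 4096'
echo "$table"
# As root on the node: echo "$table" | dmsetup create bad-disk
```

A full failure could be simulated the same way with a single all-"error" table line covering the whole device.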

Create common_run bash script for all run files to use

We want to add a common run bash file at the base level with common functions that each run.sh script will use, to avoid duplication.

We can start with the following functions, but there could be more:

# Check if oc is installed
log "Checking if OpenShift client is installed"
if ! command -v oc &>/dev/null; then
  log "Looks like OpenShift client is not installed, please install before continuing"
  log "Exiting"
  exit 1
fi

# Check if kubectl is installed
log "Checking if kubernetes client is installed"
if ! command -v kubectl &>/dev/null; then
  log "Looks like Kubernetes client is not installed, please install before continuing"
  log "Exiting"
  exit 1
fi

# Check if the cluster is reachable and print the clusterversion under test
if ! kubectl get clusterversion; then
  log "Unable to connect to the cluster, please check if it's up and make sure the KUBECONFIG is set correctly"
  exit 1
fi
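The two near-identical client checks could also collapse into a single parameterized helper inside common_run.sh. A minimal sketch, with illustrative function names (the log format mirrors what the run.sh scripts print today):

```shell
# Bold, timestamped log line, matching the existing run.sh output style.
log() {
  echo -e "\033[1m$(date +%d-%m-%YT%H:%M:%S) $*\033[0m"
}

# check_binary <command> <friendly name>: fail if the command is missing.
check_binary() {
  log "Checking if $2 is installed"
  if ! command -v "$1" &>/dev/null; then
    log "Looks like $2 is not installed, please install before continuing"
    log "Exiting"
    exit 1
  fi
}

check_binary sh "POSIX shell"
```

Each run.sh would then just source common_run.sh and call check_binary oc "OpenShift client", check_binary kubectl "Kubernetes client", etc.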

Add vmware node scenario support in krkn-hub

Kraken now supports node scenarios for nodes/clusters in VMware; it would be nice to add this support to krkn-hub so node scenarios can be run through the krkn-hub wrapper as well.

Parameterize internal image names for node/mem/io hog scenarios to support disconnected environments

In the node/mem/io hog scenarios, the two images below are pulled internally by the parent image:

  • quay.io/arcalot/arcaflow-plugin-kubeconfig:0.2.0
  • quay.io/arcalot/arcaflow-plugin-stressng:0.3.1

In disconnected environments, these images will be pulled from a connected host and mirrored onto a local registry. The image names will have to be configurable so they can be pulled from the local mirror instead of Quay.
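One way this could look (the variable names are illustrative, not existing krkn-hub settings): default the image references from environment variables so a disconnected environment can point them at its local mirror at podman run time.

```shell
# Default to the Quay images unless the caller overrides them, e.g.
#   -e STRESSNG_PLUGIN_IMAGE=mirror.local/arcalot/arcaflow-plugin-stressng:0.3.1
# (variable names are illustrative)
export KUBECONFIG_PLUGIN_IMAGE="${KUBECONFIG_PLUGIN_IMAGE:-quay.io/arcalot/arcaflow-plugin-kubeconfig:0.2.0}"
export STRESSNG_PLUGIN_IMAGE="${STRESSNG_PLUGIN_IMAGE:-quay.io/arcalot/arcaflow-plugin-stressng:0.3.1}"
echo "stressng image: ${STRESSNG_PLUGIN_IMAGE}"
```

The workflow templates would then reference the variables instead of the hard-coded image strings.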

OCM/ACM chaos scenarios integration

Kraken now supports OCM/ACM chaos scenarios (krkn-chaos/krkn#370); we will need to get them into krkn-hub as well to be able to run them using podman without having to carry around or tweak config files - especially useful for the CI use case.

Node selector doesn't work for node memory hog scenario

While testing the node memory hog scenario, the container always gets created on a random node. On inspection, this is because the input.yaml template file has the selector hard-coded to none ({}).

https://github.com/redhat-chaos/krkn-hub/blob/main/node-memory-hog/input.yaml.template

The equivalent file in the CPU hog scenario assigns the selector from a variable and works as expected. The memory hog template has to be updated the same way so the memory stress lands on the intended node.
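A possible fix, mirroring how the CPU hog template wires in its selector (the variable name below is illustrative and would be substituted by envsubst during config setup):

```yaml
# node-memory-hog/input.yaml.template fragment sketch: take the selector
# from the environment instead of hard-coding {}
node_selector: ${NODE_SELECTOR}
```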
