sapcc / helm-charts

Helm charts for SAP Converged Cloud managing openstack on kubernetes

License: Apache License 2.0

Smarty 84.64% Shell 8.22% Go 0.13% Dockerfile 0.15% Mustache 4.07% Makefile 0.52% Open Policy Agent 0.68% Lua 0.62% HTML 0.73% Python 0.24%

openstack helm kubernetes

helm-charts's Introduction

SAP Converged Charts

This repository contains Helm charts required by SAP Converged Cloud.

Structure

Charts are grouped logically into:

common: Reusable charts
global: Singletons that only exist once in a global context
openstack: Openstack and dependent or related services
prometheus-exporters: A curated collection of Prometheus exporters
prometheus-rules: Prometheus alert- & aggregation rules
system: Infrastructure required by the control plane

This structure is just a logical grouping; it does not represent deployable units or imply other semantics.

Charts

On the second level we expect a chart. This can be a single chart or a meta-chart that describes a dependent set of components. Meta-charts contain sub-charts or reference charts from other repositories using Helm dependencies.

.
└── system
    ├── dns
    │   └── charts
    │       ├── bind
    │       └── unbound
    ├── kube-system
    │   └── charts
    │       ├── ingress
    │       └── dashboard
    └── prometheus
        └── charts
            ├── kube-state-metrics
            ├── prometheus-collector
            └── prometheus-frontend

We assume that the top-level chart will be deployed as a Helm release. In this example, releasing dns will install or update bind and unbound.
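
As a rough sketch of how a meta-chart might reference a chart from another repository, a Chart.yaml for the dns example could look like the following. The dependency name, the version numbers and the repository URL are placeholders, not values taken from this repository. Helm 2 charts would carry the same dependencies list in requirements.yaml instead.

```yaml
# system/dns/Chart.yaml (illustrative only; names, versions and URL are placeholders)
apiVersion: v2
name: dns
description: Meta-chart bundling the DNS components
version: 0.1.0
dependencies:
  # Sub-charts placed under ./charts (here: bind and unbound) are loaded by
  # Helm without being listed. Charts hosted elsewhere are declared here.
  - name: example-exporter                  # hypothetical external chart
    version: 1.0.0                          # placeholder version
    repository: https://charts.example.com  # placeholder repository
```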

In order to be able to relate charts to running Kubernetes pods, we also assume that a chart will be deployed in a namespace with the same name.

$ kubectl get pods --all-namespaces
NAMESPACE         NAME                                               READY     STATUS    RESTARTS   AGE
dns               bind1-2290429089-joidj                             2/2       Running   0          5d
dns               bind2-3590597799-1vcv0                             2/2       Running   0          5d
dns               unbound1-3007389427-shh2y                          1/1       Running   0          9d
dns               unbound1-3577488147-ld1rd                          1/1       Running   0          5d
kube-system       ingress-controller-d3snv                           1/1       Running   4          13d
kube-system       ingress-controller-j9bpf                           1/1       Running   2          18d

This has the benefits that:

Values required for releasing a chart can be found at the same place in cc/regions.
Cleanup of a failed release is as easy as deleting the namespace.
For testing, a chart can be deployed in a separate testing namespace.
Pods and other Kubernetes primitives are reflected at a known place in Kubernetes.

Test a Chart

Opening a PR to this repository triggers the Helm chart tests which are described in detail here.

Install/Update of a Chart/Release

Per convention we use the name of the meta-chart as namespace and name of the release. Values are pulled in from a secret repository.

helm upgrade dns ./system/dns --namespace dns --values ../secrets/staging/system/dns.yaml --install

helm-charts's Issues

remove static registry reference

remove https://github.com/sapcc/helm-charts/search?utf8=%E2%9C%93&q=hub.global from public chart values and replace with private values to be passed during release installation

[nannies] Replace Global Master Password

https://github.com/sapcc/helm-charts/tree/master/openstack/nannies

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4
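
One way this could look in a chart template is sketched below, using Helm's `required` function so that rendering fails when no password is passed. The value name `.Values.password` and the Secret layout are assumptions for illustration, not the convention used in the prior-art commit.

```yaml
# templates/secret.yaml (sketch; .Values.password is a hypothetical value name)
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-credentials
type: Opaque
stringData:
  # `required` aborts rendering with this message if the value is unset or empty,
  # so the chart no longer falls back to a global master password.
  password: {{ required "A password must be passed to this chart via .Values.password" .Values.password | quote }}
```

The value would then be supplied at release time, for example through the secret values file passed with --values.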

Guideline for deploying Swift

Hi team.
I have been operating openstack on a k8s cluster deployed by the Openstack-helm project (https://opendev.org/openstack/openstack-helm).
This project includes several main services of openstack, and all are packaged as helm charts.
Unfortunately, it does not provide swift yet. So I am trying to build a swift container image and package the helm chart. In Openstack-helm, all container images are created by the loci project (https://opendev.org/openstack/loci), but I don't think I need to build it that way.
I am interested in your containerization method and want to discuss the details. I am also willing to contribute.

Regards.

Originally posted by @QuesadaMarvin in #2386 (comment)

Move kubernetes-entrypoint to init-container

Personally, I like the approach of openstack/openstack-helm, which moves the kubernetes-entrypoint to an init container.

This way, we do not have to add the executable to the image, and we can use a minimal kubernetes-entrypoint container to fulfil the same purpose.
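
A minimal sketch of that layout, assuming placeholder image names and a hypothetical dependency on a service called mariadb. The environment variable names follow kubernetes-entrypoint's conventions as understood here and should be checked against its README.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-api
spec:
  initContainers:
    # Blocks pod start-up until the listed dependencies are available.
    - name: wait-for-dependencies
      image: registry.example.com/kubernetes-entrypoint:latest  # placeholder image
      command: ["kubernetes-entrypoint"]
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: DEPENDENCY_SERVICE
          value: "mariadb"  # hypothetical service to wait for
  containers:
    # The application image no longer needs to ship the entrypoint binary.
    - name: api
      image: registry.example.com/example-api:latest  # placeholder image
```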

Prometheus relabeling should decide which snmp-exporter module to use

For the baremetal metrics the service discovery file gives that information via __param_module, but it shouldn’t.
Instead, Prometheus should make that decision during relabeling, based on information the service discovery provides via labels.
The reasoning is that the service discovery should be reusable for other setups and for scenarios that have nothing to do with the snmp-exporter.
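
A sketch of what such a relabeling could look like. The label name `hardware_type`, the module name `example_module`, the file path and the exporter address are all assumptions made for illustration.

```yaml
scrape_configs:
  - job_name: snmp-baremetal
    metrics_path: /snmp
    file_sd_configs:
      - files:
          - /etc/prometheus/sd/baremetal.json  # placeholder path
    relabel_configs:
      # Choose the snmp-exporter module from a plain label set by the
      # service discovery, instead of the discovery setting __param_module.
      - source_labels: [hardware_type]         # hypothetical discovery label
        regex: switch
        target_label: __param_module
        replacement: example_module            # hypothetical module name
      # Usual snmp-exporter pattern: the discovered address becomes the
      # target parameter, and the scrape goes to the exporter itself.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter.example.svc:9116  # placeholder exporter address
```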

[digicert-issuer]: Migrate CustomResourceDefinitions to v1

For the upcoming k8s upgrade to 1.22 we need to migrate CustomResourceDefinitions to v1.

For some more details please check the 1.22 Deprecation Guide

There was also a post in the kubernetes blog about upcoming changes in 1.22:

Migrate to use the CustomResourceDefinition apiextensions.k8s.io/v1 API, available since v1.16.
You can use the v1 API to retrieve or update existing objects, even if they were created using an older API version. If you defined any custom resources in your cluster, those are still served after you upgrade.
If you're using external CustomResourceDefinitions, you can use kubectl convert to translate existing manifests to use the newer API. Because there are some functional differences between beta and stable CustomResourceDefinitions, our advice is to test out each one to make sure it works how you expect after the upgrade.

In the case of external resources it might be best to pull in an updated version of the crds from upstream.
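
For orientation, a minimal apiextensions.k8s.io/v1 definition might look like the sketch below. The group, kind and version names are made up. The main functional difference from v1beta1 shown here is that every served version carries its own structural schema.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com  # must be <plural>.<group>; names are hypothetical
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1alpha1
      served: true
      storage: true
      # In v1 a schema is required for each version.
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              # Keeps unknown fields instead of pruning them; tighten this
              # once the real fields are described.
              x-kubernetes-preserve-unknown-fields: true
```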

Happy hacking!

how can this pattern be used to deploy multiple instances of same chart

I want 4 instances of the mysql chart for separate applications. How can I achieve this using the SAP Converged pattern?

In the Deis Workflow method, I can launch 4 instances of a chart in requirements.yaml using alias. Is there a similar way available here?
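
For reference, the alias mechanism mentioned above is a standard Helm dependency feature, so a meta-chart could in principle use it as sketched here. The alias names, version and repository URL are placeholders.

```yaml
# requirements.yaml (Helm 2) or the dependencies section of Chart.yaml (Helm 3)
dependencies:
  - name: mysql
    version: 1.0.0                          # placeholder version
    repository: https://charts.example.com  # placeholder repository
    alias: billing-db                       # hypothetical instance name
  - name: mysql
    version: 1.0.0
    repository: https://charts.example.com
    alias: orders-db                        # hypothetical instance name
```

Each instance is then configured under its alias as a top-level key in the parent chart's values.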

Bugs in kube-monitoring

Remove AlertManager Releases/Trash from Regions. Should only be a single global instance.
Node Exporter terminates as “Completed”. Needs to be restarted.
Grouping for PodRestart Alerts doesn’t work properly. Too much spam during resolves. Remove or Fix.

[ucs-exporter] Replace Global Master Password

https://github.com/sapcc/helm-charts/tree/master/prometheus-exporters/ucs-exporter

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4

Proactively Detect Kubelet Problems - Go-Routine Leaks

Recently we have been seeing some instances of "unresponsive" Kubelets. The symptoms are Pods being scheduled but not starting and related problems. Our current alerting doesn't detect this. This is because the Kubelet is actually still running and responsive.

My suspicion is that it gets stuck in some endless retry loop or the like. A restart fixes the problem. Pending finding out the actual root cause and fixing the bug, we need to have an alert so we can proactively fix the problem.

One possible way to detect this would be to find abnormal spikes in the number of Go Routines the kubelet is creating.

This query shows a recent incident.
https://prometheus.staging.cloud.sap/graph?g0.range_input=1w&g0.expr=go_goroutines%7Bjob%3D%22kube-system%2Fkubelet%22%7D&g0.tab=0

Normal on eu-de-1:

Abnormal on staging:

Implement an alert when GoRoutines are spiking.
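
A possible starting point, written in the Prometheus 2.x rule file format. The alert name, the threshold of 5000 and the 15m duration are placeholders that would need tuning against the normal and abnormal graphs. Only the metric and job selector come from the query above.

```yaml
groups:
  - name: kubelet.alerts
    rules:
      - alert: KubeletGoroutinesHigh  # hypothetical alert name
        # Placeholder threshold; pick a value well above the normal baseline.
        expr: go_goroutines{job="kube-system/kubelet"} > 5000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Kubelet on {{ $labels.instance }} has an unusually high number of goroutines"
```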

[cert-manager-crds-scaleout]: Migrate CustomResourceDefinitions to v1

For the upcoming k8s upgrade to 1.22 we need to migrate CustomResourceDefinitions to v1.

For some more details please check the 1.22 Deprecation Guide

There was also a post in the kubernetes blog about upcoming changes in 1.22:

Migrate to use the CustomResourceDefinition apiextensions.k8s.io/v1 API, available since v1.16.
You can use the v1 API to retrieve or update existing objects, even if they were created using an older API version. If you defined any custom resources in your cluster, those are still served after you upgrade.
If you're using external CustomResourceDefinitions, you can use kubectl convert to translate existing manifests to use the newer API. Because there are some functional differences between beta and stable CustomResourceDefinitions, our advice is to test out each one to make sure it works how you expect after the upgrade.

In the case of external resources it might be best to pull in an updated version of the crds from upstream.

Happy hacking!

[openstack-exporter] Replace Global Master Password

https://github.com/sapcc/helm-charts/tree/master/prometheus-exporters/openstack-exporter

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4

[vertical-pod-autoscaler]: Migrate CustomResourceDefinitions to v1

For the upcoming k8s upgrade to 1.22 we need to migrate CustomResourceDefinitions to v1.

For some more details please check the 1.22 Deprecation Guide

There was also a post in the kubernetes blog about upcoming changes in 1.22:

Migrate to use the CustomResourceDefinition apiextensions.k8s.io/v1 API, available since v1.16.
You can use the v1 API to retrieve or update existing objects, even if they were created using an older API version. If you defined any custom resources in your cluster, those are still served after you upgrade.
If you're using external CustomResourceDefinitions, you can use kubectl convert to translate existing manifests to use the newer API. Because there are some functional differences between beta and stable CustomResourceDefinitions, our advice is to test out each one to make sure it works how you expect after the upgrade.

In the case of external resources it might be best to pull in an updated version of the crds from upstream.

Happy hacking!

[neutron/sftp] Replace Global Master Password

https://github.com/sapcc/helm-charts/blob/master/openstack/neutron/templates/sftp-deployment.yaml

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4

[prometheus-crds]: Migrate CustomResourceDefinitions to v1

For the upcoming k8s upgrade to 1.22 we need to migrate CustomResourceDefinitions to v1.

For some more details please check the 1.22 Deprecation Guide

There was also a post in the kubernetes blog about upcoming changes in 1.22:

Migrate to use the CustomResourceDefinition apiextensions.k8s.io/v1 API, available since v1.16.
You can use the v1 API to retrieve or update existing objects, even if they were created using an older API version. If you defined any custom resources in your cluster, those are still served after you upgrade.
If you're using external CustomResourceDefinitions, you can use kubectl convert to translate existing manifests to use the newer API. Because there are some functional differences between beta and stable CustomResourceDefinitions, our advice is to test out each one to make sure it works how you expect after the upgrade.

Happy hacking!

Add klog_pod_oomkill metrics

Make the klog_pod_oomkill metric provided by the oomkill-exporter available in all Prometheus instances (similar to the kube_* metrics). This would allow pod monitoring and alerting to be implemented in the dedicated Prometheus.
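
If federation is the mechanism used for this (an assumption; the issue does not say how the kube_* metrics are shared), the dedicated Prometheus could pull the metric with a scrape job along these lines. The job name and target address are placeholders.

```yaml
scrape_configs:
  - job_name: federate-oomkill  # hypothetical job name
    honor_labels: true          # keep the original pod and namespace labels
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__="klog_pod_oomkill"}'
    static_configs:
      - targets:
          - prometheus-collector.example.svc:9090  # placeholder source Prometheus
```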

[cinder] Clean out old agents

The migration strategy negotiates the oldest version over all agents.
Having old agents from prior deployments breaks that, as they never renegotiate their version and will never update it.

One way to proceed would be by time:
delete from services where updated_at is null or updated_at < now() - interval '15 minutes';

Update Node Exporter Specs for Diskstats

Implement this change in the node-exporter daemonset. Maybe the disk stats will be more usable then...

prometheus/node_exporter@21173e2

Bonus Points: Remove the /host prefix in an aggregation rule.

[utils/mysql] Replace Global Master Password

https://github.com/sapcc/helm-charts/blob/master/openstack/utils/templates/_hosts.tpl#L130

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4
Take note that this utility function is being used in many other charts:

utils.password_for_fixed_user_mysql
- utils.password_for_user_mysql
  - utils.root_password
    - db_url_mysql
      - https://github.com/sapcc/helm-charts/blob/master/openstack/nova/templates/bin/_db-update-cells.tpl
      - https://github.com/sapcc/helm-charts/blob/master/openstack/barbican/templates/etc/_barbican.conf.tpl
      - https://github.com/sapcc/helm-charts/blob/master/openstack/octavia/templates/etc/_octavia.conf.tpl
      - https://github.com/sapcc/helm-charts/blob/master/openstack/neutron/templates/etc/_neutron.conf.tpl
      - https://github.com/sapcc/helm-charts/blob/master/openstack/designate/templates/etc/_designate.conf.tpl

Alert KubernetesApiServerLatency Bugged

Since the upgrade to the latest Prometheus this Alert is bugged. It reports the wrong metric and constantly alerts with false data.

prometheus-operator/prometheus-operator#343
kubernetes/kubernetes#44329

I propose we exclude LIST as well.

[swift] get rid of endpoint override patch

sapcc/keystonemiddleware@f4bf856 seems to bring a better option to force the use of the public keystone interface for token validations than the patch at https://github.com/sapcc/swift/blob/stable/rocky-m3/docker/keystonemiddleware-token-validation-interface.patch

[Prometheus] Info Inhibitors

Currently severity=critical inhibits severity=warning when the same context is set for alerts.

Add an additional inhibition rule: severity=critical|warning suppresses severity=info
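
In Alertmanager configuration this could be expressed roughly as follows, assuming `context` is the label used to decide that two alerts belong together.

```yaml
inhibit_rules:
  # Existing behaviour: critical inhibits warning within the same context.
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['context']  # assumed label name
  # Proposed addition: critical or warning inhibits info.
  - source_match_re:
      severity: critical|warning
    target_match:
      severity: info
    equal: ['context']
```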

[utils/identity] Replace Global Master Password

https://github.com/sapcc/helm-charts/blob/master/openstack/utils/templates/_hosts.tpl#L136

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4

Take note that this utility function is being used in many other charts:

utils.password_for_fixed_user_and_host
- identity.password_for_user
- svc.password_for_user_and_service
  - oslo_messaging_rabbit_url (unused)

Coordinated deployments might be required. :/

Implement shared mount propagation in swift containers

Blocked by pending upstream PR kubernetes/kubernetes#41683

Once merged:

pass /srv/node into storage service containers as rshared or rslave mount
remove manual umount propagation from swift-drive-autopilot and storage containers
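
In later Kubernetes releases this capability is exposed as the `mountPropagation` field on a volume mount. A sketch of the first step under that assumption, with placeholder pod and image names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: swift-storage-example  # placeholder name
spec:
  containers:
    - name: object-server
      image: registry.example.com/swift:latest  # placeholder image
      volumeMounts:
        - name: srv-node
          mountPath: /srv/node
          # HostToContainer corresponds to rslave: mounts made on the host
          # below /srv/node become visible inside the container.
          # Bidirectional corresponds to rshared and requires a privileged container.
          mountPropagation: HostToContainer
  volumes:
    - name: srv-node
      hostPath:
        path: /srv/node
```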

Increase Severity for NodeNotReady Alerts

The NodeNotReady alert needs to be treated with urgency.

It is indicative of the node being stuck with a hanging kernel. This leads to problems if locks are still being held for persistent applications. Upon rescheduling, those applications will not recover and will stay in CrashLoopBackOff until manual intervention, which could lead to a severe outage if critical databases, like Keystone, are affected.

Increase the severity of the alert to critical. Update Playbook.

[rabbitmq] Replace Global Master Password

https://github.com/sapcc/helm-charts/tree/master/common/rabbitmq

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4

Grafana Improvements kube-monitoring

Add Node Uptime to Health Dashboards. Need to see when last restart happened
Shorten OS Label in Health Dashboard

MetalIronicMetricsDown: set severity to info if maintenance true

MetalIronicMetricsDown should not be a warning if maintenance has been enabled

helm-charts/system/kube-monitoring/charts/prometheus-frontend/metal-ironic.alerts

Line 46 in 215a7fa

- alert: MetalIronicMetricsDown

Tuning of kube-monitoring

PodRestart alerts are too spammy as Warnings. Set to INFO Level
Critical Alerts for when regional Prometheus are down are too aggressive. Relax timeframe
Send Warnings to regional channels
Critical Alerts should also send resolved notifications
Docker Hang Alert is not sensitive enough

Bash Script vs Executable

Instead of executing the script with an explicit call to bash, I would suggest mounting all container-init scripts with executable mode and calling the script directly.

This way, we encapsulate the execution in the script and can choose another interpreter (e.g. dumb-init bash) if required.
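
A sketch of that suggestion, assuming the scripts live in a ConfigMap. The names, the image and the script path are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-service  # placeholder name
spec:
  containers:
    - name: service
      image: registry.example.com/example-service:latest  # placeholder image
      # Called directly; the interpreter is chosen by the script's shebang line.
      command: ["/container.init/start"]
      volumeMounts:
        - name: container-init
          mountPath: /container.init
  volumes:
    - name: container-init
      configMap:
        name: example-service-bin  # hypothetical ConfigMap holding the scripts
        defaultMode: 0755          # mount the scripts with the executable bit set
```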

fix statsd_exporter handling of sampled timer metrics

Upstream issue: prometheus/statsd_exporter#57 (observed with swift-object-server)

implement fix
submit PR
when accepted, build a new statsd_exporter image that includes this fix

[pgmetrics] Replace Global Master Password

https://github.com/sapcc/helm-charts/tree/master/common/pgmetrics

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4

Swift-health-statsd as proxy side-car

In order to check md5 sums for rings and config, each proxy needs its own running swift-health-statsd (see the discussion here: sapcc/swift-health-statsd#2).

Not able to deploy on the local k8s cluster

Hi team. I am trying to deploy openstack services (swift, for the first time) on my local k8s cluster. When I run helm install, it returns the following error message.

Error: execution error at (swift/templates/workers-daemonset.yaml:33:54): This release should be installed by the deployment pipeline!

It seems likely the values.yaml file includes some hardcoded constants which will be replaced in CICD pipeline.
How can I use these helm charts? Where are the container images for them?

[cert-manager-crds]: Migrate CustomResourceDefinitions to v1

For the upcoming k8s upgrade to 1.22 we need to migrate CustomResourceDefinitions to v1.

For some more details please check the 1.22 Deprecation Guide

There was also a post in the kubernetes blog about upcoming changes in 1.22:

Migrate to use the CustomResourceDefinition apiextensions.k8s.io/v1 API, available since v1.16.
You can use the v1 API to retrieve or update existing objects, even if they were created using an older API version. If you defined any custom resources in your cluster, those are still served after you upgrade.
If you're using external CustomResourceDefinitions, you can use kubectl convert to translate existing manifests to use the newer API. Because there are some functional differences between beta and stable CustomResourceDefinitions, our advice is to test out each one to make sure it works how you expect after the upgrade.

In the case of external resources it might be best to pull in an updated version of the crds from upstream.

Happy hacking!

[velero]: Migrate CustomResourceDefinitions to v1

For the upcoming k8s upgrade to 1.22 we need to migrate CustomResourceDefinitions to v1.

For some more details please check the 1.22 Deprecation Guide

There was also a post in the kubernetes blog about upcoming changes in 1.22:

Migrate to use the CustomResourceDefinition apiextensions.k8s.io/v1 API, available since v1.16.
You can use the v1 API to retrieve or update existing objects, even if they were created using an older API version. If you defined any custom resources in your cluster, those are still served after you upgrade.
If you're using external CustomResourceDefinitions, you can use kubectl convert to translate existing manifests to use the newer API. Because there are some functional differences between beta and stable CustomResourceDefinitions, our advice is to test out each one to make sure it works how you expect after the upgrade.

In the case of external resources it might be best to pull in an updated version of the crds from upstream.

Happy hacking!

[disco]: Migrate CustomResourceDefinitions to v1

For the upcoming k8s upgrade to 1.22 we need to migrate CustomResourceDefinitions to v1.

For some more details please check the 1.22 Deprecation Guide

There was also a post in the kubernetes blog about upcoming changes in 1.22:

Migrate to use the CustomResourceDefinition apiextensions.k8s.io/v1 API, available since v1.16.
You can use the v1 API to retrieve or update existing objects, even if they were created using an older API version. If you defined any custom resources in your cluster, those are still served after you upgrade.
If you're using external CustomResourceDefinitions, you can use kubectl convert to translate existing manifests to use the newer API. Because there are some functional differences between beta and stable CustomResourceDefinitions, our advice is to test out each one to make sure it works how you expect after the upgrade.

In the case of external resources it might be best to pull in an updated version of the crds from upstream.

Happy hacking!

prometheus-global retention time

$ grep retention global/prometheus/values.yaml
2:retention: 168h0m0s

@BugRoger @auhlig I recall this being a lot more (something like 90 days or so). Was there a copy-paste error, or did we scale it down over storage space concerns?

Region Label Missing

For some alerts the region label seems to be missing:

This leads to odd effects in routing, grouping etc... And the rendering of the alerts looks bugged 😄

I think this might be because some alert queries remove all labels. In Prometheus speak it ends up with something like absent(up{job="kube-scheduler"})={}. Somehow these rules don't apply then:

https://github.com/sapcc/helm-charts/blob/master/system/kube-monitoring/charts/prometheus-frontend/templates/config.yaml#L91-L98

According to the documentation this should add a label though, so not sure what's going on.

[jaeger-operator]: Migrate CustomResourceDefinitions to v1

For the upcoming k8s upgrade to 1.22 we need to migrate CustomResourceDefinitions to v1.

For some more details please check the 1.22 Deprecation Guide

There was also a post in the kubernetes blog about upcoming changes in 1.22:

Migrate to use the CustomResourceDefinition apiextensions.k8s.io/v1 API, available since v1.16.
You can use the v1 API to retrieve or update existing objects, even if they were created using an older API version. If you defined any custom resources in your cluster, those are still served after you upgrade.
If you're using external CustomResourceDefinitions, you can use kubectl convert to translate existing manifests to use the newer API. Because there are some functional differences between beta and stable CustomResourceDefinitions, our advice is to test out each one to make sure it works how you expect after the upgrade.

In the case of external resources it might be best to pull in an updated version of the crds from upstream.

Happy hacking!

remove host references from config

remove https://github.com/sapcc/helm-charts/blob/master/global/prometheus/templates/config.yaml#L40
need to go into secret values.

Postgres maintenance scripts are not executed

We are using the vanilla postgres containers, and they do not execute the postgres.dbMaintain scripts.

[Prometheus] Increase of scrape duration

The number of open FDs of Prometheus in staging is significantly higher than in the other regions. Highest in production found in eu-de-1. As we only see an increase in scrape duration in staging we may want to try -storage.local.series-file-shrink-ratio={0.3,..,0.5} there to reduce consumed disk throughput as suggested in the other thread @BugRoger.

Use clearer prometheus-maia sd ns restriction

Reminder to self to use prometheus/prometheus#2642 for maia prometheus in the future.

sapcc helm repo cannot be added because of certificate signed by unknown authority

Somehow I cannot add the helm repo to my helm list. It says that the certificate is not verified.

$ helm repo add sapcc https://charts.global.cloud.sap
Error: looks like "https://charts.global.cloud.sap" is not a valid chart repository or cannot be reached: Get "https://charts.global.cloud.sap/index.yaml": x509: certificate signed by unknown authority

I tried to skip TLS verify but it says it's "AuthorizedOnly"

$ helm repo add --insecure-skip-tls-verify sapcc https://charts.global.cloud.sap
Error: looks like "https://charts.global.cloud.sap" is not a valid chart repository or cannot be reached: failed to fetch https://charts.global.cloud.sap/index.yaml : 403 AuthorizedOnly

What can I do to add the helm repo?

[mysql_metrics] Replace Global Master Password

https://github.com/sapcc/helm-charts/tree/master/common/mysql_metrics

Remove all references to Values.global.master_password
Prefer to not reference a global value. Instead require a password being passed to the chart

Prior Art:
a4c48f4

Mounting of Config-Files

I would suggest, that instead of copying the config-files in scripts, we rather mount them into a subpath, like it is done in openstack/openstack-helm.
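
A sketch of what mounting a single config file via subPath could look like. The ConfigMap name, file name, paths and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-service  # placeholder name
spec:
  containers:
    - name: service
      image: registry.example.com/example-service:latest  # placeholder image
      volumeMounts:
        # Mounts only this one key as a file, leaving the rest of
        # /etc/example from the image untouched.
        - name: etc
          mountPath: /etc/example/example.conf
          subPath: example.conf
          readOnly: true
  volumes:
    - name: etc
      configMap:
        name: example-service-etc  # hypothetical ConfigMap with the config file
```

One trade-off to keep in mind is that files mounted through subPath are not refreshed when the ConfigMap changes, so the pod has to be restarted to pick up a new config.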
