scality / metalk8s

An opinionated Kubernetes distribution with a focus on long-term on-prem deployments

License: Apache License 2.0

Python 14.88% Shell 1.47% Dockerfile 0.34% XSLT 0.01% Gherkin 0.75% Smarty 1.62% SaltStack 59.49% HTML 0.59% Scheme 0.01% JavaScript 0.84% CSS 0.07% Go 2.98% TypeScript 14.96% Mustache 0.88% Jinja 0.69% Makefile 0.40% Nix 0.01% MDX 0.02%
kubernetes kubernetes-cluster kubernetes-deployment kubernetes-setup kubernetes-monitoring k8s k8s-cluster k8s-deployer cloud cloud-native

metalk8s's Introduction

MetalK8s logo

An opinionated Kubernetes distribution with a focus on long-term on-prem deployments

Integrating

MetalK8s offers a set of tools to deploy Kubernetes applications, given a set of standards for packaging such applications is respected.

For more information, please refer to the Integration Guidelines.

Building

Prerequisites are listed here.

To build a MetalK8s ISO, simply type ./doit.sh.

For more information, please refer to the Building Documentation.

Contributing

If you'd like to contribute, please review the Contributing Guidelines.

Testing

Requirements

Bootstrapping a local environment

# Install virtualbox guest addition plugin
vagrant plugin install vagrant-vbguest
# Bootstrap a platform on a vagrant environment using
./doit.sh vagrant_up

End-to-End Testing

To run the test-suite locally, first complete the bootstrap step as outlined above, then:

# Run tests with tox
tox -e tests

Documentation

Requirements

Building

To generate HTML documentation locally in docs/_build/html, run the following command:

# Generate doc with tox
tox -e docs

MetalK8s version 1 is still hosted in this repository but is no longer maintained. The last release is MetalK8s 1.3.

metalk8s's People

Contributors

alexandre-allard, alexis-ld, aprucolimartins, bert-e, carlito-scality, cathydossantospinto, chengyanjin, cmonfort, dependabot[bot], ebaneck, eg-ayoub, ezekiel-alexrod, gaudiauj, gdemonet, ghivert, hervedombya, jbertran, jbwatenbergscality, jeanmarcmilletscality, lucieleonard, monpote, nicolast, nootal, ognyan-kostadinov, sayf-eddine-scality, slaperche-scality, teddyandrieux, thomasdanan, wabernat, ycointe


metalk8s's Issues

Validate inter-node connectivity as a prerequisite

Early in the deploy (and related) playbooks we check that all nodes are reachable from the deployment host (ping). We should also figure out a way to validate that all nodes can talk to each other, e.g. to make sure no overly restrictive OpenStack security group is in place.
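
A minimal sketch of what such a check could look like as an Ansible play (module names are standard Ansible; the full-mesh loop over the play's hosts and the inventory path are illustrative):

# Write a small playbook that makes every node ping every other node
cat > check-mesh.yml <<'EOF'
- hosts: all
  gather_facts: true
  tasks:
    - name: verify every node can reach every other node
      command: ping -c 1 -W 2 {{ hostvars[item].ansible_default_ipv4.address }}
      with_items: "{{ ansible_play_hosts }}"
      changed_when: false
EOF
ansible-playbook -i inventory/hosts.ini check-mesh.yml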

Monitor `etcd`-only nodes

Due to node_exporter being deployed as a DaemonSet, any node which is not part of the Kubernetes cluster isn't monitored. This can be the case for etcd nodes when they're not colocated with kube-node or kube-master roles.

We should

  • Deploy node_exporter on them as well (cfr. #31)
  • Ensure the Prometheus deployed with kube-prometheus is set up to also scrape metrics from these servers
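
A minimal sketch of the additional scrape job this would need, in plain Prometheus scrape_config syntax (the target IPs are illustrative, and how the extra job gets injected into the kube-prometheus-managed Prometheus is still to be determined):

# Additional Prometheus scrape job for node_exporter on etcd-only nodes
cat > etcd-node-exporter-scrape.yml <<'EOF'
- job_name: etcd-node-exporter
  static_configs:
    - targets:
        - 10.0.0.10:9100
        - 10.0.0.11:9100
        - 10.0.0.12:9100
EOF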

Add platform checks

We need to check the following items after gathering the servers' Ansible facts:

  • kernel version is >= 3.10.0-693.el7
  • docker-ce version is >= 1.12 and < 17.06
  • /etc/docker/daemon.json contains
{
    "storage-driver": "overlay2",
    "storage-opts": [
        "overlay2.override_kernel_check=true"
    ]
}
  • /etc/sysctl.d/may_detach_mounts.conf exists with content
fs.may_detach_mounts=1

Then verify that you don't run into the moby/moby#34538 issue
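
A minimal sketch of how some of these checks could be scripted on a node (expected values taken from the list above; the jq-based inspection of daemon.json is just one way to do it):

# Kernel and Docker versions
uname -r                                         # expect >= 3.10.0-693.el7
docker version --format '{{ .Server.Version }}'  # expect >= 1.12 and < 17.06
# Storage driver configuration
jq '."storage-driver"' /etc/docker/daemon.json   # expect "overlay2"
# Mount detach sysctl
sysctl fs.may_detach_mounts                      # expect fs.may_detach_mounts = 1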

es-exporter metrics not captured

The Elasticsearch Prometheus metrics exposed by es-exporter are currently not being captured by Prometheus, due to restrictions in what's deployed with kube-prometheus. This is likely solved/relaxed by a newer version of kube-prometheus, though this remains to be validated and tested.

Implement an upgrade test in CI

There are at least two scenarios to test:

  • Upgrade from some 'old' baseline version (e.g. 0.1.0 and stick to it) to proposed HEAD
  • Upgrade from the PR target version/branch to proposed HEAD

Make `kube_elasticsearch` deployment optional

Add a variable, metal_k8s_enable_elasticsearch, which makes kube_elasticsearch optional (enabled by default). When it is disabled and then enabled in a later run of the playbook, the services should be deployed; when it is disabled again later, the services should be deleted (helm delete, kubectl delete, ...).
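
A minimal sketch of how the proposed variable could be toggled for a single run (the playbook path is illustrative; the variable name is the one proposed above):

# Disable the Elasticsearch stack for this run
ansible-playbook -i inventory/hosts.ini playbooks/services.yml \
  -e metal_k8s_enable_elasticsearch=false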

Ensure etcd logs are ingested in Elasticsearch

We must make sure etcd logs get ingested in Elasticsearch. This may be the case on servers that are part of the k8s-cluster group, where fluentd already accesses /var/log from the host (to be validated!), but not when etcd is deployed on a separate set of servers, where Kubernetes DaemonSet containers are not scheduled.

Use CoreDNS

Kubernetes 1.11 graduates the CoreDNS service to stable. Let's switch to it before MetalK8s 1.0.

Secure access to `kube-ops` services

Currently we deploy an Ingress object for various browser-based services in kube-ops. Anyone who has access to one of the kube-node servers (on port 80) can access these, which is obviously problematic from a security PoV. As an example, one can access container logs through Kibana.

I suggest we:

  • Remove these Ingress objects
  • Deploy two Role definitions in kube-ops: one which has access to metrics and alerts (Prometheus, Grafana and Alertmanager), another which has access to logs (Kibana) through the API proxy. I'm not sure what to call these; bikeshedding allowed.
  • Add cluster-service labels on the service object
  • Adjust e.g. Kibana's SERVER_BASE_PATH to match the API proxy URL
  • Document how to access the services: run kubectl proxy, then use the correct service URLs (see the sketch below)

All of this should just work when using kubectl with admin.conf, yet would also permit more fine-grained access control once we deploy some kind of SSO/authn solution.
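
A minimal sketch of the access pattern described above, going through the API proxy (the service name and port follow the kube-prometheus defaults visible elsewhere in this tracker and may differ):

# Start a local proxy to the API server, then hit the service proxy URL
kubectl proxy &
curl http://127.0.0.1:8001/api/v1/namespaces/kube-ops/services/kube-prometheus-grafana:http/proxy/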

Monitor all system volumes with node_exporter

Currently node_exporter doesn't monitor all volumes available on the system: as https://github.com/prometheus/node_exporter/blob/1f11a86d594173ca1146ac1d1715cd6263e9959d/README.md#using-docker mentions, we'd need to bind-mount all volumes into the node_exporter container, which is impractical, not least because we're not setting up the node_exporter DaemonSet ourselves...

It may be possible to work around this using MountPropagation and some other settings (but then see prometheus/node_exporter#672 and prometheus/node_exporter#660), which is far from ideal.

Maybe the best way forward would be to simply deploy node_exporter on all nodes, as it's intended to be deployed, and not let kube-prometheus (or prometheus-operator) manage it, just make sure the metrics are collected as intended.

Use IPVS for proxy

Kubernetes 1.11 graduates the IPVS backend for kube-proxy to stable. Let's switch to it before MetalK8s 1.0.
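
A minimal sketch, assuming the switch keeps going through Kubespray, whose kube_proxy_mode variable selects the kube-proxy backend (the inventory path is illustrative; IPVS also needs its kernel modules loaded on every node):

# Deploy with the IPVS proxier instead of iptables
ansible-playbook -i inventory/hosts.ini cluster.yml -e kube_proxy_mode=ipvs
# Make sure the IPVS kernel modules are available (CentOS 7 / kernel 3.10)
modprobe -a ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack_ipv4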

TLS Termination

(What's below is copied and slightly edited from an earlier e-mail thread)

In the Zenko requirements and delivery roadmap, there's some mention of TLS (among others for the Prometheus/Grafana dashboards).

In Kubernetes, TLS is often terminated by the ingress controller (in our deployments, likely Nginx). One can set annotations on an Ingress object to set up certificates, instruct the use of ACME/Let's Encrypt, ...

At the same time, there are solutions to manage certificates automatically, by declaring them as a resource, after which a controller makes sure keys are created and signed, certificates are deployed as namespace-local secrets, and so on, using ACME or some (potentially self-signed) CA.

I think it could be a cool plus for the demo, showing the versatility/flexibility and existing features added to K8s, to also show this off, e.g. using cert-manager (which can be deployed using Helm) and a self-signed CA, or ACME if we can get a floating IP and DNS set up somehow.
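
A minimal sketch of the cert-manager approach, using a self-signed issuer (resource names, the namespace and the hostname are illustrative; the API group matches the cert-manager releases current at the time of writing):

# Create a self-signed ClusterIssuer and a Certificate for a kube-ops service
kubectl apply -f - <<'EOF'
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: selfsigning-issuer
spec:
  selfSigned: {}
---
apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: kube-ops-tls
  namespace: kube-ops
spec:
  secretName: kube-ops-tls
  commonName: grafana.metalk8s.local
  issuerRef:
    name: selfsigning-issuer
    kind: ClusterIssuer
EOF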

`etcd` memory sizing

I just ran into an issue with my etcd cluster (running on separate VMs, 3 nodes) where the etcd processes were being killed by the kernel OOM killer because they hit the cgroup's memory limit, which was set at 512MB. The VMs have (supposedly) 4GB of memory available.

Turns out, due to https://github.com/kubernetes-incubator/kubespray/blob/595e96ebf125823c04a5a7f77e002e5a6affb9f2/roles/etcd/defaults/main.yml#L42, the memory limit is set to 512MB, because apparently Ansible decides the host has less than 4GB available (free -m says the same thing).

Bumping the memory limit to 1024MB (by setting etcd_memory_limit in my inventory's etcd group_vars) ensures etcd keeps running. Should we raise the default? Size this differently?...
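
A minimal sketch of the workaround (the inventory file path is illustrative; etcd_memory_limit is the variable from Kubespray's etcd role defaults linked above):

# Raise the etcd memory limit in the inventory's etcd group_vars
cat >> inventory/group_vars/etcd.yml <<'EOF'
etcd_memory_limit: "1024M"
EOF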

This also shows we should invest some time in proper monitoring of the etcd cluster...

Backup management

(What's below is copied and slightly edited from an earlier e-mail thread)

For on-prem K8s deployments, we should think about 'backup': which cluster information should be backed up? How do we achieve this?...

AFAIK all that's needed is to back up the content of the etcd cluster used as a datastore for the master nodes (not necessarily etcd-events), and the etcd data stored for the CNI system (likely Calico in our case). If I'm not mistaken this is stored in the very same etcd cluster as used by the kubeapi/master nodes.
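
A minimal sketch of what such a backup could look like with the etcd v3 API (the endpoint and certificate paths are illustrative and depend on how the etcd cluster was deployed):

# Take a snapshot of the etcd keyspace
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /etc/ssl/etcd/ssl/ca.pem \
  --cert /etc/ssl/etcd/ssl/node.pem \
  --key /etc/ssl/etcd/ssl/node-key.pem \
  snapshot save /var/backups/etcd-snapshot.db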

Review storage provisioning

Related to #1

We currently pre-provision a couple of LVM2 Logical Volumes of specific sizes and inject these as PersistentVolume resources. This is a somewhat inflexible approach.

Though we likely should keep using LVM2 as a base storage technology, using e.g. CSI (as suggested by @Zempashi) could make this somewhat more flexible. CSI is, however, not yet entirely stable.

There's an LVM2 CSI driver from Mesosphere we may want to use or contribute to (https://github.com/mesosphere/csilvm, https://mesosphere.com/blog/open-source-storage-ecosystem/). There's another plugin at https://github.com/wavezhang/k8s-csi-lvm which seems less polished.

Stabilize `helm init`

We've seen some cases where Helm fails to install in the cluster (helm init returns, but Tiller is not properly deployed).

The root cause is currently unknown, but we may want to deploy Tiller without using helm init directly: render the manifests (e.g. with a dry run), apply them with kubectl instead, and wait for Tiller to be running before continuing (TBD).
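
A minimal sketch of that alternative (assuming a Helm 2 client which supports helm init --output; the tiller service account is an assumption about how RBAC would be set up):

# Render the Tiller manifests instead of letting helm install them directly
helm init --service-account tiller --output yaml | kubectl apply -f -
# Wait for Tiller to be up before continuing
kubectl -n kube-system rollout status deployment/tiller-deploy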

Finalize integration of `ansible-hardening`

There's an old PR (#11) which integrates the OpenStack Ansible Hardening role to tighten system security according to the STIG rules.

It'd be good to finalize this work, and integrate e.g. version 17.0.4 of this project, using the current vendoring system, and come up with a sensible configuration.

The `services.yml` playbook fails due to some `Undefined` variable which can't be JSON-encoded

TASK [kube_prometheus : copy kube-prometheus values into temporary file] *********************************************************************************************************************************************************************
fatal: [metalk8s-master-01 -> 10.200.4.36]: FAILED! => {
    "changed": false,
    "msg": "AnsibleError: Unexpected templating type error occurred on (deployExporterNode: False\ngrafana:\n  extraVars:\n    - name: 'GF_SERVER_ROOT_URL'\n      value: '%(protocol)s://%(domain)s/api/v1/namespaces/kube-ops/services/kube-prometheus-grafana:http/proxy/'\n  service:\n    labels:\n      kubernetes.io/cluster-service: \"true\"\n      kubernetes.io/name: \"Grafana\"\n\nprometheus:\n  externalUrl: '/api/v1/namespaces/kube-ops/services/kube-prometheus:http/proxy/'\n  service:\n    labels:\n      kubernetes.io/cluster-service: \"true\"\n      kubernetes.io/name: \"Prometheus\"\n  replicaCount: 2\n{% if kube_prometheus_secret|default %}\n  secrets:\n{% for secret in kube_prometheus_secret %}\n  - {{ secret }}\n{% endfor %}\n{% endif %}\n  storageSpec:\n    volumeClaimTemplate:\n      spec:\n        accessModes: [\"ReadWriteOnce\"]\n        resources:\n          requests:\n            storage: {{ prometheus_storage_size }}\n\nalertmanager:\n  externalUrl: '/api/v1/namespaces/kube-ops/services/kube-prometheus-alertmanager:http/proxy/'\n  service:\n    labels:\n      kubernetes.io/cluster-service: \"true\"\n      kubernetes.io/name: \"Alertmanager\"\n  replicaCount: 2\n\nexporter-kube-etcd:\n  etcdPort: 2379\n  endpoints: {{ groups.etcd | map('extract', hostvars, ['ansible_default_ipv4', 'address'])|list|to_json }}\n  scheme: https\n  # Linked to the secret of prometheus\n{% if exporter_kube_etcd_certFile|default %}\n  certFile: {{ exporter_kube_etcd_certFile }}\n{% endif %}\n{% if exporter_kube_etcd_keyFile|default %}\n  keyFile: {{ exporter_kube_etcd_keyFile }}\n{% endif %}\n): Undefined is not JSON serializable"
}

NO MORE HOSTS LEFT ***************************************************************************************************************************************************************************************************************************

or less verbose:

Undefined is not JSON serializable
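
A hedged way to narrow this down, given that the template above references prometheus_storage_size without a default (the playbook path is illustrative): pass the value explicitly and check whether the error disappears.

# Re-run with the suspected undefined variable set explicitly
ansible-playbook -i inventory/hosts.ini playbooks/services.yml \
  -e prometheus_storage_size=10Gi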

Productize ElasticSearch deployment

The current ElasticSearch deployment is not production-ready:

  • It's not using stateful storage / deployed as a StatefulSet
  • JVM memory settings are low
  • Only one CPU assigned to services
  • No PodDisruptionBudgets deployed
  • Curator config is very basic
  • Fluentd config is very basic

Some of these are already handled by the vendored deployment files; others need to be changed manually.
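
A minimal sketch of the kind of values overrides this would involve (the chart path and key names are illustrative and depend on the vendored Elasticsearch chart):

# Override the defaults that are not production-ready
cat > elasticsearch-values.yml <<'EOF'
data:
  persistence:
    enabled: true
    size: 30Gi
  heapSize: 4g
  resources:
    requests:
      cpu: 2
EOF
helm upgrade --install elasticsearch ./charts/elasticsearch -f elasticsearch-values.yml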

See #27

Investigate using `fluent-bit` instead of `fluentd`

Using fluent-bit instead of fluentd could be useful since it's more tailored for Kubernetes environments, and supports e.g. parser definitions through Pod annotations (remember our discussion in SF, @Zempashi)

See e.g. the "Pods suggest a parser through a declarative annotation" section in https://www.linux.com/blog/event/kubecon/2018/4/fluent-bit-flexible-logging-kubernetes

It can also be configured to auto-detect JSON payloads in log messages and explode them.
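
A minimal sketch of the annotation-based parser selection fluent-bit supports through its fluentbit.io/parser Pod annotation (Pod name, image and parser are illustrative):

# Let a Pod declare which parser should be applied to its logs
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: apache-logs
  annotations:
    fluentbit.io/parser: apache
spec:
  containers:
    - name: apache
      image: httpd:2.4
EOF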

References

Design a 'framework' for automated tests

The current tests are fairly basic Bash scripts using bash_unit. This makes implementing more advanced tests (e.g. validating whether certain metrics are tracked) complex and reduces code reuse.

Also, defining high-level scenarios BDD/Cucumber-style is impossible.

We should decide which mechanism we want to use to implement various types of tests, including deployment scenarios (install, upgrade, re-deploy,...), various tests to run against a deployment (to validate its behaviour), and maybe 'unit'-testing of our Ansible roles (e.g. using https://github.com/metacloud/molecule).

Storage design

(What's below is copied and slightly edited from an earlier e-mail thread)

As discussed during the meeting earlier today, we'll need a story w.r.t. storage for the demo platform and, obviously, also for production deployments later on.

In general, when no kind of NAS which can be dynamically provisioned is available, and we're 'stuck' with local storage (on all servers comprising the cluster, or maybe only on some), I think there are 2.5 possible approaches:

  • Deploy a NAS with a dynamic volume provider ourselves
    • either host-based
    • or hyperconverged
  • Use local storage as it is: local-only

In the current architecture, our stateful services are all clustered, and don't need shared (NAS-style) storage in order for Pods comprising them to be schedulable on other nodes to continue operations. As such, a local storage solution should suit our needs, without the operational headache and overhead of a shared solution.

Next up: how to provision these volumes.

One possibility is to simply create empty directories under some folder, and deploy the local-storage provisioner. I'm not overly fond of this solution, because it doesn't allow for capacity isolation: a volume bound to an ElasticSearch Pod which is solely used for log ingestion could cause disruption of production services (other Pods) running on the same node. Less than desirable.

In 'real' on-prem deployments, we could aim for an architecture based on one-disk-per-volume, or a 'physical' partition per volume, prepared by some Ansible job (where we should really use by-UUID rules in fstab and associated mount-points such that losing fstab isn't lethal ;-)).

Alternatively, for both 'physical' as well as VM deployments (i.e. the demo platform), we could bundle 'similar' (SSD vs HDD etc.) disks into LVM VGs, then create LVs according to our needs, and use them as in the scenario above. @ballot-scality mentioned using thin provisioning in this case; I'm not sure, however, that's the right approach, since it'd require constant monitoring of the platform to ensure the thin pool doesn't run out of space...

In the actual-volumes cases, I think we can't really use the local-volume provisioner though: this provisioner uses statfs to figure out the 'real' size of a volume mounted under the path it monitors, and creates a PV accordingly. It is, however, difficult (if possible at all) to create a disk/LV and FS of exactly the desired size we pre-define in our Charts. As a result, a user would need to override the Chart values which define the desired PVC size request according to the specific cluster deployment, which is undesirable.

Instead, we should (at least, IMHO) pre-provision disks/LVs/FSs of the size we need (given the defaults in Chart values), then have an Ansible task POST the relevant PVs after the K8s cluster has been deployed, i.e. good old static provisioning.

There's some initial work to enhance the story around dynamic provisioning of local storage volumes, see kubernetes-retired/external-storage#651 and the pointers provided by the people who gave some input. These features, however, will only land in K8s 1.11 at the earliest (though we may want to contribute to the efforts; I have some concerns about the current design which I'll raise in the PR later), so this won't be of any use in the foreseeable future.

To summarize, my current proposal would be to:

  • Let a user list the disks he wants to attribute to the K8s cluster in the Ansible configuration, as some kind of dict (per-node):
my-vg:
  drives: ['/dev/vdb', '/dev/vdc']
  storageClassName: local-ssd
  provisionedVolumeSizes:
    - 50Gi
    - 5Gi
    - 10Gi

(up to @Zempashi , @alxf and @ballot-scality to tell me how this is properly done in Ansible ;-))

  • Create PVs, VGs and LVs accordingly
  • Create some FS on the LVs (TBD which)
  • Create /mnt/my-vg/$UUID for every volume to be provisioned, add to fstab, mount
  • Deploy K8s
  • POST an SC for every SC defined, also setting the right scheduling options
  • POST PVs for every volume provisioned, including the correct node affinity rules etc, of the defined size (i.e. not using the size as reported by the FS, which may be slightly smaller)

By default, we'd pre-create PVs for all stateful services and their PVCs we're aware of.
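
A minimal sketch of the kind of statically provisioned PV this would result in (name, size, path and node are illustrative; the nodeAffinity field for local PVs requires a sufficiently recent K8s release):

# Statically provisioned local PV, pinned to the node that owns the LV
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-node01-0
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/my-vg/volume-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node01
EOF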

We should check with TS people whether they feel comfortable using LVM2 for this purpose. I see, however, no reasonable way to achieve this otherwise.

CNI

(What's below is copied and slightly edited from an earlier e-mail thread)

The default CNI provider deployed by Kubespray is Calico, out of many possible CNI implementations. To be honest, I'm losing sight a bit of all the possible choices and of the differences/pros/cons between them.

Does anyone have an opinion on which CNI implementation to use, and why? Should Calico suffice?

Also, do we stick with iptables as the routing mechanism for kube-proxy, or do we go for IPVS (which is likely what we want to use, but are less familiar with)?

Lots of questions, few answers :) Looking for your experiences and insights!

Integrate an OIDC provider

Instead of provisioning a single user at deployment time, we should use K8s' OIDC authn support, through an OIDC provider we provision as part of MetalK8s, and which (initially) uses its own user base. Later on, we can integrate with other authn systems like LDAP, AD, ...
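
A minimal sketch, assuming the API server side keeps being driven through Kubespray's existing OIDC variables (the issuer URL and client ID are illustrative and depend on which provider, e.g. Dex or Keycloak, we end up shipping):

# Point the API server at the in-cluster OIDC provider
cat >> inventory/group_vars/k8s-cluster.yml <<'EOF'
kube_oidc_auth: true
kube_oidc_url: https://dex.kube-ops.svc.cluster.local:32000
kube_oidc_client_id: kubernetes
kube_oidc_username_claim: email
kube_oidc_groups_claim: groups
EOF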

Cfr. #83 #84
