berops / claudie
Cloud-agnostic managed Kubernetes
Home Page: https://docs.claudie.io/
License: Apache License 2.0
Deployments remain unchanged after a successful CI pipeline.
Log from GitHub Actions:
Run kustomize build | kubectl apply -f -
configmap/urls-5mdg2m4f2g unchanged
service/context-box unchanged
service/kube-eleven unchanged
service/mongodb unchanged
service/terraformer unchanged
service/wireguardian unchanged
deployment.apps/builder unchanged
deployment.apps/context-box unchanged
deployment.apps/kube-eleven unchanged
deployment.apps/mongodb unchanged
deployment.apps/scheduler unchanged
deployment.apps/terraformer unchanged
deployment.apps/wireguardian unchanged
After a successful CI run, the CD pipeline should update all deployments.
According to the job definition everything should work fine, so I have no idea where the problem could be.
I don't know if it's related to this issue, but right now CD is triggered after every successful CI run. Wouldn't it be better to trigger it after the merge to master? (Please check the definition.)
After trying to run the services in Docker, I noticed that the image sizes were too large for some of the services. After a quick glance at the respective Dockerfiles, I saw that some of them ship the image used to build the service, while a couple of them use a scratch image with the build artifacts copied from the builder image. I am not sure why some of the Dockerfiles are not following this practice.
I think we can discuss whether to take this task ahead, keeping the priority of other tasks in mind.
Motivation:
We want to test the basic functionality of the platform before running the end-to-end tests
Description:
Run unit tests in the CI pipeline to decide whether the platform is ready to be deployed and tested by the end-to-end tests
Exit Criteria:
Motivation
All our networking setups should be protected by a network firewall.
Description
Currently the Hetzner Cloud provider doesn't specify any firewalls. Analyze which networking ports should be open and close all the remaining ones. Feel free to take inspiration from the GCP provider.
Exit Criteria
We should adhere to the same coding style so that reading somebody else's code won't be a pain.
We should choose, configure and integrate a code linter to ensure that the code is not a mix of various coding styles.
When running more complex E2E tests, I noticed that Claudie has a problem when we are adding a master node and deleting worker nodes at the same time. The Builder throws the error Error while draining node testset-cluster-name1-1s2yvwn-hetzner-compute-hetzner-4 : exit status 1
The worker node should be deleted.
testing-framework
This bug was spotted in PR #110, but after running the same test set on the master branch (before the merge), the error was there as well.
The Terraformer, Wireguardian and KubeEleven services should work on clusters in separate threads, dramatically decreasing work time.
Right now, when one of the services receives a config, it does its work cluster by cluster. What we are aiming for is a parallel workflow for each cluster in the config.
Each service should wait for the completion of all clusters before it sends the config back to Builder.
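A minimal sketch of how that per-cluster parallelism could look inside one of the services, assuming a hypothetical processCluster helper that handles a single cluster; golang.org/x/sync/errgroup is used purely as an illustration:

// Hypothetical sketch: process every cluster of the received config concurrently
// and wait for all of them before handing the config back to Builder.
package worker

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Cluster and Config stand in for the real protobuf types.
type Cluster struct{ Name string }
type Config struct{ Clusters []*Cluster }

// processCluster is a hypothetical per-cluster worker (the Terraformer/Wireguardian/KubeEleven work).
func processCluster(c *Cluster) error {
	fmt.Println("processing cluster", c.Name)
	return nil
}

// BuildConfig fans out one goroutine per cluster and waits for all of them to finish.
func BuildConfig(cfg *Config) error {
	var g errgroup.Group
	for _, cluster := range cfg.Clusters {
		cluster := cluster // capture the loop variable for the goroutine
		g.Go(func() error { return processCluster(cluster) })
	}
	// Wait blocks until every cluster is done and returns the first error, if any.
	return g.Wait()
}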
The logging solution present in the Golang standard library (the 'log' package) is missing features like structured logging, log levels etc. Using a proper Go logging library would be beneficial.
The standard library log package will no longer be used for logging. The code will use a dedicated logging package.
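As an illustration only (the issue does not prescribe a particular library), here is a minimal sketch using zerolog, which provides structured logging and log levels; the GOLANG_LOG variable name is an assumption:

// Hypothetical sketch: structured, leveled logging with zerolog instead of the stdlib log package.
package main

import (
	"os"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
)

func main() {
	// The level is switchable via a single environment variable (the name is an assumption).
	level := zerolog.InfoLevel
	if os.Getenv("GOLANG_LOG") == "debug" {
		level = zerolog.DebugLevel
	}
	zerolog.SetGlobalLevel(level)
	log.Logger = zerolog.New(os.Stderr).With().Timestamp().Logger()

	log.Info().Str("service", "builder").Msg("service started")
	log.Debug().Msg("only visible when the debug level is enabled")
}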
Child tasks from #44.
We need to find a feasible LB setup for multi-cloud and hybrid-cloud deployments. In order to be able to start the implementation of the LBs in the platform, we want to run a POC of the architectural setup.
This task is about figuring out a way to deploy LBs for the K8s API and Ingress controller(s), then running a POC of such a setup and basic tests of how it behaves. Once we find a working mode, we should assess whether the LB architecture will work for hybrid-cloud setups as well.
While deleting a master node, an error from KubeOne occurs in the KubeEleven module. Deleting a worker node works fine.
kubeOne apply -f manifest.yaml
INFO[11:13:56 -05] Electing cluster leader...
ERRO[11:13:57 -05] Failed to elect leader.
ERRO[11:13:57 -05] Quorum is mostly like lost, manual cluster repair might be needed.
ERRO[11:13:57 -05] Consider the KubeOne documentation for further steps.
WARN[11:13:57 -05] Task failed, error was: leader not elected, quorum mostly like lost
KubeOne apply should pass without any error
Motivation
In order to run the platform in a highly-available mode, we need to make sure that the platform can run in a 2+ replica mode for all the microservices (this way, if a node with replica A is down, the service is served from another node hosting replica B). If the situation allows, the scheduler typically doesn't deploy replicas of the same deployment onto the same node.
Description
Analyze which workloads can be deployed in a 2-replica mode right now (e.g. scheduler, builder, ...) and apply the manifests.
Exit Criteria
Motivation
At the moment, we're calling Terraform via Shell, but there exists a Golang-native API for Terraform.
Using that instead could be cleaner.
Description
Refactor all usages of Terraform shell calls into using a Golang-native API for Terraform.
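One candidate for such a Golang-native API is HashiCorp's terraform-exec library; this is a suggestion rather than something the issue prescribes. A minimal sketch, with the working directory and binary path as placeholders:

// Hypothetical sketch: driving Terraform through terraform-exec instead of shelling out.
package main

import (
	"context"
	"log"

	"github.com/hashicorp/terraform-exec/tfexec"
)

func main() {
	ctx := context.Background()

	// The working directory and the terraform binary path are placeholders.
	tf, err := tfexec.NewTerraform("./terraform", "terraform")
	if err != nil {
		log.Fatalf("error creating terraform handle: %v", err)
	}

	if err := tf.Init(ctx); err != nil {
		log.Fatalf("terraform init failed: %v", err)
	}
	if err := tf.Apply(ctx); err != nil {
		log.Fatalf("terraform apply failed: %v", err)
	}
}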
Exit Criteria
When a user wants to add a node to an existing cluster, Wireguardian will sometimes assign another IP address to an existing node that already has one. There is no check implemented to prevent this behavior. Currently, Wireguardian is issuing IP addresses according to how they are arranged in the slice.
Wireguardian should add existing private IP addresses to existing nodes in the generated Ansible inventory file.
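A minimal sketch of the kind of check that could prevent duplicate assignments, assuming the node list carries any already assigned private IPs (type and function names are hypothetical):

// Hypothetical sketch: keep the private IPs that nodes already have and only hand out
// addresses that are not in use yet.
package vpn

import "fmt"

type Node struct {
	Name      string
	PrivateIP string // empty if no address has been assigned yet
}

// AssignPrivateIPs fills the gaps from a 192.168.2.0/24-style range without reusing addresses.
func AssignPrivateIPs(nodes []*Node) error {
	used := map[string]bool{}
	for _, n := range nodes {
		if n.PrivateIP != "" {
			used[n.PrivateIP] = true
		}
	}
	next := 1
	for _, n := range nodes {
		if n.PrivateIP != "" {
			continue // keep the address the node already has
		}
		for ; next < 255; next++ {
			candidate := fmt.Sprintf("192.168.2.%d", next)
			if !used[candidate] {
				n.PrivateIP = candidate
				used[candidate] = true
				break
			}
		}
		if n.PrivateIP == "" {
			return fmt.Errorf("no free private IP left for node %s", n.Name)
		}
	}
	return nil
}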
Currently, the root project is initialised as a single Go module which keeps track of all the packages being used by all the services. At build time, we use the same go.mod and go.sum (available at the root of the project) to download packages for our services, which causes us to download packages that are not required by a given service. Instead of this (IMO), we should define each service as an individual Go module to track only the packages used by that service.
However, I am not sure if go build picks up only the required packages at compilation time. Would love to know more on this.
Currently we have to run each service in an individual shell, which is a tedious and error-prone task. Also, we are running the go run command, which is good for quick development, but I feel we should run the services in a containerised environment, since the services are deployed as containers.
Docker Swarm is also an option, but in my opinion it would be a bit of an overkill.
We need to start by writing a docker-compose file and configuring the services. Since we are already using Docker for building our images, we'll focus on the docker-compose file only.
This is a clean-up task in order to keep things simple.
Clean up the claudie namespace, as we don't have any use-case for it. This will simplify our environment and decrease the cognitive load a little.
The claudie namespace is cleaned up from the cluster and from all the manifests. The claudie-<sha256> namespaces continue working properly and are still part of the CI runs.
After a successful run of the testing framework, the cluster that Claudie creates is not deleted.
After a successful run of the testing framework, the cluster that Claudie creates should be deleted.
Error message returned to the testing framework:
2021/09/03 13:27:09 Deleting the clusters from test set: tests/test-set1
2021/09/03 13:27:09 Error while processing tests/test-set1 : rpc error: code = Unknown desc = error while calling DestroyInfrastructure on Terraformer: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: address :tcp://10.98.3.204:50052: too many colons in address"
platform_test.go:67:
Error Trace: platform_test.go:67
Error: Received unexpected error:
rpc error: code = Unknown desc = error while calling DestroyInfrastructure on Terraformer: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: address :tcp://10.98.3.204:50052: too many colons in address"
Test: TestPlatform
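For reference, grpc.Dial expects a plain host:port target, while the address in the error above carries a leading colon plus a tcp:// scheme, which suggests the dial target is assembled from a variable that already contains a full URL (Kubernetes' injected *_PORT variables have exactly that tcp://IP:PORT form). A minimal sketch of building the target explicitly; the helper name and values are hypothetical:

// Hypothetical sketch: build a plain "host:port" dial target for gRPC,
// since grpc.Dial does not accept a scheme such as "tcp://".
package main

import (
	"fmt"
	"log"

	"google.golang.org/grpc"
)

func dialTerraformer(host string, port int) (*grpc.ClientConn, error) {
	target := fmt.Sprintf("%s:%d", host, port) // e.g. "10.98.3.204:50052"
	return grpc.Dial(target, grpc.WithInsecure())
}

func main() {
	conn, err := dialTerraformer("10.98.3.204", 50052)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}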
This is important in order to keep a constant check on whether all the supported platforms are working as intended. The current testing framework only caters for Hetzner as the provider. The decision to go with Hetzner nodes only was taken in order to save CI/CD time, as Hetzner is quick when it comes to spawning nodes and configuring the network (compared to GCP).
It seems that the currently used DEV environment bucket for Claudie, develop_test_bucket, has been created via ClickOps, because it seems to be absent from the Berops/infra Terraform.
We generally always want to create all infrastructure via Terraform.
The objective of this task is to integrate that GCS bucket into the Terraform code.
develop_test_bucket is managed by Terraform in Berops/infra.
When a manifest is being processed and Builder unexpectedly fails and dies, Claudie does not detect this and continues to wait.
When a manifest is being processed and Builder unexpectedly fails and dies, Claudie should detect it and terminate with an error.
The testing-framework will not stop and will continue waiting until the testing-framework timeout. This might be happening with Scheduler as well, since both of these services are implemented as clients. We need to find an optimal solution for checking whether the client services are running or not.
Currently the naming convention for the temporary namespaces takes the whole commit SHA string and uses it as a suffix (claudie-<sha>). That means all the temp namespaces are long, with too many characters, and are difficult to follow. Shall we shorten them?
The idea would be to take a shorter version of the git commit SHA, ideally just 6-7 characters (e.g. short commit hashes have just 7 characters).
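A minimal sketch of deriving the shorter suffix, assuming the full commit SHA is available from the CI environment (GITHUB_SHA in GitHub Actions):

// Hypothetical sketch: derive a 7-character namespace suffix from the full commit SHA.
package main

import (
	"fmt"
	"os"
)

func main() {
	sha := os.Getenv("GITHUB_SHA") // full 40-character commit SHA in GitHub Actions
	short := sha
	if len(short) > 7 {
		short = short[:7]
	}
	fmt.Printf("claudie-%s\n", short) // e.g. claudie-1a2b3c4
}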
Motivation
The application should be able to recover from the errors it meets on the execution path; it should not crash.
At the moment, the application simply crashes.
Description
Analyze the code and add code for capturing and processing errors. At minimum, each error should be logged with a message. On top of that, extra recovery might be needed in some cases (e.g. cleaning up temporary structures like queues).
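A minimal sketch of the intended pattern, i.e. logging every error with a message, cleaning up, and propagating the error instead of letting the process crash (all function names here are hypothetical):

// Hypothetical sketch: capture, log and propagate errors instead of crashing,
// and clean up temporary structures on the way out.
package worker

import (
	"fmt"
	"log"
)

// processConfig stands in for any unit of work in a Claudie service.
func processConfig(name string) error {
	if err := doWork(name); err != nil {
		log.Printf("error while processing config %s: %v", name, err)
		cleanupQueues(name) // e.g. drop temporary queue entries for this config
		return fmt.Errorf("processing config %s: %w", name, err)
	}
	return nil
}

func doWork(name string) error  { return nil } // placeholder for the real work
func cleanupQueues(name string) {}             // placeholder for the real cleanup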
Exit Criteria
Errors from ContextBox are being taken care of.
Errors from Scheduler are being taken care of.
Errors from Builder are being taken care of.
Errors from Terraformer are being taken care of.
Errors from Wireguardian are being taken care of.
Errors from KubeEleven are being taken care of.
I have already made a solution as a part of my diploma thesis, so this bug should be easy to fix.
Nodes don't launch Wireguard on restart.
After a restart, a node should have Wireguard running and be part of the full-mesh VPN.
Restart a running node that is part of the Wireguard full-mesh VPN.
Right now the testing framework doesn't know about any errors happening in the platform's microservices, which hugely extends the testing time (a test will fail only on timeout). Propagate errors from the platform's services to the testing framework.
Log from testing-framework
2021/10/01 11:46:18 Waiting for 1.yaml to finish... [ 1980s elapsed ]
2021/10/01 11:46:48 Waiting for 1.yaml to finish... [ 2010s elapsed ]
2021/10/01 11:47:18 Waiting for 1.yaml to finish... [ 2040s elapsed ]
2021/10/01 11:47:48 Waiting for 1.yaml to finish... [ 2070s elapsed ]
2021/10/01 11:48:18 Waiting for 1.yaml to finish... [ 2100s elapsed ]
2021/10/01 11:48:48 Waiting for 1.yaml to finish... [ 2130s elapsed ]
2021/10/01 11:49:18 Waiting for 1.yaml to finish... [ 2160s elapsed ]
2021/10/01 11:49:48 Waiting for 1.yaml to finish... [ 2190s elapsed ]
2021/10/01 11:50:18 Waiting for 1.yaml to finish... [ 2220s elapsed ]
2021/10/01 11:50:48 Waiting for 1.yaml to finish... [ 2250s elapsed ]
Some providers (for example GCP) require name uniqueness for virtual machines and some other resources. In order to avoid this problem, we should generate a unique random hash for every config file.
The random hash should be generated before, or as part of, the applyTerraform() function.
The hash should be unique per config file.
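A minimal sketch of generating such a suffix before applyTerraform() runs; the character set and length handling are assumptions:

// Hypothetical sketch: generate a short random hash to append to resource names so that
// providers requiring unique names (e.g. GCP) don't collide across config files.
package utils

import (
	"crypto/rand"
	"math/big"
)

const hashChars = "abcdefghijklmnopqrstuvwxyz0123456789"

// RandomHash returns a random lowercase alphanumeric string of the given length.
func RandomHash(length int) (string, error) {
	out := make([]byte, length)
	for i := range out {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(hashChars))))
		if err != nil {
			return "", err
		}
		out[i] = hashChars[n.Int64()]
	}
	return string(out), nil
}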
The motivation is to cut down the duration of the e2e tests by enabling parallelization of the test sets.
We need to ensure that the test framework supports running multiple test sets in parallel so that we can parallelize tests and get the result of the pipeline as soon as possible. Delays in the CI pipelines hurt productivity.
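A minimal sketch of how the framework could fan the test sets out in parallel with the standard testing package; the test-set names are assumptions:

// Hypothetical sketch: run each test set as a parallel subtest so that provisioning
// infrastructure for one set doesn't block the others.
package e2e

import "testing"

func TestPlatform(t *testing.T) {
	testSets := []string{"tests/test-set1", "tests/test-set2"} // assumed layout

	for _, set := range testSets {
		set := set // capture the loop variable for the subtest
		t.Run(set, func(t *testing.T) {
			t.Parallel() // subtests marked parallel run concurrently
			runTestSet(t, set)
		})
	}
}

// runTestSet stands in for applying the manifests of one test set and waiting for the result.
func runTestSet(t *testing.T, set string) {
	t.Logf("running %s", set)
}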
We need to figure out a storage system to enable stateful workloads.
The ability to run stateful workloads is a must. At the same time, running stateful workloads is complex. Here the complexity is on another level, considering the multi-/hybrid-cloud environment. At first, a basic solution for the most common use-cases will do.
Some of the utility functions, like for example:
... are used multiple times in different modules. To avoid duplication and for more code simplicity, it is good practice to move these utility functions to a separate package.
The task for this feature consists of finding the mentioned functions and creating a new Golang utility package for them, then refactoring the existing modules to work with the new package.
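A minimal sketch of what such a shared package could look like; the helpers below are only illustrative examples, not the actual functions referenced above:

// Hypothetical sketch: a shared utility package that individual services import
// instead of each keeping its own copy of common helpers.
package utils

import (
	"fmt"
	"os"
)

// GetEnvDefault returns the value of an environment variable or a default when it is unset.
func GetEnvDefault(key, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return def
}

// SanitiseName is an example helper for building provider-safe resource names.
func SanitiseName(base, suffix string) string {
	return fmt.Sprintf("%s-%s", base, suffix)
}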
The log level verbosity should allow developers to choose from different levels of verbosity for logs, e.g. debug level, dev level, etc. This should be easily configurable, meaning switching between levels of verbosity should be as easy as setting one variable.
Motivation
The more public cloud providers the platform supports, the better are its chances of success as a multi-cloud solution.
Add more public cloud providers as options for NodePools, LBs...
We need to ensure that certain entry-points to the cluster will be highly available.
Every K8s cluster needs to have load balancers at least for Ingress and K8s API traffic. Eventually this should be easily configurable for other types of LBs (TCP, UDP, NodePorts). The catch here is that the solution needs to work:
We might need to explore the possibilities of using LB-as-a-Service (from cloud-providers) vs home-made (e.g. haproxy) deployments. Understand the availability of this solution.
If I remember correctly, there was a doubt raised by @samuelstolicny & @bernardhalas about nodepools.
I don't exactly remember what that was about - please clarify, fill in the Issue, or let's discuss it first - as you prefer.
Motivation
TODO
Description
TODO
Exit Criteria
We need to ensure that the secrets we use in the platform are stored and managed in a secure manner.
Analyze the needs the platform has regarding a secret management solution. Based on those needs, propose and implement a secret management solution.
Right now, the CD pipeline is called right after every successful CI run. This approach is not ideal, because the codebase from a PR will propagate to the dev deployment (aka the "claudie namespace" in our dev cluster) as soon as the PR is created, not after it's considered working and merged to the master branch.
Run CD pipeline after the merge to master, not after successful CI.
Path to the CD pipeline YAML: .github/workflows/CD-pipeline-dev.yml
Current trigger:
on:
# Run after CI
workflow_run:
workflows: [ "CI pipeline for platform" ]
types: [ completed ]
# Manual trigger
workflow_dispatch:
This one should be really quick, so I consider it a "good first issue", but let's groom it first.
At the moment, Claudie uses a hardcoded S3 (GCS) bucket with a hardcoded name, which is presupposed to already exist.
In the future, we'll want to rework this into one or more of the following approaches:
First, we'll have to agree on an approach.
The Context-box will receive the first set of new configs. They will be added to the Scheduler queue and the Scheduler will process them normally. After they have been processed by the Scheduler, they are added to the Builder queue, to be processed by Builder. Builder will process them and all looks good.
Then the Context-box will receive a second set of configs. They will be added to the Scheduler queue and processed by the Scheduler normally. After that, they are NOT added to the Builder queue. Only after a restart of the Context-box pod is the Builder queue updated.
The Builder queue should be updated after the second set of configs has been processed by the Scheduler.
Reproduction revolves around TestSaveConfigFrontEnd in /services/context-box/client/client_test.go: run TestSaveConfigFrontEnd, run TestSaveConfigFrontEnd again, and modify TestSaveConfigFrontEnd to add 15 new configs each time the test runs (instead of one new config), with a time.Sleep used as dummy work in Builder:
log.Println("I got config: ", config.GetName())
//config = callTerraformer(config)
//config = callWireguardian(config)
//config = callKubeEleven(config)
time.Sleep(60 * time.Second) //dummy "work"
config.CurrentState = config.DesiredState // Update currentState
Within Claudie we need LB-as-a-Service for making endpoints highly-available.
Following the LB POC in #54, implement a solution for building on-demand LB clusters. The configuration syntax is described in #39, which makes it a prerequisite. Please bear in mind that a single LB cluster may be used for multiple services.
Credentials are misconfigured in the gcp.tpl file in the Terraformer module. Currently it's working because Terraform uses the GCP credentials from backend.tpl and skips the gcp.tpl credentials.
gcp.tpl should contain a functional path to the GCP credentials, because if we choose to use a different backend for Terraform, the GCP provider will stop working.
KubeEleven fails to create a k8s cluster and goes into an endless cycle of calling an API on the APIEndpoint, which keeps returning 404. With Hetzner, on the other hand, the k8s cluster is created successfully.
KubeEleven should be able to create a k8s cluster with GCP nodes as control nodes.
Reproduction involves docker compose and server/context-box/client/client_test.go.
The issue is on the master branch. Here is the debug.log from my run.
Motivation
Define interfaces (user or robotic ones) through which the clients can interact with the platform. I assume we will need
While working on the concurrency task I stumbled upon this interesting finding: gRPC returns a nil response together with an error. Let me briefly explain using our codebase.
config, err = callWireguardian(config)
if err != nil && config != nil {
config.CurrentState = config.DesiredState // Update currentState
// save error message to config
config.ErrorMessage = err.Error()
errSave := cbox.SaveConfigBuilder(c, &pb.SaveConfigRequest{Config: config})
if errSave != nil {
return fmt.Errorf("error while saving the config: %v", err)
}
return fmt.Errorf("error in Wireguardian: %v", err)
}
Currently, if Wireguardian fails, we expect that it returns an error message with the config back to Builder, so that Builder can set currentState and ErrorMessage and save it to the DB. But, because of gRPC's specific implementation (the client is automatically generated from the proto file), it always returns a nil response alongside the error:
func (c *wireguardianServiceClient) BuildVPN(ctx context.Context, in *BuildVPNRequest, opts ...grpc.CallOption) (*BuildVPNResponse, error) {
out := new(BuildVPNResponse)
err := c.cc.Invoke(ctx, "/platform.WireguardianService/BuildVPN", in, out, opts...)
if err != nil {
return nil, err
}
return out, nil
}
Surprisingly I didn't find this in Terraformer. 🤔
A possible solution could be using the richer error model; see the sketch after the links below.
Some useful links that I found about this topic:
https://stackoverflow.com/questions/61949913/why-cant-i-get-a-non-nil-response-and-err-from-grpc
https://stackoverflow.com/questions/48748745/pattern-for-rich-error-handling-in-grpc
https://grpc.io/docs/guides/error/#richer-error-model
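A minimal sketch of the richer error model with grpc-go, where the server attaches structured details to a status and the client unpacks them instead of relying on a non-nil response; the detail contents are hypothetical:

// Hypothetical sketch: the server side attaches structured details to a gRPC status,
// the client side recovers them with status.FromError.
package grpcerrors

import (
	"fmt"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// buildVPNError is what the Wireguardian server could return on failure.
func buildVPNError(clusterName string, cause error) error {
	st := status.New(codes.Internal, fmt.Sprintf("error in Wireguardian: %v", cause))
	detailed, err := st.WithDetails(&errdetails.ErrorInfo{
		Reason:   "BUILD_VPN_FAILED",
		Domain:   "wireguardian",
		Metadata: map[string]string{"cluster": clusterName},
	})
	if err != nil {
		return st.Err() // fall back to the plain status
	}
	return detailed.Err()
}

// inspectError is what Builder could do with the error it gets back.
func inspectError(err error) {
	if st, ok := status.FromError(err); ok {
		fmt.Println("code:", st.Code(), "message:", st.Message())
		for _, d := range st.Details() {
			fmt.Printf("detail: %+v\n", d)
		}
	}
}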
The development environment is not fully configured via code. In order for the environment to be brought up, the administrator of the environment needs to manually obtain several tools like Ansible, Terraform and KubeOne. The versioning of this toolset is not yet managed.
The project contains a description of the versions of tools it depends on and for which the testing and implementation is valid.
N/A
Motivation:
We need to be able to switch into log debug mode in case any issue is occurring, in order to be able to collect more data.
Description:
Rework all logging messages so that they include a log verbosity definition. Feel free to consider using the Log4Go library (or similar): https://github.com/jeanphorn/log4go.
Exit Criteria:
At the moment, Claudie doesn't validate the input config in any way whatsoever.
There should probably be some form of input validation, so that the user gets feedback if they mess something up.
In order to save resources, we would like to move our production cluster (where the platform runs) from GKE Autopilot to the standard GKE solution. The main reason is that Autopilot is currently not a suitable solution for our workload.
The task includes migrating the current deployments to the GKE solution with autoscaling and persisting it as IaC (infrastructure as code).
Use a private address and a load balancer (the reasoning is described in more detail in #43).
For IaC use Terraform. The best place should be our infra repository.
Create Terraform manifests for the GKE solution with
Motivation
At the moment, we're calling KubeOne via Shell, but there may exist a Golang-native API.
Using that instead could be cleaner.
Description
Refactor all usages of KubeOne shell calls into using a Golang-native API for KubeOne.
Exit Criteria
Child tasks from #50
We need to figure out a storage system to enable stateful workloads. This POC should find a suitable strategy using one of the storage solutions described below.
Consider the following storage solutions:
Focus on exploring the following strategy:
Orchestrate storage on the k8s cluster nodes by creating one storage cluster across multiple providers. This storage cluster will have a series of "zones", one for each cloud provider. Each zone should store its own persistent volume data.
Explore additional strategies if the one above turns out to be inappropriate/infeasible:
Motivation:
We need to be able to test the platform's functionality in a production-like environment.
Description:
Deploy the platform on a Kubernetes cluster and figure out how to run end-to-end tests on the current functionality of the platform.
Exit Criteria:
SaveConfigFrontEnd message to Context-box
Motivation
We need to be able to support on-premise clusters and hybrid-cloud clusters.
At the moment, we're having issues due to pods not being schedulable, and GKE being totally helpless:
Warning FailedScheduling 2m17s (x28 over 41m) gke.io/optimize-utilization-scheduler 0/8 nodes are available: 7 Insufficient memory, 8 Insufficient cpu.
Normal NotTriggerScaleUp 2m14s (x227 over 38m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 5 in backoff after failed scale-up
Furthermore, the pods actually get assigned some default resource requests/limits, but they are probably ridiculously high for the dev environment.
resources:
limits:
cpu: 500m
ephemeral-storage: 1Gi
memory: 2Gi
requests:
cpu: 500m
ephemeral-storage: 1Gi
memory: 2Gi
We should define resource requests for each K8s deployment, with sensible values.
That will help us avoid future problems with the pods being unschedulable, and it will enable us to fit more pods into fewer nodes.
Each deployment has its resources: section specified with sensible values based on real resource usage. The relevant MR is reviewed and merged.