berops / claudie
Cloud-agnostic managed Kubernetes
Home Page: https://docs.claudie.io/
License: Apache License 2.0
Deployments remain unchanged after a successful CI pipeline.
Log from GitHub Actions:
Run kustomize build | kubectl apply -f -
configmap/urls-5mdg2m4f2g unchanged
service/context-box unchanged
service/kube-eleven unchanged
service/mongodb unchanged
service/terraformer unchanged
service/wireguardian unchanged
deployment.apps/builder unchanged
deployment.apps/context-box unchanged
deployment.apps/kube-eleven unchanged
deployment.apps/mongodb unchanged
deployment.apps/scheduler unchanged
deployment.apps/terraformer unchanged
deployment.apps/wireguardian unchanged
After a successful CI run, the CD pipeline should update all deployments.
According to the job definition everything should work fine, so I have no idea where the problem could be.
I don't know if it's related to this issue, but right now CD is triggered after every successful CI run. Wouldn't it be better to trigger it after the merge to master? (Please check the definition.)
After trying to run the services in Docker, I noticed that the image sizes were too large for some of the services. After a quick glance at the respective Dockerfiles, I saw that some of them ship the image used to build the service, while a couple of them use a scratch image with the build artifacts copied from the builder image. I am not sure why some of the Dockerfiles are not following this practice.
I think we can discuss whether to take this task ahead, keeping the priority of other tasks in mind.
Motivation:
We want to test the basic functionality of the platform before running the end-to-end tests
Description:
Run unit tests in the CI pipeline to decide whether the platform is ready to be deployed and tested by the end-to-end tests
Exit Criteria:
Motivation
All our networking setups should be protected by a network firewall.
Description
Currently the Hetzner Cloud provider doesn't specify any firewalls. Analyze which networking ports should be open and close all the remaining ones. Feel free to take inspiration from the GCP provider.
Exit Criteria
We should adhere to the same coding style so that reading somebody else's code won't be a pain.
We should choose, configure and integrate a code linter to ensure that the code is not a mix of various coding styles.
When running more complex E2E tests, I noticed that Claudie has a problem when we are adding a master node and deleting worker nodes at the same time. The Builder throws the error Error while draining node testset-cluster-name1-1s2yvwn-hetzner-compute-hetzner-4 : exit status 1
The worker node should be deleted.
testing-framework
This bug was spotted in PR #110, but after running the same test set on the master branch (before the merge), the error was there as well.
The Terraformer, Wireguardian and KubeEleven services should work on clusters in separate threads, dramatically decreasing work time.
Right now, when one of the services receives a config, it does its work cluster by cluster. What we are aiming for is a parallel workflow for each cluster in the config.
Each service should wait for the completion of all clusters before it sends the config back to Builder.
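A minimal sketch of how that per-cluster parallelism could look inside one of the services, assuming a hypothetical processCluster helper that handles a single cluster; golang.org/x/sync/errgroup is used purely as an illustration:

// Hypothetical sketch: process every cluster of the received config concurrently
// and wait for all of them before handing the config back to Builder.
package worker

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Cluster and Config stand in for the real protobuf types.
type Cluster struct{ Name string }
type Config struct{ Clusters []*Cluster }

// processCluster is a hypothetical per-cluster worker (the Terraformer/Wireguardian/KubeEleven work).
func processCluster(c *Cluster) error {
	fmt.Println("processing cluster", c.Name)
	return nil
}

// BuildConfig fans out one goroutine per cluster and waits for all of them to finish.
func BuildConfig(cfg *Config) error {
	var g errgroup.Group
	for _, cluster := range cfg.Clusters {
		cluster := cluster // capture the loop variable for the goroutine
		g.Go(func() error { return processCluster(cluster) })
	}
	// Wait blocks until every cluster is done and returns the first error, if any.
	return g.Wait()
}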
The logging solution present in the Golang standard library (the 'log' package) is missing features like structured logging, log levels etc. Using a proper Go logging library would be beneficial.
The standard library log package will no longer be used for logging. The code will use a dedicated logging package.
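As an illustration only (the issue does not prescribe a particular library), here is a minimal sketch using zerolog, which provides structured logging and log levels; the GOLANG_LOG variable name is an assumption:

// Hypothetical sketch: structured, leveled logging with zerolog instead of the stdlib log package.
package main

import (
	"os"

	"github.com/rs/zerolog"
	"github.com/rs/zerolog/log"
)

func main() {
	// The level is switchable via a single environment variable (the name is an assumption).
	level := zerolog.InfoLevel
	if os.Getenv("GOLANG_LOG") == "debug" {
		level = zerolog.DebugLevel
	}
	zerolog.SetGlobalLevel(level)
	log.Logger = zerolog.New(os.Stderr).With().Timestamp().Logger()

	log.Info().Str("service", "builder").Msg("service started")
	log.Debug().Msg("only visible when the debug level is enabled")
}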
Child tasks from #44.
We need to find a feasible LB setup for multi-cloud and hybrid-cloud deployments. In order to be able to start the implementation of the LBs in the platform, we want to run a POC of the architectural setup.
This task is about figuring out a way to deploy LBs for the K8s API and Ingress controller(s), then running a POC of such a setup and basic tests of how it behaves. Once we find a working mode, we should assess whether the LB architecture will work for hybrid-cloud setups as well.
While deleting a master node, an error from KubeOne occurs in the KubeEleven module. Deleting a worker node works fine.
kubeOne apply -f manifest.yaml
INFO[11:13:56 -05] Electing cluster leader...
ERRO[11:13:57 -05] Failed to elect leader.
ERRO[11:13:57 -05] Quorum is mostly like lost, manual cluster repair might be needed.
ERRO[11:13:57 -05] Consider the KubeOne documentation for further steps.
WARN[11:13:57 -05] Task failed, error was: leader not elected, quorum mostly like lost
KubeOne apply should pass without any error
Motivation
In order to run the platform in a highly-available mode, we need to make sure that the platform can run in a 2+ replica mode for all the microservices (this way, if a node with replica A is down, the service is served from another node hosting replica B). If the situation allows, the scheduler typically doesn't deploy replicas of the same deployment onto the same node.
Description
Analyze which workloads can be deployed in a 2-replica mode right now (e.g. scheduler, builder, ...) and apply the manifests.
Exit Criteria
Motivation
At the moment, we're calling Terraform via Shell, but there exists a Golang-native API for Terraform.
Using that instead could be cleaner.
Description
Refactor all usages of Terraform shell calls into using a Golang-native API for Terraform.
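One candidate for such a Golang-native API is HashiCorp's terraform-exec library; this is a suggestion rather than something the issue prescribes. A minimal sketch, with the working directory and binary path as placeholders:

// Hypothetical sketch: driving Terraform through terraform-exec instead of shelling out.
package main

import (
	"context"
	"log"

	"github.com/hashicorp/terraform-exec/tfexec"
)

func main() {
	ctx := context.Background()

	// The working directory and the terraform binary path are placeholders.
	tf, err := tfexec.NewTerraform("./terraform", "terraform")
	if err != nil {
		log.Fatalf("error creating terraform handle: %v", err)
	}

	if err := tf.Init(ctx); err != nil {
		log.Fatalf("terraform init failed: %v", err)
	}
	if err := tf.Apply(ctx); err != nil {
		log.Fatalf("terraform apply failed: %v", err)
	}
}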
Exit Criteria
When a user wants to add a node to an existing cluster, Wireguardian will sometimes assign another IP address to an existing node that already has one. There is no check implemented to prevent this behavior. Currently, Wireguardian is issuing IP addresses according to how they are arranged in the slice.
Wireguardian should add existing private IP addresses to existing nodes in the generated Ansible inventory file.
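A minimal sketch of the kind of check that could prevent duplicate assignments, assuming the node list carries any already assigned private IPs (type and function names are hypothetical):

// Hypothetical sketch: keep the private IPs that nodes already have and only hand out
// addresses that are not in use yet.
package vpn

import "fmt"

type Node struct {
	Name      string
	PrivateIP string // empty if no address has been assigned yet
}

// AssignPrivateIPs fills the gaps from a 192.168.2.0/24-style range without reusing addresses.
func AssignPrivateIPs(nodes []*Node) error {
	used := map[string]bool{}
	for _, n := range nodes {
		if n.PrivateIP != "" {
			used[n.PrivateIP] = true
		}
	}
	next := 1
	for _, n := range nodes {
		if n.PrivateIP != "" {
			continue // keep the address the node already has
		}
		for ; next < 255; next++ {
			candidate := fmt.Sprintf("192.168.2.%d", next)
			if !used[candidate] {
				n.PrivateIP = candidate
				used[candidate] = true
				break
			}
		}
		if n.PrivateIP == "" {
			return fmt.Errorf("no free private IP left for node %s", n.Name)
		}
	}
	return nil
}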
Currently, the root project is initialised as a single Go module which keeps track of all the packages being used by all the services. At build time, we use the same go.mod and go.sum (available at the root of the project) to download packages for our services, which causes us to download packages that are not required by a given service. Instead of this (IMO), we should define each service as an individual Go module to track only the packages used by that service.
However, I am not sure if go build picks up only the required packages at compilation time. Would love to know more on this.
Currently we have to run each service in an individual shell, which is a tedious and error-prone task. Also, we are running the go run command, which is good for quick development, but I feel we should run the services in a containerised environment, since the services are deployed as containers.
Docker Swarm is also an option, but in my opinion it would be a bit of an overkill.
We need to start by writing a docker-compose file and configuring the services. Since we are already using Docker for building our images, we'll focus on the docker-compose file only.
This is a clean-up task in order to keep things simple.
Clean up the claudie namespace, as we don't have any use-case for it. This will simplify our environment and decrease the cognitive load a little.
The claudie namespace is cleaned up from the cluster and from all the manifests. The claudie-<sha256> namespaces continue working properly and are still part of the CI runs.
After a successful run of the testing framework, the cluster that Claudie creates is not deleted.
After a successful run of the testing framework, the cluster that Claudie creates should be deleted.
Error message returned to the testing framework:
2021/09/03 13:27:09 Deleting the clusters from test set: tests/test-set1
2021/09/03 13:27:09 Error while processing tests/test-set1 : rpc error: code = Unknown desc = error while calling DestroyInfrastructure on Terraformer: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: address :tcp://10.98.3.204:50052: too many colons in address"
platform_test.go:67:
Error Trace: platform_test.go:67
Error: Received unexpected error:
rpc error: code = Unknown desc = error while calling DestroyInfrastructure on Terraformer: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: address :tcp://10.98.3.204:50052: too many colons in address"
Test: TestPlatform
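For reference, grpc.Dial expects a plain host:port target, while the address in the error above carries a leading colon plus a tcp:// scheme, which suggests the dial target is assembled from a variable that already contains a full URL (Kubernetes' injected *_PORT variables have exactly that tcp://IP:PORT form). A minimal sketch of building the target explicitly; the helper name and values are hypothetical:

// Hypothetical sketch: build a plain "host:port" dial target for gRPC,
// since grpc.Dial does not accept a scheme such as "tcp://".
package main

import (
	"fmt"
	"log"

	"google.golang.org/grpc"
)

func dialTerraformer(host string, port int) (*grpc.ClientConn, error) {
	target := fmt.Sprintf("%s:%d", host, port) // e.g. "10.98.3.204:50052"
	return grpc.Dial(target, grpc.WithInsecure())
}

func main() {
	conn, err := dialTerraformer("10.98.3.204", 50052)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}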
This is important in order to keep a constant check on whether all the supported platforms are working as intended. The current testing framework only caters for Hetzner as the provider. The decision to go with Hetzner nodes only was taken in order to save CI/CD time, as Hetzner is quick when it comes to spawning nodes and configuring the network (compared to GCP).
It seems that the currently used DEV environment bucket for Claudie, develop_test_bucket, has been created via ClickOps, because it seems to be absent from the Berops/infra Terraform.
We generally always want to create all infrastructure via Terraform.
The objective of this task is to integrate that GCS bucket into the Terraform code.
develop_test_bucket is managed by Terraform in Berops/infra.
When a manifest is being processed and Builder unexpectedly fails and dies, Claudie does not detect this and continues to wait.
When a manifest is being processed and Builder unexpectedly fails and dies, Claudie should detect it and terminate with an error.
The testing-framework will not stop and will continue waiting until the testing-framework timeout. This might be happening with Scheduler as well, since both of these services are implemented as clients. We need to find an optimal solution for checking whether the client services are running or not.
Currently the naming convention for the temporary namespaces takes the whole commit SHA string and uses it as a suffix (claudie-<sha>). That means all the temp namespaces are long, with too many characters, and are difficult to follow. Shall we shorten them?
The idea would be to take a shorter version of the git commit SHA, ideally just 6-7 characters (e.g. short commit hashes have just 7 characters).
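A minimal sketch of deriving the shorter suffix, assuming the full commit SHA is available from the CI environment (GITHUB_SHA in GitHub Actions):

// Hypothetical sketch: derive a 7-character namespace suffix from the full commit SHA.
package main

import (
	"fmt"
	"os"
)

func main() {
	sha := os.Getenv("GITHUB_SHA") // full 40-character commit SHA in GitHub Actions
	short := sha
	if len(short) > 7 {
		short = short[:7]
	}
	fmt.Printf("claudie-%s\n", short) // e.g. claudie-1a2b3c4
}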
Motivation
The application should be able to recover from the errors it meets on the execution path; it should not crash.
At the moment, the application simply crashes.
Description
Analyze the code and add code for capturing and processing errors. At minimum, each error should be logged with a message. On top of that, extra recovery might be needed in some cases (e.g. cleaning up temporary structures like queues).
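A minimal sketch of the intended pattern, i.e. logging every error with a message, cleaning up, and propagating the error instead of letting the process crash (all function names here are hypothetical):

// Hypothetical sketch: capture, log and propagate errors instead of crashing,
// and clean up temporary structures on the way out.
package worker

import (
	"fmt"
	"log"
)

// processConfig stands in for any unit of work in a Claudie service.
func processConfig(name string) error {
	if err := doWork(name); err != nil {
		log.Printf("error while processing config %s: %v", name, err)
		cleanupQueues(name) // e.g. drop temporary queue entries for this config
		return fmt.Errorf("processing config %s: %w", name, err)
	}
	return nil
}

func doWork(name string) error  { return nil } // placeholder for the real work
func cleanupQueues(name string) {}             // placeholder for the real cleanup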
Exit Criteria
Errors from ContextBox are being taken care of.
Errors from Scheduler are being taken care of.
Errors from Builder are being taken care of.
Errors from Terraformer are being taken care of.
Errors from Wireguardian are being taken care of.
Errors from KubeEleven are being taken care of.
I have already made a solution as a part of my diploma thesis, so this bug should be easy to fix.
Nodes don't launch Wireguard on restart.
After a restart, a node should have Wireguard running and be part of the full-mesh VPN.
Restart a running node that is part of the Wireguard full-mesh VPN.
Right now the testing framework doesn't know about any errors happening in the platform's microservices, which hugely extends the testing time (a test will fail only on timeout). Propagate errors from the platform's services to the testing framework.
Log from testing-framework
2021/10/01 11:46:18 Waiting for 1.yaml to finish... [ 1980s elapsed ]
2021/10/01 11:46:48 Waiting for 1.yaml to finish... [ 2010s elapsed ]
2021/10/01 11:47:18 Waiting for 1.yaml to finish... [ 2040s elapsed ]
2021/10/01 11:47:48 Waiting for 1.yaml to finish... [ 2070s elapsed ]
2021/10/01 11:48:18 Waiting for 1.yaml to finish... [ 2100s elapsed ]
2021/10/01 11:48:48 Waiting for 1.yaml to finish... [ 2130s elapsed ]
2021/10/01 11:49:18 Waiting for 1.yaml to finish... [ 2160s elapsed ]
2021/10/01 11:49:48 Waiting for 1.yaml to finish... [ 2190s elapsed ]
2021/10/01 11:50:18 Waiting for 1.yaml to finish... [ 2220s elapsed ]
2021/10/01 11:50:48 Waiting for 1.yaml to finish... [ 2250s elapsed ]
Some providers (for example GCP) require name uniqueness for virtual machines and some other resources. In order to avoid this problem, we should generate a unique random hash for every config file.
The random hash should be generated before, or as part of, the applyTerraform() function.
The hash should be unique per config file.
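A minimal sketch of generating such a suffix before applyTerraform() runs; the character set and length handling are assumptions:

// Hypothetical sketch: generate a short random hash to append to resource names so that
// providers requiring unique names (e.g. GCP) don't collide across config files.
package utils

import (
	"crypto/rand"
	"math/big"
)

const hashChars = "abcdefghijklmnopqrstuvwxyz0123456789"

// RandomHash returns a random lowercase alphanumeric string of the given length.
func RandomHash(length int) (string, error) {
	out := make([]byte, length)
	for i := range out {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(hashChars))))
		if err != nil {
			return "", err
		}
		out[i] = hashChars[n.Int64()]
	}
	return string(out), nil
}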
The motivation is to cut down the duration of the e2e tests by enabling parallelization of the test sets.
We need to ensure that the test framework supports running multiple test sets in parallel so that we can parallelize tests and get the result of the pipeline as soon as possible. Delays in the CI pipelines hurt productivity.
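A minimal sketch of how the framework could fan the test sets out in parallel with the standard testing package; the test-set names are assumptions:

// Hypothetical sketch: run each test set as a parallel subtest so that provisioning
// infrastructure for one set doesn't block the others.
package e2e

import "testing"

func TestPlatform(t *testing.T) {
	testSets := []string{"tests/test-set1", "tests/test-set2"} // assumed layout

	for _, set := range testSets {
		set := set // capture the loop variable for the subtest
		t.Run(set, func(t *testing.T) {
			t.Parallel() // subtests marked parallel run concurrently
			runTestSet(t, set)
		})
	}
}

// runTestSet stands in for applying the manifests of one test set and waiting for the result.
func runTestSet(t *testing.T, set string) {
	t.Logf("running %s", set)
}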
We need to figure out a storage system to enable stateful workloads.
The ability to run stateful workloads is a must. At the same time, running stateful workloads is complex. Here the complexity is on another level, considering the multi-/hybrid-cloud environment. At first, a basic solution for the most common use-cases will do.
Some of the utility functions, like for example:
... are used multiple times in different modules. To avoid duplication and for more code simplicity, it is good practice to move these utility functions to a separate package.
The task for this feature consists of finding the mentioned functions and creating a new Golang utility package for them, then refactoring the existing modules to work with the new package.
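A minimal sketch of what such a shared package could look like; the helpers below are only illustrative examples, not the actual functions referenced above:

// Hypothetical sketch: a shared utility package that individual services import
// instead of each keeping its own copy of common helpers.
package utils

import (
	"fmt"
	"os"
)

// GetEnvDefault returns the value of an environment variable or a default when it is unset.
func GetEnvDefault(key, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return def
}

// SanitiseName is an example helper for building provider-safe resource names.
func SanitiseName(base, suffix string) string {
	return fmt.Sprintf("%s-%s", base, suffix)
}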
The log level verbosity should allow developers to choose from different levels of verbosity for logs, e.g. debug level, dev level, etc. This should be easily configurable, meaning switching between levels of verbosity should be as easy as setting one variable.
Motivation
The more public cloud providers the platform supports, the better are its chances of success as a multi-cloud solution.
Add more public cloud providers as options for NodePools, LBs...
We need to ensure that certain entry-points to the cluster will be highly available.
Every K8s cluster needs to have load balancers at least for Ingress and K8s API traffic. Eventually this should be easily configurable for other types of LBs (TCP, UDP, NodePorts). The catch here is that the solution needs to work:
We might need to explore the possibilities of using LB-as-a-Service (from cloud-providers) vs home-made (e.g. haproxy) deployments. Understand the availability of this solution.
If I remember correctly, there was a doubt raised by @samuelstolicny & @bernardhalas about nodepools.
I don't exactly remember what that was about - please clarify, fill in the Issue, or let's discuss it first - as you prefer.
Motivation
TODO
Description
TODO
Exit Criteria
We need to ensure that the secrets we use in the platform are stored and managed in a secure manner.
Analyze the needs the platform has regarding a secret management solution. Based on those needs, propose and implement a secret management solution.
Right now, the CD pipeline is called right after every successful CI run. This approach is not ideal, because the codebase from a PR will propagate to the dev deployment (aka the "claudie namespace" in our dev cluster) as soon as the PR is created, not after it's considered working and merged to the master branch.
Run CD pipeline after the merge to master, not after successful CI.
Path to the CD pipeline YAML: .github/workflows/CD-pipeline-dev.yml
Current trigger:
on:
# Run after CI
workflow_run:
workflows: [ "CI pipeline for platform" ]
types: [ completed ]
# Manual trigger
workflow_dispatch:
This one should be really quick, so I consider it a "good first issue", but let's groom it first.
At the moment, Claudie uses a hardcoded S3 (GCS) bucket with a hardcoded name, which is presupposed to already exist.
In the future, we'll want to rework this into one or more of the following approaches:
First, we'll have to agree on an approach.
The Context-box will receive the first set of new configs. They will be added to the Scheduler queue and the Scheduler will process them normally. After they have been processed by the Scheduler, they are added to the Builder queue, to be processed by Builder. Builder will process them and all looks good.
Then the Context-box will receive a second set of configs. They will be added to the Scheduler queue and processed by the Scheduler normally. After that, they are NOT added to the Builder queue. Only after a restart of the Context-box pod is the Builder queue updated.
The Builder queue should be updated after the second set of configs has been processed by the Scheduler.
Reproduction revolves around TestSaveConfigFrontEnd in /services/context-box/client/client_test.go: run TestSaveConfigFrontEnd, run TestSaveConfigFrontEnd again, and modify TestSaveConfigFrontEnd to add 15 new configs each time the test runs (instead of one new config), with a time.Sleep used as dummy work in Builder:
log.Println("I got config: ", config.GetName())
//config = callTerraformer(config)
//config = callWireguardian(config)
//config = callKubeEleven(config)
time.Sleep(60 * time.Second) //dummy "work"
config.CurrentState = config.DesiredState // Update currentState
Within Claudie we need LB-as-a-Service for making endpoints highly-available.
Following the LB POC in #54, implement a solution for building on-demand LB clusters. The configuration syntax is described in #39, which makes it a prerequisite. Please bear in mind that a single LB cluster may be used for multiple services.
Credentials are misconfigured in the gcp.tpl file in the Terraformer module. Currently it's working because Terraform uses the GCP credentials from backend.tpl and skips the gcp.tpl credentials.
gcp.tpl should contain a functional path to the GCP credentials, because if we choose to use a different backend for Terraform, the GCP provider will stop working.
KubeEleven fails to create a k8s cluster and goes into an endless cycle of calling an API on the APIEndpoint, which keeps returning 404. With Hetzner, on the other hand, the k8s cluster is created successfully.
KubeEleven should be able to create a k8s cluster with GCP nodes as control nodes.
Reproduction involves docker compose and server/context-box/client/client_test.go.
The issue is on the master branch. Here is the debug.log from my run.
Motivation
Define interfaces (user or robotic ones) through which the clients can interact with the platform. I assume we will need
While working on the concurrency task I stumbled upon this interesting finding: gRPC returns a nil response together with an error. Let me briefly explain using our codebase.
config, err = callWireguardian(config)
if err != nil && config != nil {
config.CurrentState = config.DesiredState // Update currentState
// save error message to config
config.ErrorMessage = err.Error()
errSave := cbox.SaveConfigBuilder(c, &pb.SaveConfigRequest{Config: config})
if errSave != nil {
return fmt.Errorf("error while saving the config: %v", err)
}
return fmt.Errorf("error in Wireguardian: %v", err)
}
Currently, if Wireguardian fails, we expect that it returns an error message with the config back to Builder, so that Builder can set currentState and ErrorMessage and save it to the DB. But, because of gRPC's specific implementation (the client is automatically generated from the proto file), it always returns a nil response alongside the error:
func (c *wireguardianServiceClient) BuildVPN(ctx context.Context, in *BuildVPNRequest, opts ...grpc.CallOption) (*BuildVPNResponse, error) {
out := new(BuildVPNResponse)
err := c.cc.Invoke(ctx, "/platform.WireguardianService/BuildVPN", in, out, opts...)
if err != nil {
return nil, err
}
return out, nil
}
Surprisingly I didn't find this in Terraformer. 🤔
A possible solution could be using the richer error model; see the sketch after the links below.
Some useful links that I found about this topic:
https://stackoverflow.com/questions/61949913/why-cant-i-get-a-non-nil-response-and-err-from-grpc
https://stackoverflow.com/questions/48748745/pattern-for-rich-error-handling-in-grpc
https://grpc.io/docs/guides/error/#richer-error-model
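A minimal sketch of the richer error model with grpc-go, where the server attaches structured details to a status and the client unpacks them instead of relying on a non-nil response; the detail contents are hypothetical:

// Hypothetical sketch: the server side attaches structured details to a gRPC status,
// the client side recovers them with status.FromError.
package grpcerrors

import (
	"fmt"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// buildVPNError is what the Wireguardian server could return on failure.
func buildVPNError(clusterName string, cause error) error {
	st := status.New(codes.Internal, fmt.Sprintf("error in Wireguardian: %v", cause))
	detailed, err := st.WithDetails(&errdetails.ErrorInfo{
		Reason:   "BUILD_VPN_FAILED",
		Domain:   "wireguardian",
		Metadata: map[string]string{"cluster": clusterName},
	})
	if err != nil {
		return st.Err() // fall back to the plain status
	}
	return detailed.Err()
}

// inspectError is what Builder could do with the error it gets back.
func inspectError(err error) {
	if st, ok := status.FromError(err); ok {
		fmt.Println("code:", st.Code(), "message:", st.Message())
		for _, d := range st.Details() {
			fmt.Printf("detail: %+v\n", d)
		}
	}
}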
The development environment is not fully configured via code. In order for the environment to be brought up, the administrator of the environment needs to manually obtain several tools like Ansible, Terraform and KubeOne. The versioning of this toolset is not yet managed.
The project contains a description of the versions of tools it depends on and for which the testing and implementation is valid.
N/A
Motivation:
We need to be able to switch into log debug mode in case any issue is occurring, in order to be able to collect more data.
Description:
Rework all logging messages so that they include a log verbosity definition. Feel free to consider using the Log4Go library (or similar): https://github.com/jeanphorn/log4go.
Exit Criteria:
At the moment, Claudie doesn't validate the input config in any way whatsoever.
There should probably be some form of input validation, so that the user gets feedback if they mess something up.
In order to save resources, we would like to move our production cluster (where the platform runs) from GKE Autopilot to the standard GKE solution. The main reason is that Autopilot is currently not a suitable solution for our workload.
The task includes migrating the current deployments to the GKE solution with autoscaling and persisting it as IaC (infrastructure as code).
Use a private address and a load balancer (the reasoning is described in more detail in #43).
For IaC use Terraform. The best place should be our infra repository.
Create Terraform manifests for the GKE solution with
Motivation
At the moment, we're calling KubeOne via Shell, but there may exist a Golang-native API.
Using that instead could be cleaner.
Description
Refactor all usages of KubeOne shell calls into using a Golang-native API for KubeOne.
Exit Criteria
Child tasks from #50
We need to figure out a storage system to enable stateful workloads. This POC should find a suitable strategy using one of the storage solutions described below.
Consider the following storage solutions:
Focus on exploring the following strategy:
Orchestrate storage on the k8s cluster nodes by creating one storage cluster across multiple providers. This storage cluster will have a series of "zones", one for each cloud provider. Each zone should store its own persistent volume data.
Explore additional strategies if the one above turns out to be inappropriate/infeasible:
Motivation:
We need to be able to test the platform's functionality in a production-like environment.
Description:
Deploy the platform on a Kubernetes cluster and figure out how to run end-to-end tests on the current functionality of the platform.
Exit Criteria:
SaveConfigFrontEnd message to Context-box
Motivation
We need to be able to support on-premise clusters and hybrid-cloud clusters.
At the moment, we're having issues due to pods not being schedulable, and GKE being totally helpless:
Warning FailedScheduling 2m17s (x28 over 41m) gke.io/optimize-utilization-scheduler 0/8 nodes are available: 7 Insufficient memory, 8 Insufficient cpu.
Normal NotTriggerScaleUp 2m14s (x227 over 38m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 5 in backoff after failed scale-up
Furthermore, the pods actually get assigned some default resource requests/limits, but they are probably ridiculously high for the dev environment.
resources:
limits:
cpu: 500m
ephemeral-storage: 1Gi
memory: 2Gi
requests:
cpu: 500m
ephemeral-storage: 1Gi
memory: 2Gi
We should define resource requests for each K8s deployment, with sensible values.
That will help us avoid future problems with the pods being unschedulable, and it will enable us to fit more pods into fewer nodes.
Each deployment has its resources: section specified with sensible values based on real resource usage. The relevant MR is reviewed and merged.