flant / addon-operator Goto Github PK

A system to manage additional components for Kubernetes cluster in a simple, consistent and automated way.

Home Page: https://flant.github.io/addon-operator/

License: Apache License 2.0

Go 99.65% Dockerfile 0.35%

devops kubernetes kubernetes-addons kubernetes-operators

addon-operator's Issues

Straighten namespace and pod name resolution logic

This issue is a follow-up to the quick fix introduced in #46.

We shall do it in a proper way:

Remove the hardcoded namespace name
Don't use Hostname as the default pod name
Add two CLI args and use the two existing env variables as their defaults (ADDON_OPERATOR_NAMESPACE and ADDON_OPERATOR_POD)
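
The three steps above could be sketched roughly as follows. This is only an illustration: the flag names `--namespace` and `--pod` and the function `resolveIdentity` are hypothetical, and only the two environment variable names come from the issue.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// resolveIdentity parses the namespace and pod name from CLI args.
// Defaults come from ADDON_OPERATOR_NAMESPACE and ADDON_OPERATOR_POD.
// The flag names --namespace and --pod are hypothetical.
func resolveIdentity(args []string, getenv func(string) string) (string, string, error) {
	fs := flag.NewFlagSet("addon-operator", flag.ContinueOnError)
	namespace := fs.String("namespace", getenv("ADDON_OPERATOR_NAMESPACE"), "namespace the operator runs in")
	pod := fs.String("pod", getenv("ADDON_OPERATOR_POD"), "name of the operator pod")
	if err := fs.Parse(args); err != nil {
		return "", "", err
	}
	// No hardcoded namespace and no Hostname fallback: fail loudly instead.
	if *namespace == "" || *pod == "" {
		return "", "", errors.New("namespace and pod name must be set via flags or environment")
	}
	return *namespace, *pod, nil
}

func main() {
	// Demo inputs; a real binary would pass os.Args[1:] and os.Getenv.
	env := map[string]string{"ADDON_OPERATOR_NAMESPACE": "addons"}
	ns, pod, err := resolveIdentity([]string{"--pod", "addon-operator-0"}, func(k string) string { return env[k] })
	fmt.Println(ns, pod, err)
}
```

Passing the environment lookup as a function keeps the resolution logic testable without touching the process environment.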

Auto healing of module helm release

Helm run deduplication was introduced in #67, and now there is no simple way to upgrade a module's helm release. Addon-operator should check release resources and add a ModuleRun task if resources are deleted externally. Also, helm run deduplication should ignore the values checksum if resources are deleted by beforeHelm hooks.

Support for conversion and validating webhooks

Is your feature request related to a problem? Please describe.
There are two great PRs in the shell-operator, which add the functionality to use hooks as validating webhooks or conversion webhooks for CRDs.
flant/shell-operator#223
flant/shell-operator#250
We need to add the support for these kinds of hooks to addon-operator modules.

Additional context:
The most valuable part is switching hooks on and off on enabling or disabling a module (shell-operator is not capable of doing so without restarts).

JSON-patch values update partially broken

Hello.
I have run into weird behavior of JSON patches in addon-operator.

Short description:

array add works but sometimes rewrites the full array instead of appending (rarely, I can't reproduce it 100% of the time); this is not the main concern
json patch array remove doesn't work at all. The temporary json-patch file is created and applied by the operator (I guess so, because the file has disappeared), the hook finishes successfully, but no values are changed.

You can find detailed description and live demo here

This code is expected to work based on the examples and articles, but it doesn't. If I'm doing something wrong, please point me to the error.

Thank you.

Release v1.0.0-beta.6

Steps:

refactor custom metrics to use code from shell-operator flant/shell-operator#165 (done in #112)
fix tiller error monitoring: sometimes Addon-operator does not exit with code 1 (done in #113 #114)
add patches subcommand in RUNNING.md #8 (comment) (done in #115)
code spell #105
fix lint errors (done in #116)

Prioritize global hooks above modules

Hi there.
Thank you for the cool stuff!
Let me describe the case to make this proposal clear.

I have a global hook self-update.sh (with a schedule task) which checks for a new addon-operator image every N minutes and pulls it into the cluster. This brings new modules, patches, etc. to the cluster.

Then I made a broken module my-module-X, which cannot be deployed for some reason (a personal mistake, env troubles, etc.). It got stuck in the task queue, retrying every 5 seconds.
After I found the error in the module and fixed it, I have a new addon-operator image with the fix. It seems pretty easy: the currently running addon-operator pod should find the new image in a few minutes and redeploy it, but that is not possible, because the task queue is stuck in the following state:

Queue length 296

TASK_MODULE_RUN 'my-module-X' failed 579 times. 
# other modules #
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
...

The workaround is to redeploy the addon-operator pod in all clusters, but that breaks all the automation.

It would be nice if global hooks (or maybe only schedule tasks) were placed at the top of the queue. That would make self-repair based on schedules or events possible.

Thank you.
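
A minimal sketch of the prioritization proposed in this issue, assuming the queue is a plain slice. The `Task` type and `Enqueue` function are hypothetical and are not addon-operator's actual queue API; only the two task type names are taken from the queue dump above.

```go
package main

import "fmt"

// Task is a simplified stand-in for a queue entry.
type Task struct {
	Type string // TASK_MODULE_RUN or TASK_GLOBAL_HOOK_RUN, as in the dump above
	Name string
}

func isGlobalHook(t Task) bool { return t.Type == "TASK_GLOBAL_HOOK_RUN" }

// Enqueue appends module tasks to the tail, but inserts global hook tasks
// right after the global hook tasks that are already at the head,
// so global hooks keep FIFO order among themselves and run before modules.
func Enqueue(queue []Task, t Task) []Task {
	if !isGlobalHook(t) {
		return append(queue, t)
	}
	pos := 0
	for pos < len(queue) && isGlobalHook(queue[pos]) {
		pos++
	}
	queue = append(queue, Task{})    // grow by one
	copy(queue[pos+1:], queue[pos:]) // shift the tail to the right
	queue[pos] = t
	return queue
}

func main() {
	var q []Task
	q = Enqueue(q, Task{"TASK_MODULE_RUN", "my-module-X"})
	q = Enqueue(q, Task{"TASK_MODULE_RUN", "other-module"})
	q = Enqueue(q, Task{"TASK_GLOBAL_HOOK_RUN", "self-update.sh"})
	for _, t := range q {
		fmt.Println(t.Type, t.Name)
	}
}
```

With this ordering a failing module task can no longer keep the self-update hook waiting behind it. A real implementation would also have to decide what happens to the task that is currently being retried.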

Update documentation about modules enablement mechanism

We forgot to update the documentation in #33, and now LIFECYCLE.md (and maybe other places as well) contains misleading information. We should fix the docs ASAP.

Module CRD

Values are parameters for each module. CRD can be used to define them instead of yaml-strings in a ConfigMap.

validation
each module defines own CRD
addon-operator defines global CRD

Run hooks in named Task queues

Inspired by #30

There are several use cases where modules and hooks should not block schedule and kubernetes hooks, for example a secret copier or some cleanup tasks. Addon-operator can start several named queues and handlers to run hooks in parallel. The schedule and onKubernetesEvent binding configs should get a new flag with the name of the desired queue. The flag name is something like queueName, and an empty flag adds the hook to the default (main) queue.

The workaround for this problem is to use several instances of addon-operator and shell-operator, but hooks with a flag are much better for creating high-quality modules (addons).
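
The idea of named queues can be sketched with one worker goroutine per queue name. This is an illustration under assumptions: `QueueSet` and its methods are hypothetical names, and the fixed buffer size is arbitrary.

```go
package main

import (
	"fmt"
	"sync"
)

// QueueSet runs tasks in named queues, one worker goroutine per queue.
type QueueSet struct {
	mu     sync.Mutex
	queues map[string]chan func()
	wg     sync.WaitGroup
}

func NewQueueSet() *QueueSet {
	return &QueueSet{queues: make(map[string]chan func())}
}

// Add puts a task into the named queue; an empty name means the "main" queue.
// The worker for a queue is started on first use.
func (s *QueueSet) Add(queueName string, task func()) {
	if queueName == "" {
		queueName = "main"
	}
	s.mu.Lock()
	q, ok := s.queues[queueName]
	if !ok {
		q = make(chan func(), 64) // Add blocks when 64 tasks are pending
		s.queues[queueName] = q
		s.wg.Add(1)
		go func() {
			defer s.wg.Done()
			for t := range q {
				t()
			}
		}()
	}
	s.mu.Unlock()
	q <- task
}

// Close stops all queues and waits until queued tasks are finished.
// Add must not be called after Close.
func (s *QueueSet) Close() {
	s.mu.Lock()
	for _, q := range s.queues {
		close(q)
	}
	s.mu.Unlock()
	s.wg.Wait()
}

func main() {
	s := NewQueueSet()
	release := make(chan struct{})
	done := make(chan struct{})
	s.Add("", func() { <-release })          // a task that blocks the main queue
	s.Add("cleanup", func() { close(done) }) // still runs in its own queue
	<-done
	close(release)
	s.Close()
	fmt.Println("cleanup task finished while the main queue was blocked")
}
```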

Improve Tiller startup and logging

Choose open ports for tiller if ports are already in use. Useful for setups with hostNetwork: true.
Route tiller stdout and stderr to log messages.
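
One possible way to choose an open port is to try the configured one and fall back to a port assigned by the OS. The helper `pickPort` below is hypothetical, and 44134 is used only as an example of a preferred port. The port is released before it is returned, so another process could still grab it before tiller starts.

```go
package main

import (
	"fmt"
	"net"
)

// pickPort returns preferred when it can be bound on host,
// otherwise a free port chosen by the OS (port 0 means "any free port").
func pickPort(host string, preferred int) (int, error) {
	l, err := net.Listen("tcp", net.JoinHostPort(host, fmt.Sprint(preferred)))
	if err != nil {
		// The preferred port is busy: ask the OS for any free port.
		l, err = net.Listen("tcp", net.JoinHostPort(host, "0"))
		if err != nil {
			return 0, err
		}
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := pickPort("127.0.0.1", 44134)
	if err != nil {
		fmt.Println("no port available:", err)
		return
	}
	fmt.Println("tiller could listen on port", port)
}
```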

Execute Synchronization hooks in parallel

Synchronization hooks can be run in parallel if different queues are defined in bindings.

A preliminary plan:

introduce wait status for a task result. A task handler restarts a task every N ms until the task returns this status.
add flags synchronizationDone and synchronizationQueued to binding configs
queue Synchronization task into specified queue, not into main
ModuleRun task queues synchronization tasks and returns a wait status
ModuleRun executes beforeHelm hooks only when all kubernetes binding configs have the synchronizationDone flag set to true.
synchronizationDone is set to true on successful execution of a Synchronization hook.

Values are not updated on module discover error loop

Module discovery can get stuck in an error loop, e.g. when an enabled script crashes. In this case values for the enabled script are not updated between enabled script executions.
This happens on startup.

Add new module template

Is your feature request related to a problem? Please describe.
The main problem is that it is hard to develop a new module from scratch. Users need to know what they can add and how it changes module behavior.

Describe the solution you'd like to see
It would be convenient to have an example/template of a new module. For example, let's look at the helm create command. By calling it, a user creates a new folder with every file which has a special meaning (Chart.yaml, values.yaml, etc.). Each file contains a description of what you can change in it.
Personally, I don't think we need a command, a folder that you can copy to create a new module should be enough.

Files which should be included in the template:

README.md
enable.sh
hooks
templates
templates/openapi
values.yaml
.helmignore
Chart.yaml (?)

Describe alternatives you've considered
I think the only alternative for users is to read docs and learn what features the addon-operator has. This is hard for newcomers, though.

Helper for shell hooks: cli command to apply json patches

Design is as simple as:
addon-operator jsonpatch apply "patch" < input_json

e2e tests

flant/shell-operator#63

Too many json patches from hooks lead to memory leak and high cpu usage

Technical debt: patches should be compacted. https://github.com/flant/addon-operator/blob/master/pkg/utils/values.go#L174-L178

Add support for upper-level `values.yaml` (common for all modules) in module enablement mechanism

We have different behavior in two mechanisms, but it is counterintuitive and will definitely mislead all the users. When we merge values the sequence is: modules/values.yaml, modules/000-some-module-name/values.yaml, ConfigMap. And for module enablement mechanism we are missing the first part (modules/values.yaml) for now.
The ability to choose whether a module is enabled in the common values.yaml will allow implementing a scenario with a few sets of modules enabled by default: for example, it will allow having "minimal" and "standard" sets of modules. To do that, one will have to create two different values.yaml files and put the right one in the well-known location during startup.
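
A sketch of how the enablement decision could follow the same sequence as the values merge. The function `resolveEnabled` is hypothetical, and so is the choice of "disabled" as the result when no layer sets the flag.

```go
package main

import "fmt"

// resolveEnabled walks the layers in merge order
// (modules/values.yaml, the module's own values.yaml, ConfigMap);
// the last layer that sets the flag wins. A nil layer does not set it.
// Treating "never set" as disabled is an assumption of this sketch.
func resolveEnabled(layers ...*bool) bool {
	enabled := false
	for _, layer := range layers {
		if layer != nil {
			enabled = *layer
		}
	}
	return enabled
}

// ptr is a small helper to build optional flags.
func ptr(v bool) *bool { return &v }

func main() {
	// Enabled in the common values.yaml, not mentioned elsewhere.
	fmt.Println(resolveEnabled(ptr(true), nil, nil))
	// Enabled in the common values.yaml, disabled in the ConfigMap.
	fmt.Println(resolveEnabled(ptr(true), nil, ptr(false)))
}
```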

Improve logging fields

A global 'kubernetes' hook can trigger 'modules reload'. ModuleRun tasks will then have a 'hook' field with the name of the hook that triggered the event, but BeforeAll and AfterAll tasks will have a 'hook' field with that hook name. An additional field is required, for example, 'event.triggered-by'.
The 'module' field is empty for ModuleHookRun tasks and for 'queue task ModuleHookRun' records.
There is no information about the event for module 'kubernetes' hooks. It would be helpful if log records contained fields with the binding name and watchEvent.

Add werf support to deploy modules

~~Integrate helm and kubedog as werf do.~~

~~fix more helm issues~~
- ~~auto rollback~~
~~watch for resources on module run: ease diagnostics with more informative logs~~

Update 27.11.2020:

Add werf as another "helm-client"
- initial implementation in #50
- update PR to support helm command from werf 1.1
- add WERF_VERSION environment variable

New hooks API (update shell-operator)

Update the dependency and bring addon-operator's own hooks in line with the new API once flant/shell-operator#36, flant/shell-operator#37, flant/shell-operator#33, flant/shell-operator#32, and flant/shell-operator#35 are implemented in shell-operator.

Fix: Run kubernetes hooks with Synchronization event right after onStartup hooks

Relates to changes made for #42 and #41.

Right now kubernetes hooks run after the discovery of all modules. But the Synchronization event is useful for discovering information that is needed by modules.

EnableGlobalHooks method should return a set of tasks that will be queued after onStartup hooks.

Public contracts for values

Allow modules to have a public contract:

create a public section with keys named as modules
this section can be patched from global and module hooks

Or just allow to patch global values?

Eliminate hard code regex for module name

bad module directory names, must match regex '^[0-9][0-9][0-9]-(.*)$': /addons/modules/sysctl-tuner
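
One way to relax the check is to treat the numeric prefix as optional. In this sketch `parseModuleDir` and the default order are hypothetical. Only the regular expression is taken from the error message, with a second capture group added for the order.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strconv"
)

// moduleDirRe is the pattern from the error message,
// with the three digits captured as the module order.
var moduleDirRe = regexp.MustCompile(`^([0-9][0-9][0-9])-(.*)$`)

// parseModuleDir extracts the module name and order from a directory path.
// A directory without the NNN- prefix is accepted and gets defaultOrder.
func parseModuleDir(dir string, defaultOrder int) (name string, order int) {
	base := path.Base(dir)
	if m := moduleDirRe.FindStringSubmatch(base); m != nil {
		order, _ = strconv.Atoi(m[1]) // m[1] is always three digits
		return m[2], order
	}
	return base, defaultOrder
}

func main() {
	for _, dir := range []string{"/addons/modules/sysctl-tuner", "modules/000-some-module-name"} {
		name, order := parseModuleDir(dir, 100)
		fmt.Printf("%s -> name=%s order=%d\n", dir, name, order)
	}
}
```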

Consistency of reconciliation and onKubernetesEvents

Same as flant/shell-operator#42, but with all the modules logic in mind.

Proper ModuleRun restart if there are errors on startup

Module 'onStartup' hooks are not restarted if an error occurs.

Binding context for schedule hook pretends to be a context for kubernetes

Some hooks receive a wrong binding context and an error occurs:

schedule       ERROR: Can't find any handler from the list: __on_kubernetes::main, __main__"

    configVersion: v1
    schedule:
    - name: main
      queue: /modules/$(module::name::kebab_case)/schedul
      crontab: "* * * * *"
      includeSnapshotsFrom: [pods,nodes]
    kubernetes:
    - name: pods
      ...
    - name: nodes
      ...

Allow multiple Json patches for VALUES_JSON_PATCH_PATH

json.Decoder can parse a JSON stream, so a hook can output whole JSON patches or single JSON patch operations line by line without worrying about commas and square brackets.

https://play.golang.org/p/kwC8R7c1gcp
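
A minimal sketch of such stream parsing with the standard library. The `Operation` type and `decodePatches` function are hypothetical names for illustration and are not the code behind the playground link.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Operation is a single JSON patch operation.
type Operation struct {
	Op    string          `json:"op"`
	Path  string          `json:"path"`
	Value json.RawMessage `json:"value,omitempty"`
}

// decodePatches reads a stream of JSON values. Each value is either
// a full patch (an array of operations) or a single operation (an object).
func decodePatches(input string) ([]Operation, error) {
	dec := json.NewDecoder(strings.NewReader(input))
	var ops []Operation
	for {
		var raw json.RawMessage
		err := dec.Decode(&raw)
		if err == io.EOF {
			return ops, nil
		}
		if err != nil {
			return nil, err
		}
		if trimmed := bytes.TrimSpace(raw); len(trimmed) > 0 && trimmed[0] == '[' {
			var batch []Operation
			if err := json.Unmarshal(raw, &batch); err != nil {
				return nil, err
			}
			ops = append(ops, batch...)
			continue
		}
		var op Operation
		if err := json.Unmarshal(raw, &op); err != nil {
			return nil, err
		}
		ops = append(ops, op)
	}
}

func main() {
	input := `{"op":"add","path":"/a","value":1}
[{"op":"remove","path":"/b"},{"op":"add","path":"/c","value":"x"}]`
	ops, err := decodePatches(input)
	fmt.Println(len(ops), err)
}
```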

Actualize documentation about Helm and modules

Is your feature request related to a problem? Please describe.

No Helm3 info in MODULES.md, "Tiller" header is misleading.
No info about purge releases at the start.

Describe the solution you'd like to see

Helm3 compatibility should be stated in MODULES.md. workarounds-for-helm-issues and tiller should be combined under Helm2 header.
addon-operator lists releases at the start and purges releases without corresponding modules. It should be stated in LIFECYCLE.md.

Describe alternatives you've considered

Additional context

Error in resolving Helm version

Hello Dev team!

Original Dockerfile from repository with changed kubectl and helm version:

 # build libjq
FROM ubuntu:18.04 AS libjq
ENV DEBIAN_FRONTEND=noninteractive \
    DEBCONF_NONINTERACTIVE_SEEN=true \
    LC_ALL=C.UTF-8 \
    LANG=C.UTF-8

RUN apt-get update && \
    apt-get install -y git ca-certificates && \
    git clone https://github.com/flant/libjq-go /libjq-go && \
    cd /libjq-go && \
    git submodule update --init --recursive && \
    /libjq-go/scripts/install-libjq-dependencies-ubuntu.sh && \
    /libjq-go/scripts/build-libjq-static.sh /libjq-go /libjq


# build addon-operator binary linked with libjq
FROM golang:1.14 AS addon-operator
ARG appVersion=latest

# Cache-friendly download of go dependencies.
ADD go.mod go.sum /addon-operator/
WORKDIR /addon-operator
RUN go mod download

COPY --from=libjq /libjq /libjq
ADD . /addon-operator
WORKDIR /addon-operator

RUN git submodule update --init --recursive && ./go-build.sh $appVersion

FROM krallin/ubuntu-tini:bionic AS tini

# build final image
FROM ubuntu:18.04
RUN apt-get update && \
    apt-get install -y ca-certificates wget jq && \
    rm -rf /var/lib/apt/lists && \
    wget https://storage.googleapis.com/kubernetes-release/release/v1.17.9/bin/linux/amd64/kubectl -O /bin/kubectl && \
    chmod +x /bin/kubectl && \
    wget https://storage.googleapis.com/kubernetes-helm/helm-v2.16.10-linux-amd64.tar.gz -O /helm.tgz && \
    tar -z -x -C /bin -f /helm.tgz --strip-components=1 linux-amd64/helm linux-amd64/tiller && \
    rm -f /helm.tgz && \
    helm init --client-only && \
    mkdir /hooks
COPY --from=tini /usr/local/bin/tini /sbin/tini
COPY --from=addon-operator /addon-operator/addon-operator /
WORKDIR /
ENV MODULES_DIR /modules
ENV GLOBAL_HOOKS_DIR /global-hooks

Addon-operator recognizes Helm version >2.16 as Helm 3

Part of the log with debug enabled:

[DEBUG] Executing command 'helm version --short' in '' dir
[INFO] Helm 3 version: Client: v2.16.10+gbceca24 Server: v2.16.10+gbceca24

Because of this, we get the following error, since Addon-operator uses the helm3 module:

ModuleRun failed. Requeue task to retry after delay. Failed count is 117. Error: helm upgrade failed: exit status 1:
 Error: stat prometheus-operator: no such file or directory  binding=ReloadAllModules event.type=OperatorStartup module=prometheus-operator module.state=failed operator.component=taskRunner queue=main task.id=5c3f5e9b-62fb-4760-9b5b-7d6c4def1974

I worked around this issue by changing the code in helm.go (a quick fix, not the best one):

func Init(client kube.KubernetesClient) error {

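	// Select the Helm client explicitly instead of relying on version detection.
	helmVer := os.Getenv("HELM_CLIENT_VERSION")

	if helmVer == "3" {
		// Use helm3 only when it is requested explicitly.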
		err := helm3.Init(&helm3.Helm3Options{
			Namespace:  app.Namespace,
			HistoryMax: app.Helm3HistoryMax,
			Timeout:    app.Helm3Timeout,
			KubeClient: client,
		})
		if err == nil {
			NewClient = helm3.NewClient
			return nil
		}
	}
	// Fallback to helm2
	// TODO make tiller cancelable
	err := helm2.InitTillerProcess(helm2.TillerOptions{
		Namespace:          app.Namespace,
		HistoryMax:         app.TillerMaxHistory,
		ListenAddress:      app.TillerListenAddress,
		ListenPort:         app.TillerListenPort,
		ProbeListenAddress: app.TillerProbeListenAddress,
		ProbeListenPort:    app.TillerProbeListenPort,
	})
	if err != nil {
		return fmt.Errorf("init tiller: %s", err)
	}

	// Initialize helm2 client
	err = helm2.Init(&helm2.Helm2Options{
		Namespace:  app.Namespace,
		KubeClient: client,
	})
	if err != nil {
		return fmt.Errorf("init helm client: %s", err)
	}
	NewClient = helm2.NewClient
	HealthzHandler = helm2.TillerHealthHandler()
	return nil
}
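
An alternative to an environment switch is to read the major version from the `helm version --short` output. The sketch below is not the actual addon-operator detection code: `helmMajor` is a hypothetical helper, the Helm 2 sample is the string from the log above, and the Helm 3 sample string is an assumption about the short output format.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// semverRe finds the first vMAJOR.MINOR.PATCH token in the output.
var semverRe = regexp.MustCompile(`v(\d+)\.\d+\.\d+`)

// helmMajor returns the major version found in `helm version --short` output.
func helmMajor(output string) (int, error) {
	m := semverRe.FindStringSubmatch(output)
	if m == nil {
		return 0, fmt.Errorf("no version found in %q", output)
	}
	return strconv.Atoi(m[1])
}

func main() {
	for _, out := range []string{
		"Client: v2.16.10+gbceca24 Server: v2.16.10+gbceca24", // from the log above
		"v3.2.4+g0ad800e", // assumed Helm 3 short output
	} {
		major, err := helmMajor(out)
		fmt.Println(major, err)
	}
}
```

The caller can then initialize helm3 only when the major version is 3 and fall back to helm2 with tiller otherwise.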

Module should restart if values are changed in afterHelm

Module should be rescheduled in this case. There are two ways of implementation:

module can be added to the end of the queue
module can be restarted in-place as in case of error

Allow disabling of a module having errors (during its installation or in hooks)

The current implementation has a possible deadlock: if the module has errors, either in the helm templates or in hooks, the tasks are stuck in the queue and disabling the module in the ConfigMap doesn't help, because module tasks are not removed from the queue.

Proposed change: When the module becomes disabled (in ConfigMap, or by any other means) the following should happen (in the following order):

All informers should be stopped and all crontab bindings should be disabled, so no new tasks (possibly containing errors) will be added to the queue.
All existing module tasks should be removed from the queue.
Tasks associated with module disabling should be added (helm uninstall, and afterDeleteHelm hooks)
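
The second and third steps of the proposed order can be sketched as a pure function over the queue. The `Task` type, the `disableModule` function and the `ModuleDelete` task type name are hypothetical. Stopping informers and crontab bindings (the first step) is outside of this sketch.

```go
package main

import "fmt"

// Task is a simplified stand-in for a queue entry.
type Task struct {
	Type   string
	Module string // empty for global tasks
}

// disableModule removes all queued tasks of the module and then queues
// a task that uninstalls the release and runs afterDeleteHelm hooks.
func disableModule(queue []Task, module string) []Task {
	kept := make([]Task, 0, len(queue)+1)
	for _, t := range queue {
		if t.Module != module {
			kept = append(kept, t)
		}
	}
	// "ModuleDelete" is a hypothetical name for the disabling task.
	return append(kept, Task{Type: "ModuleDelete", Module: module})
}

func main() {
	queue := []Task{
		{Type: "ModuleRun", Module: "broken"},
		{Type: "GlobalHookRun"},
		{Type: "ModuleHookRun", Module: "broken"},
		{Type: "ModuleRun", Module: "healthy"},
	}
	for _, t := range disableModule(queue, "broken") {
		fmt.Println(t.Type, t.Module)
	}
}
```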

Allow specifying name, namespace and order for the module

Problems:

We automatically camel-case the module name, but if the module name contains an abbreviation, we get ugly things, like nodeLocalDns instead of nodeLocalDNS.
Sometimes we need to know module namespace, but we don't have centralized mapping.
It would be generally better if we will pass namespace to helm install, at least:
- we will have namespaces in helm list,
- we will be able to use Release.Namespace, instead of hardcoding namespace name into each template.
Module order is part of ugly hardcode.

Proposal:

Add module.yaml to each module, and make it required. It should contain, at least, name (camel-cased), namespace and order, all required.
Use name (camel-cased) from module.yaml when generating helm values, and in all other places (list of enabled modules, logs, etc).
Add the automagic global variable with module-to-namespace mapping.

Extra logic:

When the namespace of the module is changed, perhaps the module should be uninstalled first (as far as I remember, helm doesn't allow changing the namespace of an installed release).
When module name changes, for addon-operator it should be similar to one module gone, and a new one added.

Deduplicator of Helm runs should use helm render

The checksum is currently calculated over the values and the files in the chart directory. But changes in values do not necessarily have to lead to a helm upgrade. Calculating the checksum over the helm render output should be more useful.
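
A sketch of deduplication keyed on the rendered output. `Deduplicator` and its methods are hypothetical names. For brevity the checksum is recorded at check time, while a real implementation would probably record it only after a successful upgrade.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Deduplicator remembers the checksum of the last rendered manifest per release.
type Deduplicator struct {
	last map[string]string
}

func NewDeduplicator() *Deduplicator {
	return &Deduplicator{last: make(map[string]string)}
}

// NeedsUpgrade reports whether the rendered manifest differs from the one
// seen last time for this release, and remembers the new checksum.
func (d *Deduplicator) NeedsUpgrade(release, rendered string) bool {
	sum := sha256.Sum256([]byte(rendered))
	checksum := hex.EncodeToString(sum[:])
	if d.last[release] == checksum {
		return false
	}
	d.last[release] = checksum
	return true
}

func main() {
	d := NewDeduplicator()
	manifest := "kind: ConfigMap\ndata:\n  replicas: \"2\"\n"
	fmt.Println(d.NeedsUpgrade("my-module", manifest)) // first run
	// Values changed, but the render output is the same: no helm run is needed.
	fmt.Println(d.NeedsUpgrade("my-module", manifest))
}
```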

[META] Sprints planning

This plan can be revised when new issues appear. Feel free to comment on issues in milestones, and we will do our best to stick to the dates.

TODO: plan and links to milestones

Refactor handling labels for logging, metrics and debug commands

There is a myriad of map[string]string values initialized and passed between layers. The upper-level code has to know which labels to pass to the lower-level methods, and the lower-level code has no access to additional information.
It seems that Context can help: labels can be stored as context values. Upper-level code just passes a context, and lower-level code retrieves everything from this context.

This requires changes in shell-operator too.
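
The Context idea could look roughly like this. All names here (`WithLabels`, `LabelsFrom`, `runTask`, `runHook`) are hypothetical and only illustrate how labels accumulate while a context travels down the layers.

```go
package main

import (
	"context"
	"fmt"
)

// labelsKey is an unexported key type, so other packages cannot collide with it.
type labelsKey struct{}

// WithLabels returns a context carrying the existing labels plus the new ones.
// The stored map is copied, so contexts of the callers are never mutated.
func WithLabels(ctx context.Context, labels map[string]string) context.Context {
	merged := make(map[string]string)
	for k, v := range LabelsFrom(ctx) {
		merged[k] = v
	}
	for k, v := range labels {
		merged[k] = v
	}
	return context.WithValue(ctx, labelsKey{}, merged)
}

// LabelsFrom returns the labels stored in ctx, or nil if there are none.
// Callers should treat the returned map as read-only.
func LabelsFrom(ctx context.Context) map[string]string {
	labels, _ := ctx.Value(labelsKey{}).(map[string]string)
	return labels
}

// runHook stands in for lower-level code: it adds its own label
// and sees everything the upper levels have added.
func runHook(ctx context.Context) map[string]string {
	ctx = WithLabels(ctx, map[string]string{"hook": "self-update.sh"})
	return LabelsFrom(ctx)
}

// runTask stands in for upper-level code.
func runTask() map[string]string {
	ctx := WithLabels(context.Background(), map[string]string{"queue": "main", "task.id": "XXX"})
	return runHook(ctx)
}

func main() {
	fmt.Println(runTask())
}
```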

Modules discovery process rethinking

The discovery process is currently static: it calculates the enabled state for all modules at once. A better way is to calculate the first enabled module and run it, then calculate the next enabled module and run it, and so on until all modules are run. This more dynamic algorithm also helps resolve #16 and #43.

ModuleRun tasks can be queued while global hooks are run

A global hook can change a module section in configmap/addon-operator, and this leads to running an enabled script. This is OK when all global onStartup and kubernetes.Synchronization hooks have already run, but it is terrible if the enabled script depends on several global hook results (think of discovery).

One possible solution is to disable handling of changes in module sections until the first DiscoverModuleState task is finished.

Secret values

All values are now in cm/addon-operator. That is bad for sensitive data.

Support Vault or something like this.

Implement metrics for failing values validations

Module configurations are stored in a ConfigMap, which can be changed directly via kubectl, so a validation error may occur and would be visible only in the operator logs.

To make misconfigured modules observable, we might want to implement a metric representing this condition.

This metric would be used in alerts that notify system operators about a configuration error.

If --config of module's hook failed, hook ignored

If a hook returns invalid JSON for the --config request, addon-operator just ignores this hook.
In this case addon-operator must fail or keep retrying --config.

Full support for CRDs

After full support for CRDs is implemented in shell-operator, implement the same logic in addon-operator, but with modules in mind:

CRDs for modules should be installed only when the module is enabled.
Disabling the module (and removing) should uninstall CRDs having no objects (CRDs having stored objects should not be deleted).
TODO

Non-goals:

Migration from ConfigMap to CRD is not a goal of this feature and should be addressed separately.

Introduce names of different types of values for the documentation and further use

@diafour, what do you think if we introduce the following names and change the documentation appropriately:

global static values (/modules/values.yaml)
module static values (modules/123-some-module/values.yaml)
global config values
module config values
global dynamic values and global dynamic values patches
module dynamic values and module dynamic values patches
resulting global values
resulting module values
resulting values, or just values – resulting global and module values merged
?

And of course we shall briefly (but precisely and accessibly) describe each of these types of values right at the beginning of VALUES.md, and then use these names everywhere.

Non-informative error about failed enabled script

If an enabled script does not exit with code 0, the error message does not have enough information about which enabled script failed.

{"level":"info",
  "msg":"Modules enabled by script: []",
  "operator.component":"moduleManager,discoverModulesState"}

{"event.id":"OperatorOnStartup","level":"error",
  "msg":"DiscoverModulesState failed, queue Delay task to retry. Failed count is 24. Error: exit status 1",
  "operator.component":"TaskRunner","task.id":"XXX","task.type":"DiscoverModulesState"}

addon-operator/pkg/module_manager/module.go

Line 490 in f40af75

return false, err

Also, add more information to other errors in checkIsEnabledByScript method.
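
A sketch of how the returned error could carry the module name and script path. The function `runEnabledScript` and its signature are hypothetical and are not the real checkIsEnabledByScript code. Error wrapping with `%w` keeps the original error available to `errors.Is`.

```go
package main

import (
	"errors"
	"fmt"
)

// errExit imitates the bare error from the log above.
var errExit = errors.New("exit status 1")

// runEnabledScript runs the enabled script of a module via run
// and adds the module name and the script path to any error.
func runEnabledScript(moduleName, scriptPath string, run func(path string) error) (bool, error) {
	if err := run(scriptPath); err != nil {
		return false, fmt.Errorf("module %q: enabled script %q failed: %w", moduleName, scriptPath, err)
	}
	return true, nil
}

func main() {
	failing := func(string) error { return errExit }
	_, err := runEnabledScript("my-module", "/modules/010-my-module/enabled", failing)
	fmt.Println(err)
	fmt.Println("still matches the original error:", errors.Is(err, errExit))
}
```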

JSON logs

flant/shell-operator#62

Allow using an external chart for a module

A module can have no templates directory and declare dependencies in Chart.yaml. The problem is that addon-operator constructs values for helm in this form:

global:
  globalKey1: ...
  globalKey2: ...
  ...
moduleNameInCamelCase:
  moduleKey1: ...
  moduleKey2: ...

But an external chart does not expect values in this form. Should we define a values transformation in module.yaml (#39), or is there a better solution for this problem?

Global hooks with schedules do not work

Hi. I am trying to make a global hook with a schedule like this:

cat global-hooks/check-update.sh 
#!/usr/bin/env bash
if [[ $1 == "--config" ]] ; then
  cat <<EOF
{
  "configVersion":"v1",
  "schedule": [
    {
      "name": "every 1 min",
      "crontab": "* * * * *"
    }
  ],
  "onStartup": 1
}
EOF
exit 0
fi

echo "I AM GLOBAL HOOK"
echo $(date) >> /tmp/date.txt

Only the onStartup binding runs the script.

time="2020-02-29T08:27:30Z" level=info msg="I AM GLOBAL HOOK" binding=onStartup event.type=OperatorStartup hook=check-update.sh hook.type=global output=stdout queue=main task.id=e54568b1-449d-472d-8065-744bdf4b1554
time="2020-02-29T08:27:30Z" level=info msg="Global hook success" binding=onStartup event.type=OperatorStartup hook=check-update.sh hook.type=global operator.component=taskRunner queue=main task.id=e54568b1-449d-472d-8065-744bdf4b1554

Also, if I make the script a module hook, it works.

Related logs - https://pastebin.com/Vs01a3dT
There are also 2 modules, but they are disabled.

Is it supported right now?

Add go module

Use go modules instead of go get

Builds:

addon-operator/Dockerfile

Line 4 in 22bd314

RUN go get -d github.com/flant/addon-operator/...

addon-operator/Dockerfile-alpine3.9

Line 4 in 22bd314

RUN go get -d github.com/flant/addon-operator/...

https://github.com/golang/go/wiki/Modules

debug command: module render

Add a command to render module helm templates with the current merged values or with values from a file.

addon-operator module render <module-name> [--set stringArray] [--values file.yaml]

Fix debug process

Same steps as in shell-operator: flant/shell-operator#8
