
addon-operator's People

Contributors

akaitux, alex123012, alexey-igrychev, asviel, dependabot[bot], diafour, distorhead, dmitrykob, flant-team-sysdev, libmonsoon-dev, miklezzzz, nabokihms, name212, shurup, yalosev, z9r5, zuzzas

addon-operator's Issues

Auto healing of module helm release

Helm run deduplication was introduced in #67, and now there is no simple way to upgrade a module's helm release. Addon-operator should check release resources and add a ModuleRun task if resources are deleted externally. Also, helm run deduplication should ignore the values checksum if resources are deleted by beforeHelm hooks.

Support for conversion and validating webhooks

Is your feature request related to a problem? Please describe.
There are two great PRs in the shell-operator, which add the functionality to use hooks as validating webhooks or conversion webhooks for CRDs.
flant/shell-operator#223
flant/shell-operator#250
We need to add the support for these kinds of hooks to addon-operator modules.

Additional context:
The most valuable part is switching hooks on and off on enabling or disabling a module (shell-operator is not capable of doing so without restarts).

JSON-patch values update partially broken

Hello.
I have run into weird JSON patch behavior in addon-operator.

Short description:

  • array add works but sometimes rewrites the full array instead of appending (rarely; I can't reproduce it 100% of the time). This is not the main issue.
  • JSON patch array remove doesn't work at all. The temporary json-patch file is created and applied by the operator (I guess so, because the file has disappeared), the hook finishes successfully, but no values are changed.

You can find a detailed description and live demo here

This code is expected to work based on the examples and articles, but it doesn't. Maybe I'm doing something wrong; if so, please point me to the error.

Thank you.

Prioritize global hooks above modules

Hi there.
Thank you for the cool stuff!
Let me describe the case to make this proposal clear.

I have a global hook self-update.sh (with a schedule binding) which checks for a new addon-operator image every N minutes and pulls it to the cluster. This brings new modules, patches, etc. to the cluster.

Then I made a broken module, my-module-X, which could not be deployed for some reason (a personal mistake, environment trouble, etc.). It got stuck in the task queue, retrying the release every 5 seconds.
After I found the error in the module and fixed it, there is a new addon-operator image with the fix. It seems pretty easy: the currently running addon-operator pod should find the new image in a few minutes and redeploy itself. But this is not possible, because the task queue is stuck in the following state:

Queue length 296

TASK_MODULE_RUN 'my-module-X' failed 579 times. 
# other modules #
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
TASK_GLOBAL_HOOK_RUN 'self-update.sh'
...

The solution here is to redeploy the addon-operator pod in all clusters, but that breaks all the automation.

It would be nice if global hooks (or maybe only schedule tasks) went to the top of the queue. That would make self-repair based on schedules or events possible.
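The proposal above can be sketched as a queue that keeps global hook tasks ahead of module tasks. This is only an illustration of the idea, not addon-operator's queue: the `Task`, `Kind`, and `Queue` types and the `AddPrioritized` method are hypothetical names introduced for this example.

```go
package main

import "fmt"

// Kind is a hypothetical task kind used only in this sketch.
type Kind int

const (
	ModuleRun Kind = iota
	GlobalHookRun
)

// Task is a hypothetical queue item, not the addon-operator task type.
type Task struct {
	Kind Kind
	Name string
}

// Queue keeps all global hook tasks in front of all other tasks.
type Queue struct {
	items []Task
}

// AddPrioritized inserts a global hook task right after the global hook
// tasks that are already queued; any other task is appended to the tail.
func (q *Queue) AddPrioritized(t Task) {
	if t.Kind != GlobalHookRun {
		q.items = append(q.items, t)
		return
	}
	// Global hook tasks always form a prefix of the queue, so the
	// insertion point is the first non-global task.
	pos := 0
	for pos < len(q.items) && q.items[pos].Kind == GlobalHookRun {
		pos++
	}
	q.items = append(q.items, Task{})
	copy(q.items[pos+1:], q.items[pos:])
	q.items[pos] = t
}

// Names returns task names in execution order.
func (q *Queue) Names() []string {
	names := make([]string, 0, len(q.items))
	for _, t := range q.items {
		names = append(names, t.Name)
	}
	return names
}

func main() {
	q := &Queue{}
	q.AddPrioritized(Task{ModuleRun, "my-module-X"})
	q.AddPrioritized(Task{ModuleRun, "other-module"})
	q.AddPrioritized(Task{GlobalHookRun, "self-update.sh"})
	fmt.Println(q.Names())
}
```

With this ordering, a module task that fails repeatedly stays behind the scheduled global hook instead of blocking it.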

Thank you.

Module CRD

Values are parameters for each module. CRD can be used to define them instead of yaml-strings in a ConfigMap.

  • validation
  • each module defines own CRD
  • addon-operator defines global CRD

Run hooks in named Task queues

Inspired by #30

There are several use cases when modules and hooks should not block schedule and kubernetes hooks. For example, secret copier or some cleaning tasks. Addon-operator can start several named queues and handlers to run hooks in parallel. The schedule and onKubernetesEvent binding configs should have a new flag with the name of the desired queue. The flag name is something like queueName, and an empty flag adds the hook to the default (main) queue.

The workaround for this problem is to use several instances of addon-operator and shell-operator, but hooks with a flag are much better for creating high-quality modules (addons).
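A rough sketch of the named-queue idea follows. It is not the addon-operator implementation: `QueueSet`, `Add`, and `Close` are hypothetical names, and each queue is modeled as a buffered channel served by its own goroutine so that a slow task in one queue does not delay the others.

```go
package main

import (
	"fmt"
	"sync"
)

// QueueSet is a hypothetical registry of named task queues.
// Every queue gets its own worker goroutine.
type QueueSet struct {
	mu     sync.Mutex
	queues map[string]chan func()
	wg     sync.WaitGroup
}

func NewQueueSet() *QueueSet {
	return &QueueSet{queues: make(map[string]chan func())}
}

// Add puts a task into the named queue and creates the queue on first use.
// An empty name selects the "main" queue.
func (s *QueueSet) Add(name string, task func()) {
	if name == "" {
		name = "main"
	}
	s.mu.Lock()
	ch, ok := s.queues[name]
	if !ok {
		ch = make(chan func(), 64)
		s.queues[name] = ch
		s.wg.Add(1)
		go func() {
			defer s.wg.Done()
			// Tasks inside one queue still run sequentially.
			for t := range ch {
				t()
			}
		}()
	}
	s.mu.Unlock()
	ch <- task
}

// Close stops all queues and waits until queued tasks are finished.
// Add must not be called after Close.
func (s *QueueSet) Close() {
	s.mu.Lock()
	for _, ch := range s.queues {
		close(ch)
	}
	s.mu.Unlock()
	s.wg.Wait()
}

func main() {
	s := NewQueueSet()
	s.Add("", func() { fmt.Println("module run in main queue") })
	s.Add("secret-copier", func() { fmt.Println("hook in its own queue") })
	s.Close()
}
```

The order of the two printed lines is not deterministic, which is the point: the queues are independent.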

Improve Tiller startup and logging

  • Choose open ports for tiller if ports are already in use. Useful for setups with hostNetwork: true.
  • Route tiller stdout and stderr to log messages.

Execute Synchronization hooks in parallel

Synchronization hooks can be run in parallel if different queues are defined in bindings.

A preliminary plan:

  • introduce wait status for a task result. A task handler restarts a task every N ms until the task returns this status.
  • add flags synchronizationDone and synchronizationQueued to binding configs
  • queue Synchronization task into specified queue, not into main
  • ModuleRun task queues synchronization tasks and returns a wait status
  • ModuleRun executes beforeHelm hooks only when all kubernetes binding configs have the synchronizationDone flag set to true.
  • synchronizationDone is set to true on successful execution of a Synchronization hook.

Add new module template

Is your feature request related to a problem? Please describe.
The main problem is that it is hard to develop a new module from scratch. Users need to know what they can add and how it changes module behavior.

Describe the solution you'd like to see
It would be convenient to have an example/template of a new module. For example, let's look at the helm create command. By calling it, a user creates a new folder with every file which has a special meaning (Chart.yaml, values.yaml, etc.). Each file contains a description of what you can change in it.
Personally, I don't think we need a command, a folder that you can copy to create a new module should be enough.

Files which should be included in the template:

  • README.md
  • enable.sh
  • hooks
  • templates
  • templates/openapi
  • values.yaml
  • .helmignore
  • Chart.yaml (?)

Describe alternatives you've considered
I think the only alternative for users is to read docs and learn what features the addon-operator has. This is hard for newcomers, though.

Add support for upper-level `values.yaml` (common for all modules) in module enablement mechanism

  1. We have different behavior in the two mechanisms, which is counterintuitive and will definitely mislead users. When we merge values, the sequence is: modules/values.yaml, modules/000-some-module-name/values.yaml, ConfigMap. The module enablement mechanism currently skips the first part (modules/values.yaml).
  2. The ability to choose whether a module is enabled in the common values.yaml will allow implementing a scenario with several sets of modules enabled by default; for example, it will allow having "minimal" and "standard" sets of modules. To do that, one will have to create two different values.yaml files and put the right one in the well-known location during startup.
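The merge sequence described in item 1 can be illustrated with a small layered merge in which later layers override earlier ones. This is a sketch, not addon-operator code: `mergeValues` is a hypothetical helper and the `someModuleEnabled` key is used only for illustration.

```go
package main

import "fmt"

// mergeValues merges layers from left to right. A key in a later layer
// overrides the same key in an earlier layer; nested maps are merged
// recursively.
func mergeValues(layers ...map[string]interface{}) map[string]interface{} {
	out := map[string]interface{}{}
	for _, layer := range layers {
		for k, v := range layer {
			src, srcIsMap := v.(map[string]interface{})
			dst, dstIsMap := out[k].(map[string]interface{})
			if srcIsMap && dstIsMap {
				out[k] = mergeValues(dst, src)
			} else {
				out[k] = v
			}
		}
	}
	return out
}

func main() {
	// Layers in the order named above; the contents are made up.
	commonValues := map[string]interface{}{"someModuleEnabled": false}
	moduleValues := map[string]interface{}{"someModule": map[string]interface{}{"replicas": 1}}
	configMap := map[string]interface{}{"someModuleEnabled": true}

	merged := mergeValues(commonValues, moduleValues, configMap)
	fmt.Println(merged["someModuleEnabled"])
}
```

If the enablement mechanism used the same three layers, a default set in the common values.yaml could still be overridden per cluster in the ConfigMap.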

Improve logging fields

  1. global 'kubernetes' hook can trigger 'modules reload' and ModuleRun tasks will have 'hook' field with the name of hook that triggered an event but BeforeAll and AfterAll tasks will have 'hook' field with that hook name. An additional field is required, for example, 'event.triggered-by'.

  2. 'module' field is empty for ModuleHookRun tasks and for 'queue task ModuleHookRun' records.

  3. No information about an event for module 'kubernetes' hooks. It would be helpful if log records contained fields with the binding name and watchEvent.

Add werf support to deploy modules

Integrate helm and kubedog as werf does.

  • fix more helm issues
    • auto rollback
  • watch for resources on module run: ease diagnostics with more informative logs

Update 27.11.2020:

  • Add werf as another "helm-client"
    • initial implementation in #50
    • update PR to support helm command from werf 1.1
    • add WERF_VERSION environment variable

Public contracts for values

Allow modules to have a public contract:

  • create a public section with keys named as modules
  • this section can be patched from global and module hooks

Or just allow to patch global values?

Binding context for schedule hook pretends to be a context for kubernetes

Some hooks receive a wrong binding context and an error occurs:

schedule       ERROR: Can't find any handler from the list: __on_kubernetes::main, __main__"
    configVersion: v1
    schedule:
    - name: main
      queue: /modules/$(module::name::kebab_case)/schedul
      crontab: "* * * * *"
      includeSnapshotsFrom: [pods,nodes]
    kubernetes:
    - name: pods
      ...
    - name: nodes
      ...

Actualize documentation about Helm and modules

Is your feature request related to a problem? Please describe.

  • No Helm3 info in MODULES.md, "Tiller" header is misleading.

  • No info about purging releases at startup.

Describe the solution you'd like to see

  1. Helm3 compatibility should be stated in MODULES.md. workarounds-for-helm-issues and tiller should be combined under Helm2 header.
  2. addon-operator lists releases at startup and purges releases without corresponding modules. It should be stated in LIFECYCLE.md.

Describe alternatives you've considered

Additional context

Error in resolving Helm version

Hello Dev team!

Original Dockerfile from repository with changed kubectl and helm version:

# build libjq
FROM ubuntu:18.04 AS libjq
ENV DEBIAN_FRONTEND=noninteractive \
    DEBCONF_NONINTERACTIVE_SEEN=true \
    LC_ALL=C.UTF-8 \
    LANG=C.UTF-8

RUN apt-get update && \
    apt-get install -y git ca-certificates && \
    git clone https://github.com/flant/libjq-go /libjq-go && \
    cd /libjq-go && \
    git submodule update --init --recursive && \
    /libjq-go/scripts/install-libjq-dependencies-ubuntu.sh && \
    /libjq-go/scripts/build-libjq-static.sh /libjq-go /libjq


# build addon-operator binary linked with libjq
FROM golang:1.14 AS addon-operator
ARG appVersion=latest

# Cache-friendly download of go dependencies.
ADD go.mod go.sum /addon-operator/
WORKDIR /addon-operator
RUN go mod download

COPY --from=libjq /libjq /libjq
ADD . /addon-operator
WORKDIR /addon-operator

RUN git submodule update --init --recursive && ./go-build.sh $appVersion

FROM krallin/ubuntu-tini:bionic AS tini

# build final image
FROM ubuntu:18.04
RUN apt-get update && \
    apt-get install -y ca-certificates wget jq && \
    rm -rf /var/lib/apt/lists && \
    wget https://storage.googleapis.com/kubernetes-release/release/v1.17.9/bin/linux/amd64/kubectl -O /bin/kubectl && \
    chmod +x /bin/kubectl && \
    wget https://storage.googleapis.com/kubernetes-helm/helm-v2.16.10-linux-amd64.tar.gz -O /helm.tgz && \
    tar -z -x -C /bin -f /helm.tgz --strip-components=1 linux-amd64/helm linux-amd64/tiller && \
    rm -f /helm.tgz && \
    helm init --client-only && \
    mkdir /hooks
COPY --from=tini /usr/local/bin/tini /sbin/tini
COPY --from=addon-operator /addon-operator/addon-operator /
WORKDIR /
ENV MODULES_DIR /modules
ENV GLOBAL_HOOKS_DIR /global-hooks

Addon-operator recognizes Helm versions >2.16 as Helm 3.

Part of the log with debug enabled:

[DEBUG] Executing command 'helm version --short' in '' dir
{INFO] Helm 3 version: Client: v2.16.10+gbceca24 Server: v2.16.10+gbceca24

Because of this, we get the following error, since addon-operator uses its Helm 3 client code:

ModuleRun failed. Requeue task to retry after delay. Failed count is 117. Error: helm upgrade failed: exit status 1:
 Error: stat prometheus-operator: no such file or directory  binding=ReloadAllModules event.type=OperatorStartup module=prometheus-operator module.state=failed operator.component=taskRunner queue=main task.id=5c3f5e9b-62fb-4760-9b5b-7d6c4def1974

I worked around this issue by changing the code in helm.go (a quick fix, not the best one):

func Init(client kube.KubernetesClient) error {

	HelmVer := os.Getenv("HELM_CLIENT_VERSION")

	if HelmVer == "3" {
		// Try helm3 first
		err := helm3.Init(&helm3.Helm3Options{
			Namespace:  app.Namespace,
			HistoryMax: app.Helm3HistoryMax,
			Timeout:    app.Helm3Timeout,
			KubeClient: client,
		})
		if err == nil {
			NewClient = helm3.NewClient
			return nil
		}
	}
	// Fallback to helm2
	// TODO make tiller cancelable
	err := helm2.InitTillerProcess(helm2.TillerOptions{
		Namespace:          app.Namespace,
		HistoryMax:         app.TillerMaxHistory,
		ListenAddress:      app.TillerListenAddress,
		ListenPort:         app.TillerListenPort,
		ProbeListenAddress: app.TillerProbeListenAddress,
		ProbeListenPort:    app.TillerProbeListenPort,
	})
	if err != nil {
		return fmt.Errorf("init tiller: %s", err)
	}

	// Initialize helm2 client
	err = helm2.Init(&helm2.Helm2Options{
		Namespace:  app.Namespace,
		KubeClient: client,
	})
	if err != nil {
		return fmt.Errorf("init helm client: %s", err)
	}
	NewClient = helm2.NewClient
	HealthzHandler = helm2.TillerHealthHandler()
	return nil
}
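As an alternative to an environment variable, the major version could be derived from the `helm version --short` output shown in the log above. The following is a sketch under assumptions: `helmMajor` and its regular expression are hypothetical, and the Helm 3 sample string in `main` is illustrative rather than taken from this report.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// semverRe matches the first "vMAJOR.MINOR.PATCH" token in the output.
var semverRe = regexp.MustCompile(`v(\d+)\.\d+\.\d+`)

// helmMajor extracts the major version number from the output of
// `helm version --short`.
func helmMajor(out string) (int, error) {
	m := semverRe.FindStringSubmatch(out)
	if m == nil {
		return 0, fmt.Errorf("no version found in %q", out)
	}
	return strconv.Atoi(m[1])
}

func main() {
	samples := []string{
		// Output from the log in this issue.
		"Client: v2.16.10+gbceca24 Server: v2.16.10+gbceca24",
		// Illustrative Helm 3 style output (assumed sample).
		"v3.2.4+g0ad800e",
	}
	for _, s := range samples {
		major, err := helmMajor(s)
		fmt.Println(major, err)
	}
}
```

A check like `helmMajor(out) == 3` would then select the Helm 3 client, and anything else would fall back to Helm 2.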

Allow disabling of module having errors (during its installation or in hooks)

The current implementation has a possible deadlock: if a module has errors, either in the helm templates or in hooks, its tasks are stuck in the queue, and disabling the module in the ConfigMap doesn't help because module tasks are not removed from the queue.

Proposed change: When the module becomes disabled (in ConfigMap, or by any other means) the following should happen (in the following order):

  1. All informers should be stopped and all crontab bindings should be disabled, so no new tasks (possibly containing errors) will be added to the queue.
  2. All existing module tasks should be removed from the queue.
  3. Tasks associated with module disabling should be added (helm uninstall, and afterDeleteHelm hooks)
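Steps 2 and 3 of the proposed order can be sketched as a filter over the queue followed by appending the delete task. This is an illustration only: the `Task` type, `disableModule`, and the `TASK_MODULE_DELETE` name are hypothetical, and step 1 (stopping informers and crontab bindings) is represented only by a comment.

```go
package main

import "fmt"

// Task is a hypothetical queue item used only in this sketch.
type Task struct {
	Type   string
	Module string
}

// disableModule removes all queued tasks of the module and appends one
// task that performs helm uninstall and afterDeleteHelm hooks.
func disableModule(queue []Task, module string) []Task {
	// Step 1 (not modeled here): stop informers and crontab bindings
	// so that no new tasks for the module are queued.

	// Step 2: drop existing tasks of the module.
	kept := make([]Task, 0, len(queue)+1)
	for _, t := range queue {
		if t.Module != module {
			kept = append(kept, t)
		}
	}

	// Step 3: queue the task associated with module disabling.
	return append(kept, Task{Type: "TASK_MODULE_DELETE", Module: module})
}

func main() {
	queue := []Task{
		{"TASK_MODULE_RUN", "my-module-X"},
		{"TASK_MODULE_RUN", "other-module"},
		{"TASK_MODULE_HOOK_RUN", "my-module-X"},
	}
	for _, t := range disableModule(queue, "my-module-X") {
		fmt.Println(t.Type, t.Module)
	}
}
```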

Allow specifying name, namespace and order for the module

Problems:

  1. We automatically camel-case the module name, but if the name contains an abbreviation, we get ugly results, such as nodeLocalDns instead of nodeLocalDNS.
  2. Sometimes we need to know the module namespace, but we don't have a centralized mapping.
  3. It would be generally better to pass the namespace to helm install. At least:
    • we will have namespaces in helm list,
    • we will be able to use Release.Namespace, instead of hardcoding namespace name into each template.
  4. Module order is part of ugly hardcode.

Proposal:

  1. Add module.yaml to each module, and make it required. It should contain, at least, name (camel-cased), namespace and order, all required.
  2. Use name (camel-cased) from module.yaml when generating helm values, and in all other places (list of enabled modules, logs, etc).
  3. Add the automagic global variable with module-to-namespace mapping.

Extra logic:

  1. When the namespace of a module changes, perhaps the module should be uninstalled first (as far as I remember, helm doesn't allow changing the namespace of an installed release).
  2. When a module name changes, addon-operator should treat it as one module gone and a new one added.

Deduplicator of Helm runs should use helm render

The checksum is currently calculated over the values and the files in the chart directory. But a change in values should not necessarily lead to a helm upgrade. Calculating the checksum over the helm render output would be more useful.
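The difference can be illustrated with a checksum taken over rendered output instead of over input values. This is a sketch: `checksum` is a hypothetical helper, and `render` is a stand-in for the real helm render step, which is not invoked here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// checksum returns a SHA-256 hex digest over all parts. A zero byte is
// written after each part so that ("ab") and ("a", "b") differ.
func checksum(parts ...string) string {
	h := sha256.New()
	for _, p := range parts {
		h.Write([]byte(p))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// render is a stand-in for rendering a chart; it uses only one key,
// as a template that ignores some values would.
func render(values map[string]string) string {
	return "replicas: " + values["replicas"] + "\n"
}

func main() {
	before := map[string]string{"replicas": "2"}
	after := map[string]string{"replicas": "2", "unusedKey": "changed"}

	// The values differ, but the rendered output is identical,
	// so a render-based checksum reports no change.
	same := checksum(render(before)) == checksum(render(after))
	fmt.Println("render checksum unchanged:", same)
}
```

A values-based checksum would treat `before` and `after` as different and trigger an upgrade that changes nothing.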

[META] Sprints planning

This plan can be reviewed when new issues appear. Feel free to comment on issues in milestones, and we will do our best to stick to the dates.

TODO: plan and links to milestones

Refactor handling labels for logging, metrics and debug commands

There is a myriad of map[string]string values initialized and passed between layers. The upper-level code has to know which labels to pass to the lower-level methods, and the lower-level code has no access to additional information.
It seems that Context can help: labels can be stored as context values. Upper-level code just passes a context, and lower-level code retrieves everything from this context.

This requires changes in shell-operator too.

Modules discovery process rethinking

The discovery process is currently static: it calculates the enabled state for all modules at once. A better way is to find the first enabled module and run it, then find the next enabled module and run it, and so on until all modules are run. This more dynamic algorithm also helps resolve #16 and #43.

ModuleRun tasks can be queued while global hooks are run

A global hook can change a module section in configmap/addon-operator, and this leads to running an enabled script. This is OK when all global onStartup and kubernetes.Synchronization hooks have already run, and it is terrible if the enabled script depends on the results of several global hooks (think of discovery).

One possible solution is to disable handling changes in module sections until the first DiscoverModuleState task is finished.

Secret values

All values are now in cm/addon-operator. That is bad for sensitive data.

Support Vault or something like this.

Implement metrics for failing values validations

Module configurations are validated, but they are stored in a ConfigMap that can be changed directly via kubectl, so a validation error may occur and be visible only in the operator logs.

To make a misconfigured module observable, we might want to implement a metric representing this condition.

This metric would be used in alerts that notify system operators about a configuration error.

Introduce names of different types of values for the documentation and further use

@diafour, what do you think if we introduce the following names and change the documentation appropriately:

  • global static values (/modules/values.yaml)
  • module static values (modules/123-some-module/values.yaml)
  • global config values
  • module config values
  • global dynamic values and global dynamic values patches
  • module dynamic values and module dynamic values patches
  • resulting global values
  • resulting module values
  • resulting values, or just values: resulting global and module values merged?

And of course we should briefly (but precisely and accessibly) describe each of these types of values right at the beginning of VALUES.md, and then use these names everywhere.

Non-informative error about failed enabled script

If an enabled script exits with a non-zero code, the error message doesn't say which enabled script failed.

{"level":"info",
  "msg":"Modules enabled by script: []",
  "operator.component":"moduleManager,discoverModulesState"}

{"event.id":"OperatorOnStartup","level":"error",
  "msg":"DiscoverModulesState failed, queue Delay task to retry. Failed count is 24. Error: exit status 1",
  "operator.component":"TaskRunner","task.id":"XXX","task.type":"DiscoverModulesState"}

Also, add more information to other errors in checkIsEnabledByScript method.

Allow using an external chart for a module

A module can have no templates directory and declare dependencies in Chart.yaml. The problem is that addon-operator constructs values for helm in this form:

global:
  globalKey1: ...
  globalKey2: ...
  ...
moduleNameInCamelCase:
  moduleKey1: ...
  moduleKey2: ...

But an external chart does not "expect" values in this form. Should we define a values transformation in module.yaml (#39), or is there a better solution for this problem?
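One possible transformation, sketched under assumptions, relies on the Helm convention that a parent chart passes values to a dependency under a key equal to the dependency's name, while `global` is visible to all charts. The `forDependency` helper and the chart name in `main` are hypothetical.

```go
package main

import "fmt"

// forDependency re-keys the module section under the dependency chart
// name and keeps the global section as is.
func forDependency(values map[string]interface{}, moduleKey, chartName string) map[string]interface{} {
	out := map[string]interface{}{}
	if g, ok := values["global"]; ok {
		out["global"] = g
	}
	if m, ok := values[moduleKey]; ok {
		out[chartName] = m
	}
	return out
}

func main() {
	values := map[string]interface{}{
		"global":                map[string]interface{}{"globalKey1": "a"},
		"moduleNameInCamelCase": map[string]interface{}{"moduleKey1": "b"},
	}
	// "external-chart" is a made-up dependency name.
	fmt.Println(forDependency(values, "moduleNameInCamelCase", "external-chart"))
}
```

Whether such a mapping belongs in module.yaml or elsewhere is the open question of this issue; the sketch only shows the shape of the data after the change.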

Global hooks with schedules do not work

Hi. I am trying to make a global hook with a schedule like this:

cat global-hooks/check-update.sh 
#!/usr/bin/env bash
if [[ $1 == "--config" ]] ; then
  cat <<EOF
{
  "configVersion":"v1",
  "schedule": [
    {
      "name": "every 1 min",
      "crontab": "* * * * *"
    }
  ],
  "onStartup": 1
}
EOF
exit 0
fi

echo "I AM GLOBAL HOOK"
echo $(date) >> /tmp/date.txt

Only the onStartup binding runs the script.

time="2020-02-29T08:27:30Z" level=info msg="I AM GLOBAL HOOK" binding=onStartup event.type=OperatorStartup hook=check-update.sh hook.type=global output=stdout queue=main task.id=e54568b1-449d-472d-8065-744bdf4b1554
time="2020-02-29T08:27:30Z" level=info msg="Global hook success" binding=onStartup event.type=OperatorStartup hook=check-update.sh hook.type=global operator.component=taskRunner queue=main task.id=e54568b1-449d-472d-8065-744bdf4b1554

Also, if I make the script a module hook, it works.

Related logs - https://pastebin.com/Vs01a3dT
There are also 2 modules, but they are disabled.

Is it supported right now?

debug command: module render

Add a command to render module helm templates with the current merged values or with values from a file.

addon-operator module render <module-name> [--set stringArray] [--values file.yaml]
