datadog / watermarkpodautoscaler
Custom controller that extends the Horizontal Pod Autoscaler
License: Apache License 2.0
I read the docs and still don't have a clear idea of what happens if I combine multiple metrics, even of the same type (external and resource).
What is possible, and how will it behave? Consider these examples (for comparison, a sketch of the documented single-metric form follows example 4):
1.) I think I understood that if you use type external, only one metric is allowed.
- external:
    highWatermark: 400m
    lowWatermark: 150m
    metricName: custom.request_duration.max
    metricSelector:
      matchLabels:
        app: {{ .Chart.Name }}
        release: {{ .Release.Name }}
2.) But what if I use resource? Can I have two metrics, e.g.:
- resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 85%
  type: Resource
- resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 90%
  type: Resource
And what happens if memory is at 90% and CPU at 50%? Will it scale up or down?
3.) And what if I even mix kinds?
a.)
- external:
    highWatermark: 90
    lowWatermark: 60
    metricName: kubernetes.cpu.usage
    metricSelector:
      matchLabels:
        app: {{ .Chart.Name }}
        release: {{ .Release.Name }}
- resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 85%
  type: Resource
b.)
- external:
    highWatermark: 85
    lowWatermark: 0
    metricName: kubernetes.memory.usage
    metricSelector:
      matchLabels:
        app: {{ .Chart.Name }}
        release: {{ .Release.Name }}
- resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 85%
  type: Resource
4.) What if I mix metrics and create contradictory requirements?
- external:
    highWatermark: 80
    lowWatermark: 40
    metricName: kubernetes.cpu.usage
    metricSelector:
      matchLabels:
        app: {{ .Chart.Name }}
        release: {{ .Release.Name }}
- resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 90%
  type: Resource
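For comparison, and given that (as noted further down in this thread) the README documents only one metric per WPA as officially supported, here is a minimal sketch of that single-metric form, modeled on the examples above (metric name, watermarks, and labels are placeholders):

spec:
  metrics:
    - type: External
      external:
        highWatermark: "400m"
        lowWatermark: "150m"
        metricName: custom.request_duration.max
        metricSelector:
          matchLabels:
            app: myapp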
Hello 👋
We're trying to start using the WPA, but we keep seeing this error message:
{
"level": "error",
"ts": 1594040003.47035,
"logger": "wpa_controller",
"msg": "The WPA controller was unable to update the number of replicas",
"Request.Namespace": "web-services-staging",
"Request.Name": "nextapi",
"error": "WatermarkPodAutoscaler.datadoghq.com \"nextapi\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"datadoghq.com/v1alpha1\", \"kind\":\"WatermarkPodAutoscaler\", \"metadata\":map[string]interface {}{\"annotations\":map[string]interface {}{\"meta.helm.sh/release-name\":\"nextapi\", \"meta.helm.sh/release-namespace\":\"web-services-staging\"}, \"creationTimestamp\":\"2020-07-06T12:49:21Z\", \"generation\":2, \"labels\":map[string]interface {}{\"app\":\"nextapi\", \"app.kubernetes.io/managed-by\":\"Helm\", \"chart\":\"mozart-0.4.0\", \"env\":\"staging\", \"heritage\":\"Helm\", \"region\":\"eu-west-1\", \"release\":\"nextapi\", \"stage\":\"staging\"}, \"name\":\"nextapi\", \"namespace\":\"web-services-staging\", \"resourceVersion\":\"124372418\", \"uid\":\"51196075-0345-4839-8526-3cf805be0376\"}, \"spec\":map[string]interface {}{\"algorithm\":\"absolute\", \"downscaleForbiddenWindowSeconds\":60, \"maxReplicas\":50, \"metrics\":[]interface {}{map[string]interface {}{\"external\":map[string]interface {}{\"highWatermark\":\"1\", \"lowWatermark\":\"0\", \"metricName\":\"php_fpm.listen_queue.size\", \"metricSelector\":map[string]interface {}{\"matchLabels\":map[string]interface {}{\"app\":\"nextapi\", \"region\":\"eu-west-1\", \"stage\":\"staging\"}}}, \"type\":\"External\"}}, \"minReplicas\":2, \"scaleDownLimitFactor\":30, \"scaleTargetRef\":map[string]interface {}{\"apiVersion\":\"apps/v1\", \"kind\":\"Deployment\", \"name\":\"nextapi\"}, \"scaleUpLimitFactor\":50, \"tolerance\":0.01, \"upscaleForbiddenWindowSeconds\":30}, \"status\":map[string]interface {}{\"conditions\":[]interface {}{map[string]interface {}{\"lastTransitionTime\":\"2020-07-06T12:53:23Z\", \"message\":\"Scaling changes can be applied\", \"reason\":\"DryRun mode disabled\", \"status\":\"False\", \"type\":\"DryRun\"}, map[string]interface {}{\"lastTransitionTime\":\"2020-07-06T12:53:23Z\", \"message\":\"the WPA controller was able to get the target's current scale\", \"reason\":\"SucceededGetScale\", \"status\":\"True\", \"type\":\"AbleToScale\"}, map[string]interface {}{\"lastTransitionTime\":\"2020-07-06T12:53:23Z\", \"message\":\"the HPA was unable to compute the replica count: unable to get external metric web-services-staging/php_fpm.listen_queue.size/&LabelSelector{MatchLabels:map[string]string{app: nextapi,region: eu-west-1,stage: staging,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get php_fpm.listen_queue.size.external.metrics.k8s.io)\", \"reason\":\"FailedGetExternalMetric\", \"status\":\"False\", \"type\":\"ScalingActive\"}}, \"currentMetrics\":interface {}(nil), \"currentReplicas\":2, \"desiredReplicas\":0}}: validation failure list:\nstatus.currentMetrics in body must be of type array: \"null\"",
"stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:428\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"
}
This is our WPA definition:
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: nextapi
  labels:
    app: nextapi
    chart: mozart-0.4.0
    release: nextapi
    heritage: Helm
    env: staging
    region: eu-west-1
    stage: staging
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nextapi
  downscaleForbiddenWindowSeconds: 60
  upscaleForbiddenWindowSeconds: 30
  scaleDownLimitFactor: 30
  scaleUpLimitFactor: 50
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - external:
        highWatermark: "1"
        lowWatermark: "0"
        metricName: php_fpm.listen_queue.size
        metricSelector:
          matchLabels:
            app: nextapi
            stage: staging
            region: eu-west-1
      type: External
  tolerance: 0.01
I see in the README that only one metric per WPA resource is officially supported. Can you expand on that a little? What are the potential issues with trying to use multiple metrics per resource? Is it that they can potentially contradict each other on scale-up/down behavior?
Describe what happened:
We deploy WPA objects for our services using a Helm chart. This chart contains a manifest template that manages the WPA objects.
Whatever value we define for the dryRun attribute (false/true, on/off, empty), if someone sets it to true using kubectl, the value is never reset to false when we update the chart release.
Describe what you expected:
Setting dryRun to false in the WPA manifest should set the value to false when the manifest is applied by Helm.
This works fine using the kubectl command provided in the README file.
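The README command referred to is presumably a JSON patch along these lines (a sketch; <wpa-name> is a placeholder, and the exact form in the README may differ):

kubectl patch wpa <wpa-name> --type='json' -p='[{"op": "replace", "path": "/spec/dryRun", "value": true}]'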
Steps to reproduce the issue:
Additional environment details (Kubernetes version, etc):
quantities and metric relation
Taken from the docs:
"They are specified as Quantities, so you can use m | "" | k | M | G | T | P | E to easily represent the value you want to use."
The sentence above states that highWatermark and lowWatermark values can (or must?) carry a quantity suffix.
I assume it depends on the type of metric.
What type of quantity is ""?
What about percentage values? I would think it should be possible to use any metric that has a defined max and min value, e.g. kubernetes.memory.usage_pct or kubernetes.cpu.usage_pct. Do I set just the value 90, or 90%?
If I don't add a quantity suffix but just the number, will it assume a default (which one?) or will it fail?
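These suffixes look like standard Kubernetes resource.Quantity notation, where m means milli (1/1000), "" is simply the bare, unsuffixed number, and k, M, G, T, P, E are the usual decimal multipliers. Under that assumption, a sketch of equivalent watermark spellings (values are placeholders):

highWatermark: "400m"   # 0.4 in the metric's unit
highWatermark: "0.4"    # same value, bare ("") notation
highWatermark: "2k"     # 2000 in the metric's unit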
case sensitivity
There are different metrics I can use, e.g. the Docker metrics such as "Memory" are capitalized. Is there generally any case sensitivity?
metric context/selection
Taking the two metric sources above, system/docker and kubernetes, I could use similar metrics from both, e.g. system.mem.used vs kubernetes.memory.usage.
If I deploy a WPA resource with
metricSelector:
  matchLabels:
    app: {{ .Chart.Name }}
    release: {{ .Release.Name }}
will both metric measurements be scoped to the container level per pod, or do they have different scopes, and which?
Which of these might be the better choice to control the scaling?
Can WPA support OpenShift 3.11 or Kubernetes 1.11?
I tried to apply WPA on OKD 3.11 and got the following error:
must only have "properties", "required" or "description" at the root if the status subresource is enabled
I removed the "subresources" section from the WatermarkPodAutoscaler CRD, and it could then be deployed successfully.
file datadoghq.com_watermarkpodautoscalers_crd.yaml:
  shortNames:
  - wpa
  singular: watermarkpodautoscaler
  scope: Namespaced
  subresources:  # delete these two lines
    status: {}   # delete these two lines
  validation:
    openAPIV3Schema:
      description: WatermarkPodAutoscaler is the Schema for the watermarkpodautoscalers API
But then I faced a problem: the Datadog Cluster Agent didn't detect the created WPA and didn't collect custom metrics from the Datadog server when I added the WPA below.
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: consumer
  namespace: kafka-project
spec:
  # Add fields here
  algorithm: average
  maxReplicas: 15
  minReplicas: 1
  tolerance: 0.01
  downscaleForbiddenWindowSeconds: 300
  upscaleForbiddenWindowSeconds: 15
  scaleUpLimitFactor: 50
  scaleDownLimitFactor: 20
  scaleTargetRef:
    kind: Deployment
    apiVersion: apps/v1
    name: consumer
  readinessDelay: 10
  metrics:
    # Resource or External type supported
    # Example usage of External type
    - type: External
      external:
        highWatermark: "1"
        lowWatermark: "1"
        metricName: <metrics_name>
        metricSelector:
          matchLabels:
            kube_deployment: consumer
            kube_namespace: kafka-project
Error log is below.
Datadog cluster agent
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/aggregator/aggregator.go:554 in sendEvents) | Flushing 1 events to the forwarder
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/aggregator/aggregator.go:393 in pushSeries) | Flushing 2 series to the forwarder
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/aggregator/aggregator.go:506 in sendServiceChecks) | Flushing 5 service checks to the forwarder
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/serializer/split/split.go:77 in Payloads) | The payload was not too big, returning the full payload
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/serializer/split/split.go:77 in Payloads) | The payload was not too big, returning the full payload
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/serializer/split/split.go:77 in Payloads) | The payload was not too big, returning the full payload
2020-05-12 12:13:05 UTC | CLUSTER | DEBUG | (pkg/collector/runner/runner.go:263 in work) | Running check kubernetes_apiserver
2020-05-12 12:13:05 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/leaderelection/leaderelection.go:164 in EnsureLeaderElectionRuns) | Currently Leader: true. Leader identity: "datadog-cluster-agent-59858975fd-98rfr"
2020-05-12 12:13:05 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/common/common.go:23 in GetResourcesNamespace) | No configured namespace for the resource, fetching from the current context
2020-05-12 12:13:05 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/events.go:55 in RunEventCollection) | Starting to watch from 60726555
2020-05-12 12:13:07 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/events.go:113 in RunEventCollection) | Collected 2 events, will resume watching from resource version 60726597
2020-05-12 12:13:07 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/common/common.go:23 in GetResourcesNamespace) | No configured namespace for the resource, fetching from the current context
2020-05-12 12:13:07 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/apiserver.go:328 in UpdateTokenInConfigmap) | Updated event.tokenKey to 60726597 in the ConfigMap datadogtoken
2020-05-12 12:13:07 UTC | CLUSTER | DEBUG | (pkg/collector/runner/runner.go:329 in work) | Done running check kubernetes_apiserver
2020-05-12 12:13:08 UTC | CLUSTER | DEBUG | (pkg/clusteragent/custommetrics/provider.go:196 in GetExternalMetric) | External metrics returned: []external_metrics.ExternalMetricValue{}
WPA controller
E0512 12:10:52.718448 1 memcache.go:199] couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1
{"level":"info","ts":1589285452.7814093,"logger":"wpa_controller","msg":"Target deploy","Request.Namespace":"kafka-project","Request.Name":"consumer","replicas":2}
{"level":"error","ts":1589285452.7956553,"logger":"wpa_controller","msg":"The WPA controller was unable to update the number of replicas","Request.Namespace":"kafka-project","Request.Name":"consumer","error":"the server could not find the requested resource (put watermarkpodautoscalers.datadoghq.com consumer)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:428\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"info","ts":1589285467.796208,"logger":"wpa_controller","msg":"Reconciling WatermarkPodAutoscaler","Request.Namespace":"kafka-project","Request.Name":"consumer"}
{"level":"info","ts":1589285467.8238223,"logger":"wpa_controller","msg":"Target deploy","Request.Namespace":"kafka-project","Request.Name":"consumer","replicas":2}
{"level":"error","ts":1589285467.8450387,"logger":"wpa_controller","msg":"The WPA controller was unable to update the number of replicas","Request.Namespace":"kafka-project","Request.Name":"consumer","error":"the server could not find the requested resource (put watermarkpodautoscalers.datadoghq.com consumer)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:428\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
Is there a workaround for supporting k8s 1.11? Could you help add support for k8s 1.11?
Thanks for your help.
Since this controller extracts metrics using the Datadog API, I would like to know how this can be brought in line with the low API rate limits for these kinds of calls (source: https://docs.datadoghq.com/api/#rate-limiting).
Additionally, there is no way to monitor the current rate-limit budget, so it just starts failing silently.
There should at least be some kind of warning in the README, or information on what to do about this.
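For what it's worth, the rate-limiting page linked above documents X-RateLimit-* response headers, so the remaining budget can at least be checked manually; a sketch (endpoint, query, and keys are placeholders):

curl -s -D - -o /dev/null \
  -H "DD-API-KEY: <api_key>" -H "DD-APPLICATION-KEY: <app_key>" \
  "https://api.datadoghq.com/api/v1/query?from=<ts>&to=<ts>&query=<query>"
# inspect X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset in the response headers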
21s Warning Failed pod/watermarkpodautoscaler-66d6d96c96-9ms4b Error: failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "watermarkpodautoscaler": executable file not found in $PATH: unknown
Describe what happened:
WPA tries to scale out an OpenShift DeploymentConfig but gets the following error:
{"level":"info","ts":1600848500.4110653,"logger":"wpa_controller","msg":"Reconciling WatermarkPodAutoscaler","Request.Namespace":"nginx-preloader-sample","Request.Name":"wpa4"}
{"level":"error","ts":1600848500.412769,"logger":"wpa_controller","msg":"RunTime error in reconcileWPA","Request.Namespace":"nginx-preloader-sample","Request.Name":"wpa4","returnValue":"runtime error: invalid memory address or nil pointer dereference","error":"recover error","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA.func1\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:360\nruntime.gopanic\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/panic.go:679\nruntime.panicmem\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/panic.go:199\nruntime.sigpanic\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/signal_unix.go:394\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:379\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
WPA.yaml
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: wpa4
  namespace: nginx-preloader-sample
spec:
  algorithm: average
  maxReplicas: 20
  minReplicas: 1
  tolerance: 0.01
  downscaleForbiddenWindowSeconds: 300
  upscaleForbiddenWindowSeconds: 15
  scaleUpLimitFactor: 90
  scaleDownLimitFactor: 90
  scaleTargetRef:
    kind: DeploymentConfig
    apiVersion: apps.openshift.io/v1
    name: nginx-prepared
  readinessDelay: 10
  metrics:
    - type: External
      external:
        highWatermark: "1"
        lowWatermark: "1"
        metricName: federatorai.recommendation
        metricSelector:
          matchLabels:
            resource: replicas
            kube_cluster: jason-4-115
            oshift_deployment_config: nginx-prepared
            kube_namespace: nginx-preloader-sample
The Cluster Agent can get the external metric:
* watermark pod autoscaler: nginx-preloader-sample/wpa4
Metric name: federatorai.recommendation
Labels:
- kube_cluster: jason-4-115
- kube_namespace: nginx-preloader-sample
- oshift_deployment_config: nginx-prepared
- resource: replicas
Value: 6
Timestamp: 2020-09-23 08:18:00.000000 UTC
Valid: true
Another WPA works as expected:
* watermark pod autoscaler: myproject/wpa3
Metric name: federatorai.recommendation
Labels:
- kube_cluster: jason-4-115
- kube_deployment: consumer3
- kube_namespace: myproject
- resource: replicas
Value: 7
Timestamp: 2020-09-23 08:25:00.000000 UTC
Valid: true
WPA yaml
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: wpa3
  namespace: myproject
spec:
  # Add fields here
  # algorithm must be average
  algorithm: average
  maxReplicas: 15
  minReplicas: 1
  tolerance: 0.01
  downscaleForbiddenWindowSeconds: 300
  upscaleForbiddenWindowSeconds: 15
  scaleUpLimitFactor: 90
  scaleDownLimitFactor: 90
  scaleTargetRef:
    kind: Deployment
    apiVersion: apps/v1
    name: consumer3
  readinessDelay: 10
  metrics:
    # Resource or External type supported
    # Example usage of External type
    - type: External
      external:
        # do not edit highWatermark and lowWatermark
        # highWatermark and lowWatermark must be 1
        highWatermark: "1"
        lowWatermark: "1"
        metricName: federatorai.recommendation
        metricSelector:
          matchLabels:
            resource: replicas
            kube_cluster: jason-4-115
            kube_deployment: consumer3
            kube_namespace: myproject
WPA log
{"level":"info","ts":1600849222.9556293,"logger":"wpa_controller","msg":"Successful rescale","Request.Namespace":"myproject","Request.Name":"wpa3","currentReplicas":6,"desiredReplicas":7,"rescaleReason":"federatorai.recommendation{map[kube_cluster:jason-4-115 kube_deployment:consumer3 kube_namespace:myproject resource:replicas]} above target"}
Describe what you expected:
WPA should scale out the OpenShift DeploymentConfig successfully.
Steps to reproduce the issue:
Additional environment details (Kubernetes version, etc):
openshift v3.11.0+8f721f2-450
kubernetes v1.11.0+d4cacc0
WPA image
image: datadog/watermarkpodautoscaler:v0.1.0
"docker.io/datadog/watermarkpodautoscaler:v0.3.0-rc5": failed to resolve
reference "docker.io/datadog/watermarkpodautoscaler:v0.3.0-rc5": docker.io/datadog/watermarkpodautoscaler:v0.3.0-rc5:
Trying to understand the difference (or historical evolution?) from A) external metrics with the query embedded in the WPA resource spec, to B) external metrics referencing a custom metric defined as a dedicated resource.
A.)
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: {{ .Chart.Name }}
  namespace: {{ .Release.Namespace }}
spec:
  metrics:
    - type: External
      external:
        metricName: "<METRIC_NAME>"
        metricSelector:
          matchLabels:
            <TAG_KEY>: <TAG_VALUE>
B.)
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: your_datadogmetric_name
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "labels" . | indent 4 }}
spec:
  query: avg:kubernetes.cpu.usage{app:myapp,release:myapp}.rollup(30)
---
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: {{ .Chart.Name }}
  namespace: {{ .Release.Namespace }}
spec:
  metrics:
    - type: External
      external:
        metricName: "datadogmetric@{{ .Release.Namespace }}:your_datadogmetric_name"
1.) Is B the newer way to spec the metrics used in a WPA, replacing A, since Kubernetes v1.2 allows this?
2.) Why does B outweigh A in features, if it does? Is A no longer best practice, or even due to be sunsetted?
3.) Instead of defining a metric resource via a k8s manifest, can I also use the beta Datadog UI feature to create a custom metric named "your_datadogmetric_name", and is it then referenceable in any WPA resource spec as well?
4.) If I change the query live in the UI for an already deployed WPA using it, how fast will the change be picked up?
5.) Will the "labels" used in the query always match the labels I specified on the application I want to apply the metric filter to (e.g. deployment.metadata.labels)?
6.) Creating a new custom metric either via the UI or a k8s manifest, which tags can they filter on (e.g. pods, deployments, daemon sets)?
Describe what happened:
When I run the autoscaler in HA, with a second pod in standby, the second pod logs that it is standing by a few times and then exits. Eventually the pod enters a CrashLoopBackOff state, which means it will not actually be standing by part of the time.
What I see in the pod describe output is:
Last State:     Terminated
  Reason:       Error
  Exit Code:    2
  Started:      Wed, 17 Mar 2021 11:54:57 -0400
  Finished:     Wed, 17 Mar 2021 11:55:27 -0400
Ready:          False
Restart Count:  3258
What I see in the logs:
{"level":"info","ts":1615997228.2988877,"logger":"cmd","msg":"Version: v0.2.0-dirty"}
{"level":"info","ts":1615997228.2989118,"logger":"cmd","msg":"Build time: 2020-09-09/20:03:05"}
{"level":"info","ts":1615997228.2989151,"logger":"cmd","msg":"Git tag: v0.2.0"}
{"level":"info","ts":1615997228.298917,"logger":"cmd","msg":"Git Commit: 3c5176693cdf2838c54298fb6f732c4ac21dbe86"}
{"level":"info","ts":1615997228.2989194,"logger":"cmd","msg":"Go Version: go1.13.15"}
{"level":"info","ts":1615997228.2989216,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1615997228.2989237,"logger":"cmd","msg":"Version of operator-sdk: v0.13.0"}
{"level":"info","ts":1615997228.299064,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1615997229.3133864,"logger":"leader","msg":"Found existing lock","LockOwner":"watermarkpodautoscaler-69cc854fbf-dqjbg"}
{"level":"info","ts":1615997229.3252141,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1615997230.456776,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1615997232.8421242,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1615997237.3827972,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1615997246.0937243,"logger":"leader","msg":"Not the leader. Waiting."}
Describe what you expected:
I expect that the pod can remain online and continue to check if it can become leader without panicking.
Steps to reproduce the issue:
Additional environment details (Kubernetes version, etc):
Autoscaler image: datadog/watermarkpodautoscaler:v0.2.0 (the exact commit hash is in the logs above)
I'm seeing this behaviour in multiple clusters, different kubernetes versions:
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.9", GitCommit:"94f372e501c973a7fa9eb40ec9ebd2fe7ca69848", GitTreeState:"clean", BuildDate:"2020-09-16T13:47:43Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.7", GitCommit:"6c143d35bb11d74970e7bc0b6c45b6bfdffc0bd4", GitTreeState:"clean", BuildDate:"2019-12-11T12:34:17Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Thanks Datadog, this is an exciting project. But why not support CPU metrics?
{"level":"info","ts":1624011129.786673,"logger":"controllers.WatermarkPodAutoscaler","msg":"Failed to compute desired number of replicas based on listed metrics.","watermarkpodautoscaler":"dev/myapp","reference":"Deployment/dev/myapp","error":"failed to get external metric kubernetes.cpu.usage: unable to get external metric dev/kubernetes.cpu.usage/&LabelSelector{MatchLabels:map[string]string{app: myapp,release: myapp,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: Internal error occurred: DatadogMetric is invalid, err: Invalid metric (from backend), query: avg:kubernetes.cpu.usage{app:myapp,release:myapp}.rollup(30)"}
{"level":"info","ts":1624011144.793971,"logger":"controllers.WatermarkPodAutoscaler","msg":"Target deploy","watermarkpodautoscaler":"dev/myapp","replicas":2}
{"level":"info","ts":1624011144.7941537,"logger":"controllers.WatermarkPodAutoscaler","msg":"getReadyPodsCount","watermarkpodautoscaler":"dev/myapp","full podList length":2,"toleratedAsReadyPodCount":2,"incorrectly targeted pods":0}
{"level":"info","ts":1624011144.8295028,"logger":"controllers.WatermarkPodAutoscaler","msg":"Failed to compute desired number of replicas based on listed metrics.","watermarkpodautoscaler":"dev/myapp","reference":"Deployment/dev/myapp","error":"failed to get external metric kubernetes.cpu.usage: unable to get external metric dev/kubernetes.cpu.usage/&LabelSelector{MatchLabels:map[string]string{app: myapp,release: myapp,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: Internal error occurred: DatadogMetric is invalid, err: Invalid metric (from backend), query: avg:kubernetes.cpu.usage{app:myapp,release:myapp}.rollup(30)"}
{"level":"info","ts":1624011159.8374639,"logger":"controllers.WatermarkPodAutoscaler","msg":"Target deploy","watermarkpodautoscaler":"dev/myapp","replicas":2}
Ask for metric (without tags)
% kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dev/kubernetes.cpu.usage | jq ."
Error from server (InternalError): Internal error occurred: DatadogMetric not found for metric name: kubernetes.cpu.usage | jq ., datadogmetricid: datadog/dcaautogen-646a73ad876299907eb8035a2fa8e2b60ac832
Impersonate and ask for any metric
% kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dev/metric --as system:serviceaccount:dev:watermarkpodautoscaler"
Error from server (InternalError): Internal error occurred: DatadogMetric not found for metric name: metric --as system:serviceaccount:dev:watermarkpodautoscaler, datadogmetricid: datadog/dcaautogen-98eae079d0f4a80135d6f4b6f9762cea878b97
What did I do wrong?
How can I find out in general which metrics, and which tags for them, are available?
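One thing stands out in both commands above: the pipe to jq and the --as flag ended up inside the quoted URL, so the API server treated them as part of the metric name, which is exactly what the error messages echo back. A corrected sketch, keeping the same paths:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dev/kubernetes.cpu.usage" | jq .
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dev/metric" --as system:serviceaccount:dev:watermarkpodautoscaler | jq .

As for discovering what is available, listing the API group root should enumerate the registered external metrics:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .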
I went ahead and applied the contents of deploy/ to get a test of this running, and created a WPA for a deployment.
I got various errors about missing RBAC permissions in the provided role, so I added a few (deployment/replicaset/statefulset get/list/watch; service create/update/get). That didn't solve the problems though. I checked and saw that the provided yamls apply version v0.0.1, so I went ahead and set it up to use v0.1.0 instead. Now I'm getting the following errors:
{"level":"info","ts":1578589126.4794915,"logger":"cmd","msg":"Version: 0.0.1"}
{"level":"info","ts":1578589126.4795249,"logger":"cmd","msg":"Build time: "}
{"level":"info","ts":1578589126.4795303,"logger":"cmd","msg":"Git tag: "}
{"level":"info","ts":1578589126.4795349,"logger":"cmd","msg":"Git Commit: "}
{"level":"info","ts":1578589126.4795387,"logger":"cmd","msg":"Go Version: go1.13.4"}
{"level":"info","ts":1578589126.4795427,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1578589126.4795468,"logger":"cmd","msg":"Version of operator-sdk: v0.12.0"}
{"level":"info","ts":1578589126.479719,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1578589127.4483173,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1578589127.453766,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1578589128.408475,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"0.0.0.0:8383"}
{"level":"info","ts":1578589128.4087286,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1578589128.4106596,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"watermarkpodautoscaler-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1578589128.4107704,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1578589128.4110043,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1578589128.5111277,"logger":"controller-runtime.controller","msg":"Starting Controller","controller":"watermarkpodautoscaler-controller"}
{"level":"info","ts":1578589128.6113448,"logger":"controller-runtime.controller","msg":"Starting workers","controller":"watermarkpodautoscaler-controller","worker count":1}
{"level":"info","ts":1578589128.6114364,"logger":"wpa_controller","msg":"Reconciling WatermarkPodAutoscaler","Request.Namespace":"default","Request.Name":"statsdgenerator-wpa"}
E0109 16:58:49.515122 1 memcache.go:199] couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1
{"level":"error","ts":1578589129.5824547,"logger":"wpa_controller","msg":"RunTime error in reconcileWPA","Request.Namespace":"default","Request.Name":"statsdgenerator-wpa","returnValue":"runtime error: invalid memory address or nil pointer dereference","error":"recover error","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA.func1\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:360\nruntime.gopanic\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/panic.go:679\nruntime.panicmem\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/panic.go:199\nruntime.sigpanic\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/signal_unix.go:394\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:379\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
Thoughts?
It seems that K8s 1.22 is not compatible with WPA, since your CRDs still use apiVersion: apiextensions.k8s.io/v1beta1.
Link: CRD
Link to the Kubernetes documentation: Deprecated API Migration Guide
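Since apiextensions.k8s.io/v1beta1 was removed in 1.22, a server-side dry run against a 1.22 cluster should confirm the incompatibility (a sketch; the CRD path is assumed from the deploy/ layout mentioned elsewhere in these issues):

kubectl apply --dry-run=server -f deploy/crds/datadoghq.com_watermarkpodautoscalers_crd.yaml
# expected to fail with: no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"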
- hack/install-wwhrd.sh fails to install an ARM64 version (the amd64 version works; there's probably a new release as well)
- hack/install-yq.sh fails to install an ARM64 version (the amd64 version works; we need to move to yq 4 to have an arm64 version)
- go install sigs.k8s.io/controller-tools/cmd/[email protected] needs to be used instead of go get; it is not covered by install-tools but is required for the tests
- hack/install-kubebuilder.sh only copies the kubebuilder binary and forgets the assets (etcd/kube-apiserver). Downloading the amd64 version and replacing etcd with an arm64 version works
- make e2e doesn't work; the pod says runtime: failed to create new OS thread (have 2 already; errno=22)
Describe what happened:
Resource-type WPA metrics do not appear to be calculating average values; rather, they are calculating totals.
Describe what you expected:
I expect the values to be averaged.
> k top pod
NAME                                       CPU(cores)   MEMORY(bytes)
ingress-nginx-controller-64dd76d79-lfckl   7m           122Mi
ingress-nginx-controller-64dd76d79-rzxvc   8m           124Mi
> k describe wpa nginx-wpa | grep -A 1 "Current Average Value"
Current Average Value:  16m
Name:                   cpu
--
Current Average Value:  258920448
Name:                   memory
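The reported numbers do line up with sums rather than averages: 7m + 8m = 15m, close to the reported 16m, and 122Mi + 124Mi = 246Mi ≈ 258920448 bytes (about 246.9Mi), whereas the per-pod averages would be roughly 7.5m and 123Mi.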
Hello 👋
Are there plans to publish this to either the official Helm Chart repo or one of your own making?
It would help a lot with installing this if we didn't have to commit a clone of this repo to our own VCS just to use the helm chart :)
ERR - myservice/templates/watermarkpodautoscaler.yaml: Failed initializing schema https://kubernetesjsonschema.dev/master-standalone-strict/watermarkpodautoscaler-datadoghq-v1alpha1.json: Could not read schema from HTTP, response status is 404 Not Found
dry run...
history.go:56: [debug] getting history for release myservice
upgrade.go:123: [debug] preparing upgrade for vehicle-region-store
upgrade.go:131: [debug] performing update for vehicle-region-store
I created a WPA object which looks for internal metrics as described here:
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: watermarkpodautoscaler-internal
spec:
  maxReplicas: 3
  minReplicas: 1
  tolerance: 1
  readinessDelay: 10
  scaleTargetRef:
    kind: Deployment
    apiVersion: apps/v1
    name: nginx-deployment-wpa
  metrics:
    - type: Resource
      resource:
        highWatermark: "100m"
        lowWatermark: "50m"
        name: cpu
        metricSelector:
          matchLabels:
            app: nginx-wpa
The status tells me that it is being monitored:
kubectl describe wpa watermarkpodautoscaler-internal
[...]
Current Metrics:
  Resource:
    Current Average Value:  24m
    Name:                   cpu
  Type:                     Resource
[...]
Unfortunately, a simple get command shows me empty fields for VALUE, WATERMARK, etc.:
❯ k get wpa watermarkpodautoscaler-internal
NAME                              VALUE   HIGH WATERMARK   LOW WATERMARK   AGE   MIN REPLICAS   MAX REPLICAS   DRY-RUN
watermarkpodautoscaler-internal                                            14m   1              3
Can we improve this?
It seems the container arguments used are unknown to the container's Go application:
kubectl logs watermarkpodautoscaler-75cd69b9f7-tz88b
flag provided but not defined: -zap-level
Usage of /manager:
  -enable-leader-election
        Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager. (default true)
  -health-port int
        Port to use for the health probe (default 9440)
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-election-resource string
        determines which resource lock to use for leader election. option:[configmapsleases|endpointsleases|configmaps] (default "configmaps")
  -logEncoder string
        log encoding ('json' or 'console') (default "json")
  -loglevel value
        Set log level
  -metrics-addr string
        The address the metric endpoint binds to. (default ":8080")
  -syncPeriodSeconds int
        The informers resync period in seconds (default 3600)
  -version
        print version and exit
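Judging from the usage output above, this binary no longer accepts -zap-level; the listed -loglevel and -logEncoder flags look like the replacements. A sketch of the container args under that assumption (the flag values are guesses, not confirmed):

args:
  - "-loglevel=debug"
  - "-logEncoder=json"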