datadog / watermarkpodautoscaler
Custom controller that extends the Horizontal Pod Autoscaler
License: Apache License 2.0
I read the docs and still don't have a clear idea of what happens if I combine multiple metrics, even of the same type (external and resource).
What is possible, and how will it behave? Consider these examples (for comparison, a sketch of the documented single-metric form follows example 4):
1.) I think I understood that if you use type external, only one metric is allowed.
- external:
    highWatermark: 400m
    lowWatermark: 150m
    metricName: custom.request_duration.max
    metricSelector:
      matchLabels:
        app: {{ .Chart.Name }}
        release: {{ .Release.Name }}
2.) But what if I use resource? Can I have two metrics, e.g.:
- resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 85%
  type: Resource
- resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 90%
  type: Resource
And what happens if memory is at 90% and CPU at 50%? Will it scale up or down?
3.) And what if I even mix kinds?
a.)
- external:
    highWatermark: 90
    lowWatermark: 60
    metricName: kubernetes.cpu.usage
    metricSelector:
      matchLabels:
        app: {{ .Chart.Name }}
        release: {{ .Release.Name }}
- resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 85%
  type: Resource
b.)
- external:
    highWatermark: 85
    lowWatermark: 0
    metricName: kubernetes.memory.usage
    metricSelector:
      matchLabels:
        app: {{ .Chart.Name }}
        release: {{ .Release.Name }}
- resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 85%
  type: Resource
4.) What if I mix metrics and create contradictory requirements?
- external:
    highWatermark: 80
    lowWatermark: 40
    metricName: kubernetes.cpu.usage
    metricSelector:
      matchLabels:
        app: {{ .Chart.Name }}
        release: {{ .Release.Name }}
- resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 90%
  type: Resource
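For comparison, and given that (as noted further down in this thread) the README documents only one metric per WPA as officially supported, here is a minimal sketch of that single-metric form, modeled on the examples above (metric name, watermarks, and labels are placeholders):

spec:
  metrics:
    - type: External
      external:
        highWatermark: "400m"
        lowWatermark: "150m"
        metricName: custom.request_duration.max
        metricSelector:
          matchLabels:
            app: myapp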
Hello 👋
We're trying to start using the WPA, but we keep seeing this error message:
{
"level": "error",
"ts": 1594040003.47035,
"logger": "wpa_controller",
"msg": "The WPA controller was unable to update the number of replicas",
"Request.Namespace": "web-services-staging",
"Request.Name": "nextapi",
"error": "WatermarkPodAutoscaler.datadoghq.com \"nextapi\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"datadoghq.com/v1alpha1\", \"kind\":\"WatermarkPodAutoscaler\", \"metadata\":map[string]interface {}{\"annotations\":map[string]interface {}{\"meta.helm.sh/release-name\":\"nextapi\", \"meta.helm.sh/release-namespace\":\"web-services-staging\"}, \"creationTimestamp\":\"2020-07-06T12:49:21Z\", \"generation\":2, \"labels\":map[string]interface {}{\"app\":\"nextapi\", \"app.kubernetes.io/managed-by\":\"Helm\", \"chart\":\"mozart-0.4.0\", \"env\":\"staging\", \"heritage\":\"Helm\", \"region\":\"eu-west-1\", \"release\":\"nextapi\", \"stage\":\"staging\"}, \"name\":\"nextapi\", \"namespace\":\"web-services-staging\", \"resourceVersion\":\"124372418\", \"uid\":\"51196075-0345-4839-8526-3cf805be0376\"}, \"spec\":map[string]interface {}{\"algorithm\":\"absolute\", \"downscaleForbiddenWindowSeconds\":60, \"maxReplicas\":50, \"metrics\":[]interface {}{map[string]interface {}{\"external\":map[string]interface {}{\"highWatermark\":\"1\", \"lowWatermark\":\"0\", \"metricName\":\"php_fpm.listen_queue.size\", \"metricSelector\":map[string]interface {}{\"matchLabels\":map[string]interface {}{\"app\":\"nextapi\", \"region\":\"eu-west-1\", \"stage\":\"staging\"}}}, \"type\":\"External\"}}, \"minReplicas\":2, \"scaleDownLimitFactor\":30, \"scaleTargetRef\":map[string]interface {}{\"apiVersion\":\"apps/v1\", \"kind\":\"Deployment\", \"name\":\"nextapi\"}, \"scaleUpLimitFactor\":50, \"tolerance\":0.01, \"upscaleForbiddenWindowSeconds\":30}, \"status\":map[string]interface {}{\"conditions\":[]interface {}{map[string]interface {}{\"lastTransitionTime\":\"2020-07-06T12:53:23Z\", \"message\":\"Scaling changes can be applied\", \"reason\":\"DryRun mode disabled\", \"status\":\"False\", \"type\":\"DryRun\"}, map[string]interface {}{\"lastTransitionTime\":\"2020-07-06T12:53:23Z\", \"message\":\"the WPA controller was able to get the target's current scale\", \"reason\":\"SucceededGetScale\", \"status\":\"True\", \"type\":\"AbleToScale\"}, map[string]interface {}{\"lastTransitionTime\":\"2020-07-06T12:53:23Z\", \"message\":\"the HPA was unable to compute the replica count: unable to get external metric web-services-staging/php_fpm.listen_queue.size/&LabelSelector{MatchLabels:map[string]string{app: nextapi,region: eu-west-1,stage: staging,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get php_fpm.listen_queue.size.external.metrics.k8s.io)\", \"reason\":\"FailedGetExternalMetric\", \"status\":\"False\", \"type\":\"ScalingActive\"}}, \"currentMetrics\":interface {}(nil), \"currentReplicas\":2, \"desiredReplicas\":0}}: validation failure list:\nstatus.currentMetrics in body must be of type array: \"null\"",
"stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:428\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"
}
This is our WPA definition:
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: nextapi
  labels:
    app: nextapi
    chart: mozart-0.4.0
    release: nextapi
    heritage: Helm
    env: staging
    region: eu-west-1
    stage: staging
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nextapi
  downscaleForbiddenWindowSeconds: 60
  upscaleForbiddenWindowSeconds: 30
  scaleDownLimitFactor: 30
  scaleUpLimitFactor: 50
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - external:
        highWatermark: "1"
        lowWatermark: "0"
        metricName: php_fpm.listen_queue.size
        metricSelector:
          matchLabels:
            app: nextapi
            stage: staging
            region: eu-west-1
      type: External
  tolerance: 0.01
I see in the README that only one metric per WPA resource is officially supported. Can you expand on that a little? What are the potential issues with trying to use multiple metrics per resource? Is it that they can potentially contradict each other on scale-up/down behavior?
Describe what happened:
We deploy WPA objects for our services using a Helm chart. This chart contains a manifest template that manages the WPA objects.
Whatever value we define for the dryRun attribute (false/true, on/off, empty), if someone sets it to true using kubectl, the value is never reset to false when we update the chart release.
Describe what you expected:
Setting dryRun to false in the WPA manifest should set the value to false when the manifest is applied by Helm.
This works fine using the kubectl command provided in the README file.
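The README command referred to is presumably a JSON patch along these lines (a sketch; <wpa-name> is a placeholder, and the exact form in the README may differ):

kubectl patch wpa <wpa-name> --type='json' -p='[{"op": "replace", "path": "/spec/dryRun", "value": true}]'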
Steps to reproduce the issue:
Additional environment details (Kubernetes version, etc):
quantities and metric relation
Taken from the docs:
"They are specified as Quantities, so you can use m | "" | k | M | G | T | P | E to easily represent the value you want to use."
The sentence above states that highWatermark and lowWatermark values can (or must?) carry a quantity suffix.
I assume it depends on the type of metric.
What type of quantity is ""?
What about percentage values? I would think it should be possible to use any metric that has a defined max and min value, e.g. kubernetes.memory.usage_pct or kubernetes.cpu.usage_pct. Do I set just the value 90, or 90%?
If I don't add a quantity suffix but just the number, will it assume a default (which one?) or will it fail?
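These suffixes look like standard Kubernetes resource.Quantity notation, where m means milli (1/1000), "" is simply the bare, unsuffixed number, and k, M, G, T, P, E are the usual decimal multipliers. Under that assumption, a sketch of equivalent watermark spellings (values are placeholders):

highWatermark: "400m"   # 0.4 in the metric's unit
highWatermark: "0.4"    # same value, bare ("") notation
highWatermark: "2k"     # 2000 in the metric's unit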
case sensitivity
There are different metrics I can use, e.g. the Docker metrics such as "Memory" are capitalized. Is there generally any case sensitivity?
metric context/selection
Taking the two metric sources above, system/docker and kubernetes, I could use similar metrics from both, e.g. system.mem.used vs kubernetes.memory.usage.
If I deploy a WPA resource with
metricSelector:
  matchLabels:
    app: {{ .Chart.Name }}
    release: {{ .Release.Name }}
will both metric measurements be scoped to the container level per pod, or do they have different scopes, and which?
Which of these might be the better choice to control the scaling?
Can WPA support OpenShift 3.11 or Kubernetes 1.11?
I tried to apply WPA on OKD 3.11 and got the following error:
must only have "properties", "required" or "description" at the root if the status subresource is enabled
I removed the "subresources" section from the WatermarkPodAutoscaler CRD, and it could then be deployed successfully.
file datadoghq.com_watermarkpodautoscalers_crd.yaml:
  shortNames:
  - wpa
  singular: watermarkpodautoscaler
  scope: Namespaced
  subresources:  # delete these two lines
    status: {}   # delete these two lines
  validation:
    openAPIV3Schema:
      description: WatermarkPodAutoscaler is the Schema for the watermarkpodautoscalers API
But then I faced a problem: the Datadog Cluster Agent didn't detect the created WPA and didn't collect custom metrics from the Datadog server when I added the WPA below.
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: consumer
  namespace: kafka-project
spec:
  # Add fields here
  algorithm: average
  maxReplicas: 15
  minReplicas: 1
  tolerance: 0.01
  downscaleForbiddenWindowSeconds: 300
  upscaleForbiddenWindowSeconds: 15
  scaleUpLimitFactor: 50
  scaleDownLimitFactor: 20
  scaleTargetRef:
    kind: Deployment
    apiVersion: apps/v1
    name: consumer
  readinessDelay: 10
  metrics:
    # Resource or External type supported
    # Example usage of External type
    - type: External
      external:
        highWatermark: "1"
        lowWatermark: "1"
        metricName: <metrics_name>
        metricSelector:
          matchLabels:
            kube_deployment: consumer
            kube_namespace: kafka-project
Error log is below.
Datadog cluster agent
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/aggregator/aggregator.go:554 in sendEvents) | Flushing 1 events to the forwarder
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/aggregator/aggregator.go:393 in pushSeries) | Flushing 2 series to the forwarder
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/aggregator/aggregator.go:506 in sendServiceChecks) | Flushing 5 service checks to the forwarder
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/serializer/split/split.go:77 in Payloads) | The payload was not too big, returning the full payload
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/serializer/split/split.go:77 in Payloads) | The payload was not too big, returning the full payload
2020-05-12 12:13:04 UTC | CLUSTER | DEBUG | (pkg/serializer/split/split.go:77 in Payloads) | The payload was not too big, returning the full payload
2020-05-12 12:13:05 UTC | CLUSTER | DEBUG | (pkg/collector/runner/runner.go:263 in work) | Running check kubernetes_apiserver
2020-05-12 12:13:05 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/leaderelection/leaderelection.go:164 in EnsureLeaderElectionRuns) | Currently Leader: true. Leader identity: "datadog-cluster-agent-59858975fd-98rfr"
2020-05-12 12:13:05 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/common/common.go:23 in GetResourcesNamespace) | No configured namespace for the resource, fetching from the current context
2020-05-12 12:13:05 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/events.go:55 in RunEventCollection) | Starting to watch from 60726555
2020-05-12 12:13:07 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/events.go:113 in RunEventCollection) | Collected 2 events, will resume watching from resource version 60726597
2020-05-12 12:13:07 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/common/common.go:23 in GetResourcesNamespace) | No configured namespace for the resource, fetching from the current context
2020-05-12 12:13:07 UTC | CLUSTER | DEBUG | (pkg/util/kubernetes/apiserver/apiserver.go:328 in UpdateTokenInConfigmap) | Updated event.tokenKey to 60726597 in the ConfigMap datadogtoken
2020-05-12 12:13:07 UTC | CLUSTER | DEBUG | (pkg/collector/runner/runner.go:329 in work) | Done running check kubernetes_apiserver
2020-05-12 12:13:08 UTC | CLUSTER | DEBUG | (pkg/clusteragent/custommetrics/provider.go:196 in GetExternalMetric) | External metrics returned: []external_metrics.ExternalMetricValue{}
WPA controller
E0512 12:10:52.718448 1 memcache.go:199] couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1
{"level":"info","ts":1589285452.7814093,"logger":"wpa_controller","msg":"Target deploy","Request.Namespace":"kafka-project","Request.Name":"consumer","replicas":2}
{"level":"error","ts":1589285452.7956553,"logger":"wpa_controller","msg":"The WPA controller was unable to update the number of replicas","Request.Namespace":"kafka-project","Request.Name":"consumer","error":"the server could not find the requested resource (put watermarkpodautoscalers.datadoghq.com consumer)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:428\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"info","ts":1589285467.796208,"logger":"wpa_controller","msg":"Reconciling WatermarkPodAutoscaler","Request.Namespace":"kafka-project","Request.Name":"consumer"}
{"level":"info","ts":1589285467.8238223,"logger":"wpa_controller","msg":"Target deploy","Request.Namespace":"kafka-project","Request.Name":"consumer","replicas":2}
{"level":"error","ts":1589285467.8450387,"logger":"wpa_controller","msg":"The WPA controller was unable to update the number of replicas","Request.Namespace":"kafka-project","Request.Name":"consumer","error":"the server could not find the requested resource (put watermarkpodautoscalers.datadoghq.com consumer)","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:428\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
Is there a workaround for supporting k8s 1.11? Could you help add support for k8s 1.11?
Thanks for your help.
Since this controller extracts metrics using the Datadog API, I would like to know how this can be brought in line with the low API rate limits for these kinds of calls (source: https://docs.datadoghq.com/api/#rate-limiting).
Additionally, there is no way to monitor the current rate-limit budget, so it just starts failing silently.
There should at least be some kind of warning in the README, or information on what to do about this.
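For what it's worth, the rate-limiting page linked above documents X-RateLimit-* response headers, so the remaining budget can at least be checked manually; a sketch (endpoint, query, and keys are placeholders):

curl -s -D - -o /dev/null \
  -H "DD-API-KEY: <api_key>" -H "DD-APPLICATION-KEY: <app_key>" \
  "https://api.datadoghq.com/api/v1/query?from=<ts>&to=<ts>&query=<query>"
# inspect X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset in the response headers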
21s Warning Failed pod/watermarkpodautoscaler-66d6d96c96-9ms4b Error: failed to create containerd task: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "watermarkpodautoscaler": executable file not found in $PATH: unknown
Describe what happened:
WPA tries to scale out an OpenShift DeploymentConfig but gets the following error:
{"level":"info","ts":1600848500.4110653,"logger":"wpa_controller","msg":"Reconciling WatermarkPodAutoscaler","Request.Namespace":"nginx-preloader-sample","Request.Name":"wpa4"}
{"level":"error","ts":1600848500.412769,"logger":"wpa_controller","msg":"RunTime error in reconcileWPA","Request.Namespace":"nginx-preloader-sample","Request.Name":"wpa4","returnValue":"runtime error: invalid memory address or nil pointer dereference","error":"recover error","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA.func1\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:360\nruntime.gopanic\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/panic.go:679\nruntime.panicmem\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/panic.go:199\nruntime.sigpanic\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/signal_unix.go:394\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:379\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
WPA.yaml
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: wpa4
  namespace: nginx-preloader-sample
spec:
  algorithm: average
  maxReplicas: 20
  minReplicas: 1
  tolerance: 0.01
  downscaleForbiddenWindowSeconds: 300
  upscaleForbiddenWindowSeconds: 15
  scaleUpLimitFactor: 90
  scaleDownLimitFactor: 90
  scaleTargetRef:
    kind: DeploymentConfig
    apiVersion: apps.openshift.io/v1
    name: nginx-prepared
  readinessDelay: 10
  metrics:
    - type: External
      external:
        highWatermark: "1"
        lowWatermark: "1"
        metricName: federatorai.recommendation
        metricSelector:
          matchLabels:
            resource: replicas
            kube_cluster: jason-4-115
            oshift_deployment_config: nginx-prepared
            kube_namespace: nginx-preloader-sample
The Cluster Agent can get the external metric:
* watermark pod autoscaler: nginx-preloader-sample/wpa4
Metric name: federatorai.recommendation
Labels:
- kube_cluster: jason-4-115
- kube_namespace: nginx-preloader-sample
- oshift_deployment_config: nginx-prepared
- resource: replicas
Value: 6
Timestamp: 2020-09-23 08:18:00.000000 UTC
Valid: true
Another WPA works as expected:
* watermark pod autoscaler: myproject/wpa3
Metric name: federatorai.recommendation
Labels:
- kube_cluster: jason-4-115
- kube_deployment: consumer3
- kube_namespace: myproject
- resource: replicas
Value: 7
Timestamp: 2020-09-23 08:25:00.000000 UTC
Valid: true
WPA yaml
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: wpa3
  namespace: myproject
spec:
  # Add fields here
  # algorithm must be average
  algorithm: average
  maxReplicas: 15
  minReplicas: 1
  tolerance: 0.01
  downscaleForbiddenWindowSeconds: 300
  upscaleForbiddenWindowSeconds: 15
  scaleUpLimitFactor: 90
  scaleDownLimitFactor: 90
  scaleTargetRef:
    kind: Deployment
    apiVersion: apps/v1
    name: consumer3
  readinessDelay: 10
  metrics:
    # Resource or External type supported
    # Example usage of External type
    - type: External
      external:
        # do not edit highWatermark and lowWatermark
        # highWatermark and lowWatermark must be 1
        highWatermark: "1"
        lowWatermark: "1"
        metricName: federatorai.recommendation
        metricSelector:
          matchLabels:
            resource: replicas
            kube_cluster: jason-4-115
            kube_deployment: consumer3
            kube_namespace: myproject
WPA log
{"level":"info","ts":1600849222.9556293,"logger":"wpa_controller","msg":"Successful rescale","Request.Namespace":"myproject","Request.Name":"wpa3","currentReplicas":6,"desiredReplicas":7,"rescaleReason":"federatorai.recommendation{map[kube_cluster:jason-4-115 kube_deployment:consumer3 kube_namespace:myproject resource:replicas]} above target"}
Describe what you expected:
WPA should scale out the OpenShift DeploymentConfig successfully.
Steps to reproduce the issue:
Additional environment details (Kubernetes version, etc):
openshift v3.11.0+8f721f2-450
kubernetes v1.11.0+d4cacc0
WPA image
image: datadog/watermarkpodautoscaler:v0.1.0
"docker.io/datadog/watermarkpodautoscaler:v0.3.0-rc5": failed to resolve
reference "docker.io/datadog/watermarkpodautoscaler:v0.3.0-rc5": docker.io/datadog/watermarkpodautoscaler:v0.3.0-rc5:
Trying to understand the difference (or historical evolution?) from A) external metrics with the query embedded in the WPA resource spec, to B) external metrics referencing a custom metric defined as a dedicated resource.
A.)
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: {{ .Chart.Name }}
  namespace: {{ .Release.Namespace }}
spec:
  metrics:
    - type: External
      external:
        metricName: "<METRIC_NAME>"
        metricSelector:
          matchLabels:
            <TAG_KEY>: <TAG_VALUE>
B.)
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: your_datadogmetric_name
  namespace: {{ .Release.Namespace }}
  labels:
    {{- include "labels" . | indent 4 }}
spec:
  query: avg:kubernetes.cpu.usage{app:myapp,release:myapp}.rollup(30)
---
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: {{ .Chart.Name }}
  namespace: {{ .Release.Namespace }}
spec:
  metrics:
    - type: External
      external:
        metricName: "datadogmetric@{{ .Release.Namespace }}:your_datadogmetric_name"
1.) Is B the newer way to spec the metrics used in a WPA, replacing A, since Kubernetes v1.2 allows this?
2.) Why does B outweigh A in features, if it does? Is A no longer best practice, or even due to be sunsetted?
3.) Instead of defining a metric resource via a k8s manifest, can I also use the beta Datadog UI feature to create a custom metric named "your_datadogmetric_name", and is it then referenceable in any WPA resource spec as well?
4.) If I change the query live in the UI for an already deployed WPA using it, how fast will the change be picked up?
5.) Will the "labels" used in the query always match the labels I specified on the application I want to apply the metric filter to (e.g. deployment.metadata.labels)?
6.) Creating a new custom metric either via the UI or a k8s manifest, which tags can they filter on (e.g. pods, deployments, daemon sets)?
Describe what happened:
When I run the autoscaler in HA, with a second pod in standby, the second pod logs that it is standing by a few times and then exits. Eventually the pod enters a CrashLoopBackOff state, which means it will not actually be standing by part of the time.
What I see in the pod describe output is:
Last State:     Terminated
  Reason:       Error
  Exit Code:    2
  Started:      Wed, 17 Mar 2021 11:54:57 -0400
  Finished:     Wed, 17 Mar 2021 11:55:27 -0400
Ready:          False
Restart Count:  3258
What I see in the logs:
{"level":"info","ts":1615997228.2988877,"logger":"cmd","msg":"Version: v0.2.0-dirty"}
{"level":"info","ts":1615997228.2989118,"logger":"cmd","msg":"Build time: 2020-09-09/20:03:05"}
{"level":"info","ts":1615997228.2989151,"logger":"cmd","msg":"Git tag: v0.2.0"}
{"level":"info","ts":1615997228.298917,"logger":"cmd","msg":"Git Commit: 3c5176693cdf2838c54298fb6f732c4ac21dbe86"}
{"level":"info","ts":1615997228.2989194,"logger":"cmd","msg":"Go Version: go1.13.15"}
{"level":"info","ts":1615997228.2989216,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1615997228.2989237,"logger":"cmd","msg":"Version of operator-sdk: v0.13.0"}
{"level":"info","ts":1615997228.299064,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1615997229.3133864,"logger":"leader","msg":"Found existing lock","LockOwner":"watermarkpodautoscaler-69cc854fbf-dqjbg"}
{"level":"info","ts":1615997229.3252141,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1615997230.456776,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1615997232.8421242,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1615997237.3827972,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1615997246.0937243,"logger":"leader","msg":"Not the leader. Waiting."}
Describe what you expected:
I expect that the pod can remain online and continue to check if it can become leader without panicking.
Steps to reproduce the issue:
Additional environment details (Kubernetes version, etc):
Autoscaler image: datadog/watermarkpodautoscaler:v0.2.0 (the exact commit hash is in the logs above)
I'm seeing this behaviour in multiple clusters, different kubernetes versions:
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.9", GitCommit:"94f372e501c973a7fa9eb40ec9ebd2fe7ca69848", GitTreeState:"clean", BuildDate:"2020-09-16T13:47:43Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.7", GitCommit:"6c143d35bb11d74970e7bc0b6c45b6bfdffc0bd4", GitTreeState:"clean", BuildDate:"2019-12-11T12:34:17Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Thanks Datadog, this is an exciting project. But why not support CPU metrics?
{"level":"info","ts":1624011129.786673,"logger":"controllers.WatermarkPodAutoscaler","msg":"Failed to compute desired number of replicas based on listed metrics.","watermarkpodautoscaler":"dev/myapp","reference":"Deployment/dev/myapp","error":"failed to get external metric kubernetes.cpu.usage: unable to get external metric dev/kubernetes.cpu.usage/&LabelSelector{MatchLabels:map[string]string{app: myapp,release: myapp,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: Internal error occurred: DatadogMetric is invalid, err: Invalid metric (from backend), query: avg:kubernetes.cpu.usage{app:myapp,release:myapp}.rollup(30)"}
{"level":"info","ts":1624011144.793971,"logger":"controllers.WatermarkPodAutoscaler","msg":"Target deploy","watermarkpodautoscaler":"dev/myapp","replicas":2}
{"level":"info","ts":1624011144.7941537,"logger":"controllers.WatermarkPodAutoscaler","msg":"getReadyPodsCount","watermarkpodautoscaler":"dev/myapp","full podList length":2,"toleratedAsReadyPodCount":2,"incorrectly targeted pods":0}
{"level":"info","ts":1624011144.8295028,"logger":"controllers.WatermarkPodAutoscaler","msg":"Failed to compute desired number of replicas based on listed metrics.","watermarkpodautoscaler":"dev/myapp","reference":"Deployment/dev/myapp","error":"failed to get external metric kubernetes.cpu.usage: unable to get external metric dev/kubernetes.cpu.usage/&LabelSelector{MatchLabels:map[string]string{app: myapp,release: myapp,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: Internal error occurred: DatadogMetric is invalid, err: Invalid metric (from backend), query: avg:kubernetes.cpu.usage{app:myapp,release:myapp}.rollup(30)"}
{"level":"info","ts":1624011159.8374639,"logger":"controllers.WatermarkPodAutoscaler","msg":"Target deploy","watermarkpodautoscaler":"dev/myapp","replicas":2}
Ask for metric (without tags)
% kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dev/kubernetes.cpu.usage | jq ."
Error from server (InternalError): Internal error occurred: DatadogMetric not found for metric name: kubernetes.cpu.usage | jq ., datadogmetricid: datadog/dcaautogen-646a73ad876299907eb8035a2fa8e2b60ac832
Impersonate and ask for any metric
% kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dev/metric --as system:serviceaccount:dev:watermarkpodautoscaler"
Error from server (InternalError): Internal error occurred: DatadogMetric not found for metric name: metric --as system:serviceaccount:dev:watermarkpodautoscaler, datadogmetricid: datadog/dcaautogen-98eae079d0f4a80135d6f4b6f9762cea878b97
What did I do wrong?
How can I find out in general which metrics, and which tags for them, are available?
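One thing stands out in both commands above: the pipe to jq and the --as flag ended up inside the quoted URL, so the API server treated them as part of the metric name, which is exactly what the error messages echo back. A corrected sketch, keeping the same paths:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dev/kubernetes.cpu.usage" | jq .
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dev/metric" --as system:serviceaccount:dev:watermarkpodautoscaler | jq .

As for discovering what is available, listing the API group root should enumerate the registered external metrics:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .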
I went ahead and applied the contents of deploy/ to get a test of this running, and created a WPA for a deployment.
I got various errors about missing RBAC permissions in the provided role, so I added a few (deployment/replicaset/statefulset get/list/watch; service create/update/get). That didn't solve the problems though. I checked and saw that the provided yamls apply version v0.0.1, so I went ahead and set it up to use v0.1.0 instead. Now I'm getting the following errors:
{"level":"info","ts":1578589126.4794915,"logger":"cmd","msg":"Version: 0.0.1"}
{"level":"info","ts":1578589126.4795249,"logger":"cmd","msg":"Build time: "}
{"level":"info","ts":1578589126.4795303,"logger":"cmd","msg":"Git tag: "}
{"level":"info","ts":1578589126.4795349,"logger":"cmd","msg":"Git Commit: "}
{"level":"info","ts":1578589126.4795387,"logger":"cmd","msg":"Go Version: go1.13.4"}
{"level":"info","ts":1578589126.4795427,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1578589126.4795468,"logger":"cmd","msg":"Version of operator-sdk: v0.12.0"}
{"level":"info","ts":1578589126.479719,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1578589127.4483173,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1578589127.453766,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1578589128.408475,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"0.0.0.0:8383"}
{"level":"info","ts":1578589128.4087286,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1578589128.4106596,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"watermarkpodautoscaler-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1578589128.4107704,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1578589128.4110043,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1578589128.5111277,"logger":"controller-runtime.controller","msg":"Starting Controller","controller":"watermarkpodautoscaler-controller"}
{"level":"info","ts":1578589128.6113448,"logger":"controller-runtime.controller","msg":"Starting workers","controller":"watermarkpodautoscaler-controller","worker count":1}
{"level":"info","ts":1578589128.6114364,"logger":"wpa_controller","msg":"Reconciling WatermarkPodAutoscaler","Request.Namespace":"default","Request.Name":"statsdgenerator-wpa"}
E0109 16:58:49.515122 1 memcache.go:199] couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1
{"level":"error","ts":1578589129.5824547,"logger":"wpa_controller","msg":"RunTime error in reconcileWPA","Request.Namespace":"default","Request.Name":"statsdgenerator-wpa","returnValue":"runtime error: invalid memory address or nil pointer dereference","error":"recover error","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\twatermarkpodautoscaler/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA.func1\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:360\nruntime.gopanic\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/panic.go:679\nruntime.panicmem\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/panic.go:199\nruntime.sigpanic\n\t/usr/local/Cellar/go/1.13.4/libexec/src/runtime/signal_unix.go:394\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).reconcileWPA\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:379\ngithub.com/DataDog/watermarkpodautoscaler/pkg/controller/watermarkpodautoscaler.(*ReconcileWatermarkPodAutoscaler).Reconcile\n\twatermarkpodautoscaler/pkg/controller/watermarkpodautoscaler/watermarkpodautoscaler_controller.go:344\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\twatermarkpodautoscaler/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\twatermarkpodautoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
Thoughts?
It seems that K8s 1.22 is not compatible with WPA, since your CRDs still use apiVersion: apiextensions.k8s.io/v1beta1.
Link: CRD
Link to the Kubernetes documentation: Deprecated API Migration Guide
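Since apiextensions.k8s.io/v1beta1 was removed in 1.22, a server-side dry run against a 1.22 cluster should confirm the incompatibility (a sketch; the CRD path is assumed from the deploy/ layout mentioned elsewhere in these issues):

kubectl apply --dry-run=server -f deploy/crds/datadoghq.com_watermarkpodautoscalers_crd.yaml
# expected to fail with: no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"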
- hack/install-wwhrd.sh fails to install an ARM64 version (the amd64 version works; there's probably a new release as well)
- hack/install-yq.sh fails to install an ARM64 version (the amd64 version works; we need to move to yq 4 to have an arm64 version)
- go install sigs.k8s.io/controller-tools/cmd/[email protected] needs to be used instead of go get; it is not covered by install-tools but is required for the tests
- hack/install-kubebuilder.sh only copies the kubebuilder binary and forgets the assets (etcd/kube-apiserver). Downloading the amd64 version and replacing etcd with an arm64 version works
- make e2e doesn't work; the pod says runtime: failed to create new OS thread (have 2 already; errno=22)
Describe what happened:
Resource-type WPA metrics do not appear to be calculating average values; rather, they are calculating totals.
Describe what you expected:
I expect the values to be averaged.
> k top pod
NAME                                       CPU(cores)   MEMORY(bytes)
ingress-nginx-controller-64dd76d79-lfckl   7m           122Mi
ingress-nginx-controller-64dd76d79-rzxvc   8m           124Mi
> k describe wpa nginx-wpa | grep -A 1 "Current Average Value"
Current Average Value:  16m
Name:                   cpu
--
Current Average Value:  258920448
Name:                   memory
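The reported numbers do line up with sums rather than averages: 7m + 8m = 15m, close to the reported 16m, and 122Mi + 124Mi = 246Mi ≈ 258920448 bytes (about 246.9Mi), whereas the per-pod averages would be roughly 7.5m and 123Mi.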
Hello 👋
Are there plans to publish this to either the official Helm Chart repo or one of your own making?
It would help a lot with installing this if we didn't have to commit a clone of this repo to our own VCS just to use the helm chart :)
ERR - myservice/templates/watermarkpodautoscaler.yaml: Failed initializing schema https://kubernetesjsonschema.dev/master-standalone-strict/watermarkpodautoscaler-datadoghq-v1alpha1.json: Could not read schema from HTTP, response status is 404 Not Found
dry run...
history.go:56: [debug] getting history for release myservice
upgrade.go:123: [debug] preparing upgrade for vehicle-region-store
upgrade.go:131: [debug] performing update for vehicle-region-store
I created a WPA object which looks for internal metrics as described here:
apiVersion: datadoghq.com/v1alpha1
kind: WatermarkPodAutoscaler
metadata:
  name: watermarkpodautoscaler-internal
spec:
  maxReplicas: 3
  minReplicas: 1
  tolerance: 1
  readinessDelay: 10
  scaleTargetRef:
    kind: Deployment
    apiVersion: apps/v1
    name: nginx-deployment-wpa
  metrics:
    - type: Resource
      resource:
        highWatermark: "100m"
        lowWatermark: "50m"
        name: cpu
        metricSelector:
          matchLabels:
            app: nginx-wpa
The status tells me that it is being monitored:
kubectl describe wpa watermarkpodautoscaler-internal
[...]
Current Metrics:
  Resource:
    Current Average Value:  24m
    Name:                   cpu
  Type:                     Resource
[...]
Unfortunately, a simple get command shows me empty fields for VALUE, WATERMARK, etc.:
❯ k get wpa watermarkpodautoscaler-internal
NAME                              VALUE   HIGH WATERMARK   LOW WATERMARK   AGE   MIN REPLICAS   MAX REPLICAS   DRY-RUN
watermarkpodautoscaler-internal                                            14m   1              3
Can we improve this?
It seems the container arguments used are unknown to the container's Go application:
kubectl logs watermarkpodautoscaler-75cd69b9f7-tz88b
flag provided but not defined: -zap-level
Usage of /manager:
  -enable-leader-election
        Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager. (default true)
  -health-port int
        Port to use for the health probe (default 9440)
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-election-resource string
        determines which resource lock to use for leader election. option:[configmapsleases|endpointsleases|configmaps] (default "configmaps")
  -logEncoder string
        log encoding ('json' or 'console') (default "json")
  -loglevel value
        Set log level
  -metrics-addr string
        The address the metric endpoint binds to. (default ":8080")
  -syncPeriodSeconds int
        The informers resync period in seconds (default 3600)
  -version
        print version and exit
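Judging from the usage output above, this binary no longer accepts -zap-level; the listed -loglevel and -logEncoder flags look like the replacements. A sketch of the container args under that assumption (the flag values are guesses, not confirmed):

args:
  - "-loglevel=debug"
  - "-logEncoder=json"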