
Comments (7)

AnthonMS commented on August 16, 2024

I have tried adding these two HPA configurations:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-prod-phpfpm
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: valinor-api-prod
  minReplicas: 4
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metricName: phpfpm_active_processes
        targetAverageValue: 6
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-prod-external
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: valinor-api-prod
  minReplicas: 4
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric: 
          name: custom.googleapis.com|phpfpm_active_processes
        target:
          averageValue: 6
          type: AverageValue

Neither of them works as it should. The one using the External type does, however, look a little stranger than the other. (Edit: doesn't it actually look like it is correctly finding the custom metric?) Here is the result:
[screenshot]

I know some statistics are reaching Google, since I can see phpfpm_active_processes in Metrics Explorer as a custom metric. It does look a bit off, though: the active processes don't seem to be distributed across the different pods, at least according to Metrics Explorer. But when I look directly at /fpm-status and keep refreshing, I can see the active processes differ between pods in most cases.
This might be the phpfpm exporter container or the Google prometheus-to-sd container, as the latter is still throwing this error:

1 stackdriver.go:60] Error while sending request to Stackdriver json: error calling MarshalJSON for type *monitoring.CreateTimeSeriesRequest: json: error calling MarshalJSON for type *monitoring.TimeSeries: json: error calling MarshalJSON for type *monitoring.Point: json: error calling MarshalJSON for type *monitoring.TypedValue: json: error calling MarshalJSON for type *monitoring.Distribution: json: unsupported value: NaN
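
A quick way to see which exporter series actually carries the NaN (a rough sketch, assuming the php-fpm exporter listens on its default port 9253) is to port-forward to one of the pods and grep the raw metrics:

kubectl port-forward deployment/valinor-api-prod 9253:9253
# in a second terminal: look for series the exporter reports as NaN
curl -s http://localhost:9253/metrics | grep -i 'nan'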

I was hoping I could scale based on active FPM processes, but if that is not possible, will it be possible to scale based on requests per second to each pod? Something like the first example in this custom metrics adapter?

Edit: But then the HPA using the External type sometimes looks like this, and that is what I find weird.
[screenshot]
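
For what it's worth, the External metric can also be queried directly from the external metrics API (a sketch; the | in the metric name has to be URL-escaped as %7C):

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/custom.googleapis.com%7Cphpfpm_active_processes"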


igoooor commented on August 16, 2024

I have the same logs from gcr.io/google-containers/prometheus-to-sd, and as a result I could indeed not use phpfpm_active_processes as an HPA metric.
You can also see that if you call kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/YOUR_NAMESPACE/pods/*/phpfpm_active_processes", the items will be empty.
The HPA uses that same endpoint, which is why it's not working: there are no items.
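A related sanity check (a sketch, assuming jq is installed) is to list which metric names the adapter exposes at all via the discovery endpoint:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name'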
I could solve this issue by using version v0.9.0 instead of v0.9.2.
The ...MarshalJSON... log is still there, but the metric is now properly retrieved by the HPA (and by the kubectl command).
I can now use phpfpm_active_processes for my HPA.
I hope this will help you.


AnthonMS commented on August 16, 2024

I could solve this issue by using version v0.9.0 instead of v0.9.2.

If you are talking about the prometheus-to-sd image version, then as you can see in my issue post, I am already using
gcr.io/google-containers/prometheus-to-sd:v0.9.2

Edit: Ahh shit sorry, I read it in the wrong order. I have been looking at yaml configs for too long. I will try it out, thank you.

And I cannot use phpfpm_active_processes as an HPA metric. I also tried hitting the raw endpoint with a command like the one you suggest, and as you say, the items are empty; I figured that was why the metric is not working.

Can you give me any insight into what you might have done differently in your setup?


igoooor commented on August 16, 2024

Here is my full YAML for the prometheus-to-sd sidecar:

- name: prometheus-to-sd
  image: gcr.io/google-containers/prometheus-to-sd:v0.9.0
  ports:
    - name: profiler
      containerPort: 6060
  command:
    - /monitor
    - --stackdriver-prefix=custom.googleapis.com
    - --source=:http://localhost:9253
    - --pod-id=$(POD_NAME)
    - --namespace-id=$(POD_NAMESPACE)
    - --cluster-location=$(CLUSTER_REGION)
    - --monitored-resource-type-prefix=k8s_
    - --scrape-interval=10s
    - --export-interval=10s
  resources:
    requests:
      cpu: 10m
    limits:
      cpu: 10m
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: CLUSTER_REGION
      value: REDACTED

I think it looks pretty similar to yours.
But as I said previously, I was initially using v0.9.2, which resulted in the HPA not working.
Then I switched to v0.9.0 and the HPA started working.
My HPA YAML looks like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: REDACTED
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: REDACTED
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Pods
      pods:
        metric:
          name: phpfpm_active_processes
        target:
          type: AverageValue
          averageValue: 80 # or whatever fits your case

And I'm using Kubernetes v1.24.1.
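
Once the metric is being served, the HPA status should show a current value instead of <unknown>; roughly this should confirm it (a sketch, using the redacted name from above):

kubectl describe hpa REDACTED
kubectl get hpa REDACTED --watch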


AnthonMS commented on August 16, 2024

It does look very similar. I am afraid, though, that it is not only this container that is causing me trouble. Are you using the Google Custom Metrics Stackdriver adapter?

I had some trouble setting it up to begin with, but had more success after installing it like this, from this issue:

gcloud iam service-accounts create custom-metrics-sd-adapter --project "$GCP_PROJECT_ID"

gcloud projects add-iam-policy-binding "$GCP_PROJECT_ID" \
  --member "serviceAccount:custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/monitoring.editor"

gcloud iam service-accounts add-iam-policy-binding \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$GCP_PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]" \
  "custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com"

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
  "iam.gke.io/gcp-service-account=custom-metrics-sd-adapter@$GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace custom-metrics
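
After applying this, the registration can be verified with something like the following (a sketch; the namespace matches the adapter.yaml above):

kubectl get apiservices | grep custom.metrics
kubectl -n custom-metrics get pods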

But I am getting errors like this:

E0712 08:46:48.948553       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0712 08:46:48.948614       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0712 08:46:48.949679       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0712 08:46:48.951290       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="5b1dd951-6bba-4174-afd6-b4dccf87180d"
E0712 08:46:48.951344       1 timeout.go:135] post-timeout activity - time-elapsed: 3.869µs, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0712 08:46:49.148354       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0712 08:46:49.148403       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0712 08:46:49.148460       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="7a3c4de3-18f0-4846-94e2-104be13e4ff1"
E0712 08:46:49.149401       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0712 08:46:49.149460       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0712 08:46:49.149801       1 writers.go:111] apiserver was unable to close cleanly the response writer: http2: stream closed
E0712 08:46:49.149857       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="7d8e9fdb-3f23-4fa8-952a-8f6d3daeb258"
E0712 08:46:49.150521       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="ed232e12-b6e4-4f1e-a407-73c65c7d23ab"
E0712 08:46:49.152893       1 timeout.go:135] post-timeout activity - time-elapsed: 4.392273ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0712 08:46:49.156903       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0712 08:46:49.157252       1 timeout.go:135] post-timeout activity - time-elapsed: 7.365381ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0712 08:46:49.163731       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0712 08:46:49.163798       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="3ff962e1-5437-4282-a35a-5c5d7e5e5c5a"
E0712 08:46:49.164040       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0712 08:46:49.164435       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0712 08:46:49.164811       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="b69d7b5d-fef9-4ed0-9de7-1474505b516e"
E0712 08:46:49.165787       1 writers.go:117] apiserver was unable to write a JSON response: http2: stream closed
E0712 08:46:49.166040       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="7afb8bb4-c92e-4004-9a9e-d5573060011e"
E0712 08:46:49.166264       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0712 08:46:49.166924       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta1" audit-ID="fd644be2-e351-4ad3-8c09-7018cda0c978"
E0712 08:46:49.167584       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0712 08:46:49.171427       1 wrap.go:54] timeout or abort while handling: method=GET URI="/apis/custom.metrics.k8s.io/v1beta2" audit-ID="6494d9df-d5bf-402e-ae48-58b130b11625"
E0712 08:46:49.171478       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0712 08:46:49.243937       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0712 08:46:49.243949       1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E0712 08:46:49.246406       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0712 08:46:49.248929       1 timeout.go:135] post-timeout activity - time-elapsed: 98.366807ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0712 08:46:49.251310       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0712 08:46:49.253716       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0712 08:46:49.254860       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0712 08:46:49.255997       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
E0712 08:46:49.257172       1 writers.go:130] apiserver was unable to write a fallback JSON response: http2: stream closed
E0712 08:46:49.259616       1 timeout.go:135] post-timeout activity - time-elapsed: 95.781565ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0712 08:46:49.260765       1 timeout.go:135] post-timeout activity - time-elapsed: 94.56363ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0712 08:46:49.261896       1 timeout.go:135] post-timeout activity - time-elapsed: 95.660411ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>
E0712 08:46:49.262878       1 writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0712 08:46:49.264115       1 timeout.go:135] post-timeout activity - time-elapsed: 96.887285ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0712 08:46:49.268715       1 timeout.go:135] post-timeout activity - time-elapsed: 97.200859ms, GET "/apis/custom.metrics.k8s.io/v1beta2" result: <nil>

Are you using this custom metrics adapter or are you using another one? If you are using this one, are you also getting these errors? If not, how did you set that up?


AnthonMS commented on August 16, 2024

I have just tried setting it up again with the different version number, and my items are still empty, unfortunately.

As I mentioned above, I'm pretty sure it's the Google Stackdriver adapter that's causing me trouble now. I have set it up as described and I am no longer getting the 403 Forbidden errors I saw in the beginning. But as the errors above suggest, there is still something wrong with the adapter: it is getting <nil> when fetching the custom metrics. I have no idea what is going on.

When running the command:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/phpfpm_active_processes"
The result I get back is:
{"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/phpfpm_active_processes"},"items":[]}
And my containers are running in the default namespace.

Is there a command to check all namespaces for that metric? Just for fun; I'm still new to k8s and trying to learn as much as I can. I have no idea at this point what could be causing issues other than the Stackdriver adapter, and I'm kind of over it at this point.
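
One way that should work is simply looping over the namespaces and querying each one (a rough sketch, not verified against this cluster):

for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  echo "== $ns"
  kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/$ns/pods/*/phpfpm_active_processes"
  echo
done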


JamesMarino commented on August 16, 2024

I have a suspicion that there are some issues when the phpfpm data is sent to Metrics Explorer. I was getting the same kind of error message, similar to the one below:

stackdriver.go:60] Error while sending request to Stackdriver json: error calling MarshalJSON for type *monitoring.CreateTimeSeriesRequest: json: error calling MarshalJSON for type *monitoring.TimeSeries: json: error calling MarshalJSON for type *monitoring.Point: json: error calling MarshalJSON for type *monitoring.TypedValue: json: error calling MarshalJSON for type *monitoring.Distribution: json: unsupported value: NaN

My PHP-FPM metrics were not being sent to Metrics Explorer at all, but after stepping through the code here -

defer wg.Done()
req := &v3.CreateTimeSeriesRequest{TimeSeries: ts[begin:end]}

- and removing the elements of the ts slice with the problematic Distribution values, I could get the phpfpm metrics through:

defer wg.Done()

// Keep only the TimeSeries without Distribution points; a NaN inside any
// Distribution value makes MarshalJSON fail for the whole batch.
var timeSeries []*v3.TimeSeries
for _, singleTimeSeries := range ts[begin:end] {
	anyDistributionValuesFound := false

	for _, point := range singleTimeSeries.Points {
		if point.Value.DistributionValue != nil {
			anyDistributionValuesFound = true
		}
	}

	if !anyDistributionValuesFound {
		timeSeries = append(timeSeries, singleTimeSeries)
	}
}

req := &v3.CreateTimeSeriesRequest{TimeSeries: timeSeries}

What I assume is happening is that further down the line, when this bulk ts []*v3.TimeSeries is sent, the phpfpm metrics may or may not go through depending on where in the slice they sit: the request errors out as soon as a bad Distribution value is about to be sent, so none of the remaining metrics are sent either.

This is by no means a fix for the underlying problem, just an observation / quick fix I was able to put in place - I assume these issues with the Distribution metrics originate upstream somewhere.

