Comments (17)
@openshift/sig-storage
from cluster-monitoring-operator.
@brancz a significant source of flakes in the 3.11 e2es
@mxinden those warnings were recently fixed, no? Did we update the Prometheus Operator version?
Something I can think of off the top of my head: can you check whether that annotation is set so that pods are not enforced onto the worker nodes?
// edit: but it looks like we do do that in the ansible roles
We'd be getting events if we couldn't schedule.
Ah, I actually found something in the logs of the cluster-monitoring-operator:
E0903 13:43:24.094244 1 operator.go:186] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0903 13:43:24.187314 1 operator.go:187] Sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: timed out waiting for the condition
What do the Alertmanager logs say, and are the Pods from its StatefulSet even created/scheduled?
For the record, `Index with name namespace does not exist` was fixed in prometheus-operator/prometheus-operator#1706. It should just be ignored, as client-go falls back to simply iterating the given store.
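That fallback behavior can be pictured roughly like this; a stdlib-only sketch with hypothetical names (`store`, `byIndex`), not client-go's actual code:

```go
package main

import "fmt"

// store is a toy stand-in for a client-go cache.Indexer: objects plus
// optional named indices mapping an indexed value to object keys.
type store struct {
	objects map[string]string              // object key -> object
	indices map[string]map[string][]string // index name -> indexed value -> object keys
}

// byIndex looks objects up via a named index; if the index does not
// exist, it falls back to scanning every object in the store, which is
// why the "Index with name namespace does not exist" warning is
// harmless in practice (just slower).
func (s *store) byIndex(indexName, indexedValue string, match func(obj string) bool) []string {
	if idx, ok := s.indices[indexName]; ok {
		var out []string
		for _, key := range idx[indexedValue] {
			out = append(out, s.objects[key])
		}
		return out
	}
	// Fallback: full iteration of the store.
	var out []string
	for _, obj := range s.objects {
		if match(obj) {
			out = append(out, obj)
		}
	}
	return out
}

func main() {
	s := &store{objects: map[string]string{
		"openshift-monitoring/alertmanager-main-2": "alertmanager-main-2",
		"default/foo":                              "foo",
	}}
	// No "namespace" index exists, so this scans all objects.
	got := s.byIndex("namespace", "openshift-monitoring", func(obj string) bool {
		return obj == "alertmanager-main-2"
	})
	fmt.Println(got)
}
```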
I see alertmanager pods here, so they get created at some point: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/
@smarterclayton Thanks a lot for the logs and pods traces, this helps tremendously! 👍
I think we can narrow it down to the cluster-monitoring-operator never reaching the point of creating the node-exporter DaemonSet at all.
In fact it also doesn't create the kube-state-metrics Deployment (both artifacts are missing from the pods overview).
While looking at the operator logic in cluster-monitoring-operator/pkg/operator/operator.go (lines 238 to 244 at d6d5b11),
I see the following log entries from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_cluster-monitoring-operator-5cf8fccc6-mdc92_cluster-monitoring-operator.log.gz:
I0903 13:55:04.687436 1 tasks.go:37] running task Updating Prometheus Operator
...
I0903 13:55:23.094065 1 tasks.go:37] running task Updating Grafana
...
I0903 13:55:30.198715 1 tasks.go:37] running task Updating Prometheus-k8s
...
I0903 14:01:30.888512 1 tasks.go:37] running task Updating Alertmanager
E0903 13:49:12.972447 1 operator.go:186] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0903 13:49:12.972481 1 operator.go:187] Sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: timed out waiting for the condition
...
(retries the above a couple of times)
...
I0903 14:01:30.888512 1 tasks.go:37] running task Updating Alertmanager
I0903 14:01:30.888611 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.090741 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.187570 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.287753 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.304566 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.487412 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.490592 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.493077 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.587637 1 decoder.go:224] decoding stream as YAML
<EOF>
This might be a simple flake, where the e2e test simply gives up "too fast"; admittedly, it does take a long time (~15 minutes).
@smarterclayton @brancz: are the monitoring (and also alertmanager) images downloaded from the internet, or are they cached internally within OpenShift?
It would be pretty invasive, but it is possible to parallelize some of these things. Right now the big dependency is that the Prometheus Operator must be set up first; after that, the only remaining dependency is between Prometheus and Grafana, as the Grafana task sets up some resources that the Prometheus task depends on.
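That ordering constraint (Prometheus Operator first, then everything else in parallel, with Prometheus waiting on Grafana) could be sketched with plain goroutines. A hedged stdlib-only sketch; the task names come from the logs above, but `runTask`/`runAll` are hypothetical, not the operator's real task runner:

```go
package main

import (
	"fmt"
	"sync"
)

// runTask is a stand-in for the operator's real task implementations.
func runTask(name string) error {
	fmt.Println("running task Updating", name)
	return nil
}

// runAll runs the Prometheus Operator task first, then the remaining
// tasks in parallel, except that Prometheus waits on Grafana because
// the Grafana task creates resources the Prometheus task depends on.
func runAll() error {
	if err := runTask("Prometheus Operator"); err != nil {
		return err
	}

	grafanaDone := make(chan struct{})
	errs := make(chan error, 3)
	var wg sync.WaitGroup

	wg.Add(3)
	go func() { defer wg.Done(); errs <- runTask("Alertmanager") }()
	go func() {
		defer wg.Done()
		errs <- runTask("Grafana")
		close(grafanaDone)
	}()
	go func() {
		defer wg.Done()
		<-grafanaDone // Prometheus must run after Grafana.
		errs <- runTask("Prometheus-k8s")
	}()

	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := runAll(); err != nil {
		fmt.Println("sync failed:", err)
	}
}
```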
Just some more data points from the event log on why alertmanager was reluctant to start, especially alertmanager-main-2.
It seems the GCP persistent disks have quite some hiccups when mounting. From https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/events.json:
"firstTimestamp": "2018-09-03T13:39:44Z",
"lastTimestamp": "2018-09-03T13:39:46Z",
"message": "pod has unbound PersistentVolumeClaims (repeated 3 times)",
...
"firstTimestamp": "2018-09-03T13:39:46Z",
"lastTimestamp": "2018-09-03T13:39:46Z",
"message": "Successfully assigned openshift-monitoring/alertmanager-main-2 to ci-op-3nmhd3lm-eb354-ig-n-w1lk",
...
"firstTimestamp": "2018-09-03T13:39:54Z",
"lastTimestamp": "2018-09-03T13:39:54Z",
"message": "AttachVolume.Attach succeeded for volume \"pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\" ",
...
"firstTimestamp": "2018-09-03T13:41:49Z",
"lastTimestamp": "2018-09-03T14:02:13Z",
"message": "Unable to mount volumes for pod \"alertmanager-main-2_openshift-monitoring(d0f871a5-af7e-11e8-8cd1-42010a8e0005)\": timeout expired waiting for volumes to attach or mount for pod \"openshift-monitoring\"/\"alertmanager-main-2\". list of unmounted volumes=[alertmanager-main-db]. list of unattached volumes=[alertmanager-main-db config-volume secret-alertmanager-main-tls secret-alertmanager-main-proxy alertmanager-main-token-f24kl]",
...
"firstTimestamp": "2018-09-03T13:50:03Z",
"lastTimestamp": "2018-09-03T14:00:03Z",
"message": "MountVolume.WaitForAttach failed for volume \"pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\" : Could not find attached GCE PD \"kubernetes-dynamic-pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\". Timeout waiting for mount paths to be created.",
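Pulling the mount-related events out of that events.json artifact takes only a few lines of stdlib Go. A sketch under the assumption that each item carries the `firstTimestamp`/`lastTimestamp`/`message` fields shown above; the sample data is inlined rather than downloaded:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// event mirrors the handful of v1.Event fields quoted in the excerpt above.
type event struct {
	FirstTimestamp string `json:"firstTimestamp"`
	LastTimestamp  string `json:"lastTimestamp"`
	Message        string `json:"message"`
}

// mountFailures returns the events whose message points at volume
// attach/mount problems, matching the two failure messages seen above.
func mountFailures(items []event) []event {
	var out []event
	for _, e := range items {
		if strings.Contains(e.Message, "Unable to mount volumes") ||
			strings.Contains(e.Message, "MountVolume.WaitForAttach failed") {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Inline sample shaped like the events.json excerpt; a real run
	// would decode the downloaded artifact instead.
	raw := `[
	 {"firstTimestamp":"2018-09-03T13:39:54Z","lastTimestamp":"2018-09-03T13:39:54Z",
	  "message":"AttachVolume.Attach succeeded for volume \"pvc-d0f130e7\" "},
	 {"firstTimestamp":"2018-09-03T13:41:49Z","lastTimestamp":"2018-09-03T14:02:13Z",
	  "message":"Unable to mount volumes for pod \"alertmanager-main-2_openshift-monitoring\""}
	]`
	var items []event
	if err := json.Unmarshal([]byte(raw), &items); err != nil {
		panic(err)
	}
	for _, e := range mountFailures(items) {
		fmt.Printf("%s .. %s  %s\n", e.FirstTimestamp, e.LastTimestamp, e.Message)
	}
}
```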
@smarterclayton just to see if our suspicion is correct, would it be possible to increase the timeout to see if it eventually deploys?
Yes good point.
@smarterclayton do you have any references to people in the storage team we can ping here? It seems there is not much we can do in the cluster-monitoring-operator itself.
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1627547
Closing this out here, as it does not seem to be related to the cluster-monitoring-operator.