Comments (17)
@openshift/sig-storage
from cluster-monitoring-operator.
@brancz a significant source of flakes in the 3.11 e2es
@mxinden those warnings were recently fixed, no? Did we update the Prometheus Operator version?
Something I can think of off the top of my head: can you check whether that annotation is set so that pods are not enforced onto the worker nodes?
// edit: but it looks like we do do that in the ansible roles
We'd be getting events if we couldn't schedule.
Ah, I actually found something in the logs of the cluster-monitoring-operator:
E0903 13:43:24.094244 1 operator.go:186] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0903 13:43:24.187314 1 operator.go:187] Sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: timed out waiting for the condition
What do the Alertmanager logs say, and are the Pods from its StatefulSet even created/scheduled?
For the record, `Index with name namespace does not exist` was fixed in prometheus-operator/prometheus-operator#1706. It should just be ignored, as client-go falls back to simply iterating the given store.
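That fallback behavior can be pictured roughly like this; a stdlib-only sketch with hypothetical names (`store`, `byIndex`), not client-go's actual code:

```go
package main

import "fmt"

// store is a toy stand-in for a client-go cache.Indexer: objects plus
// optional named indices mapping an indexed value to object keys.
type store struct {
	objects map[string]string              // object key -> object
	indices map[string]map[string][]string // index name -> indexed value -> object keys
}

// byIndex looks objects up via a named index; if the index does not
// exist, it falls back to scanning every object in the store, which is
// why the "Index with name namespace does not exist" warning is
// harmless in practice (just slower).
func (s *store) byIndex(indexName, indexedValue string, match func(obj string) bool) []string {
	if idx, ok := s.indices[indexName]; ok {
		var out []string
		for _, key := range idx[indexedValue] {
			out = append(out, s.objects[key])
		}
		return out
	}
	// Fallback: full iteration of the store.
	var out []string
	for _, obj := range s.objects {
		if match(obj) {
			out = append(out, obj)
		}
	}
	return out
}

func main() {
	s := &store{objects: map[string]string{
		"openshift-monitoring/alertmanager-main-2": "alertmanager-main-2",
		"default/foo":                              "foo",
	}}
	// No "namespace" index exists, so this scans all objects.
	got := s.byIndex("namespace", "openshift-monitoring", func(obj string) bool {
		return obj == "alertmanager-main-2"
	})
	fmt.Println(got)
}
```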
I see alertmanager pods here, so they get created at some point: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/
@smarterclayton Thanks a lot for the logs and pods traces, this helps tremendously! 👍
I think we can narrow it down to the cluster-monitoring-operator never reaching the point of creating the node-exporter DaemonSet at all.
In fact it also doesn't create the kube-state-metrics Deployment (both artifacts are missing from the pods overview).
While looking at the operator logic in cluster-monitoring-operator/pkg/operator/operator.go (lines 238 to 244 at d6d5b11),
I see the following log entries from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_cluster-monitoring-operator-5cf8fccc6-mdc92_cluster-monitoring-operator.log.gz:
I0903 13:55:04.687436 1 tasks.go:37] running task Updating Prometheus Operator
...
I0903 13:55:23.094065 1 tasks.go:37] running task Updating Grafana
...
I0903 13:55:30.198715 1 tasks.go:37] running task Updating Prometheus-k8s
...
I0903 14:01:30.888512 1 tasks.go:37] running task Updating Alertmanager
E0903 13:49:12.972447 1 operator.go:186] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0903 13:49:12.972481 1 operator.go:187] Sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Alertmanager failed: waiting for Alertmanager object changes failed: timed out waiting for the condition
...
(retries the above a couple of times)
...
I0903 14:01:30.888512 1 tasks.go:37] running task Updating Alertmanager
I0903 14:01:30.888611 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.090741 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.187570 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.287753 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.304566 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.487412 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.490592 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.493077 1 decoder.go:224] decoding stream as YAML
I0903 14:01:32.587637 1 decoder.go:224] decoding stream as YAML
<EOF>
This might be a simple flake, where the e2e test simply gives up "too fast"; admittedly, it does take a long time (~15 minutes).
@smarterclayton @brancz: are the monitoring (and also alertmanager) images downloaded from the internet, or are they cached internally within OpenShift?
It would be pretty invasive, but it is possible to parallelize some of these things. Right now the big dependency is that the Prometheus Operator must be set up first; after that, the only remaining dependency is between Prometheus and Grafana, as the Grafana task sets up some resources that the Prometheus task depends on.
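That ordering constraint (Prometheus Operator first, then everything else in parallel, with Prometheus waiting on Grafana) could be sketched with plain goroutines. A hedged stdlib-only sketch; the task names come from the logs above, but `runTask`/`runAll` are hypothetical, not the operator's real task runner:

```go
package main

import (
	"fmt"
	"sync"
)

// runTask is a stand-in for the operator's real task implementations.
func runTask(name string) error {
	fmt.Println("running task Updating", name)
	return nil
}

// runAll runs the Prometheus Operator task first, then the remaining
// tasks in parallel, except that Prometheus waits on Grafana because
// the Grafana task creates resources the Prometheus task depends on.
func runAll() error {
	if err := runTask("Prometheus Operator"); err != nil {
		return err
	}

	grafanaDone := make(chan struct{})
	errs := make(chan error, 3)
	var wg sync.WaitGroup

	wg.Add(3)
	go func() { defer wg.Done(); errs <- runTask("Alertmanager") }()
	go func() {
		defer wg.Done()
		errs <- runTask("Grafana")
		close(grafanaDone)
	}()
	go func() {
		defer wg.Done()
		<-grafanaDone // Prometheus must run after Grafana.
		errs <- runTask("Prometheus-k8s")
	}()

	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := runAll(); err != nil {
		fmt.Println("sync failed:", err)
	}
}
```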
Just some more data points from the event log on why alertmanager was reluctant to start, especially alertmanager-main-2.
It seems the GCP persistent disks have quite some hiccups when mounting. From https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/events.json:
"firstTimestamp": "2018-09-03T13:39:44Z",
"lastTimestamp": "2018-09-03T13:39:46Z",
"message": "pod has unbound PersistentVolumeClaims (repeated 3 times)",
...
"firstTimestamp": "2018-09-03T13:39:46Z",
"lastTimestamp": "2018-09-03T13:39:46Z",
"message": "Successfully assigned openshift-monitoring/alertmanager-main-2 to ci-op-3nmhd3lm-eb354-ig-n-w1lk",
...
"firstTimestamp": "2018-09-03T13:39:54Z",
"lastTimestamp": "2018-09-03T13:39:54Z",
"message": "AttachVolume.Attach succeeded for volume \"pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\" ",
...
"firstTimestamp": "2018-09-03T13:41:49Z",
"lastTimestamp": "2018-09-03T14:02:13Z",
"message": "Unable to mount volumes for pod \"alertmanager-main-2_openshift-monitoring(d0f871a5-af7e-11e8-8cd1-42010a8e0005)\": timeout expired waiting for volumes to attach or mount for pod \"openshift-monitoring\"/\"alertmanager-main-2\". list of unmounted volumes=[alertmanager-main-db]. list of unattached volumes=[alertmanager-main-db config-volume secret-alertmanager-main-tls secret-alertmanager-main-proxy alertmanager-main-token-f24kl]",
...
"firstTimestamp": "2018-09-03T13:50:03Z",
"lastTimestamp": "2018-09-03T14:00:03Z",
"message": "MountVolume.WaitForAttach failed for volume \"pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\" : Could not find attached GCE PD \"kubernetes-dynamic-pvc-d0f130e7-af7e-11e8-8cd1-42010a8e0005\". Timeout waiting for mount paths to be created.",
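Pulling the mount-related events out of that events.json artifact takes only a few lines of stdlib Go. A sketch under the assumption that each item carries the `firstTimestamp`/`lastTimestamp`/`message` fields shown above; the sample data is inlined rather than downloaded:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// event mirrors the handful of v1.Event fields quoted in the excerpt above.
type event struct {
	FirstTimestamp string `json:"firstTimestamp"`
	LastTimestamp  string `json:"lastTimestamp"`
	Message        string `json:"message"`
}

// mountFailures returns the events whose message points at volume
// attach/mount problems, matching the two failure messages seen above.
func mountFailures(items []event) []event {
	var out []event
	for _, e := range items {
		if strings.Contains(e.Message, "Unable to mount volumes") ||
			strings.Contains(e.Message, "MountVolume.WaitForAttach failed") {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Inline sample shaped like the events.json excerpt; a real run
	// would decode the downloaded artifact instead.
	raw := `[
	 {"firstTimestamp":"2018-09-03T13:39:54Z","lastTimestamp":"2018-09-03T13:39:54Z",
	  "message":"AttachVolume.Attach succeeded for volume \"pvc-d0f130e7\" "},
	 {"firstTimestamp":"2018-09-03T13:41:49Z","lastTimestamp":"2018-09-03T14:02:13Z",
	  "message":"Unable to mount volumes for pod \"alertmanager-main-2_openshift-monitoring\""}
	]`
	var items []event
	if err := json.Unmarshal([]byte(raw), &items); err != nil {
		panic(err)
	}
	for _, e := range mountFailures(items) {
		fmt.Printf("%s .. %s  %s\n", e.FirstTimestamp, e.LastTimestamp, e.Message)
	}
}
```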
@smarterclayton just to see if our suspicion is correct, would it be possible to increase the timeout to see if it eventually deploys?
Yes good point.
@smarterclayton do you have any references to people in the storage team we can ping here? It seems there is not much we can do in the cluster-monitoring-operator itself.
Opened https://bugzilla.redhat.com/show_bug.cgi?id=1627547
Closing this out here, as it does not seem to be related to the cluster-monitoring-operator.