openshift / cluster-monitoring-operator Goto Github PK

Manage the OpenShift monitoring stack

License: Apache License 2.0

Makefile 0.60% Go 76.40% Shell 1.86% Dockerfile 0.07% Jsonnet 21.07%

cluster-monitoring-operator's Introduction

Cluster Monitoring Operator

The Cluster Monitoring Operator manages and updates the Prometheus-based cluster monitoring stack deployed on top of OpenShift.

It contains the following components:

Prometheus Operator
Prometheus
Alertmanager cluster for cluster and application level alerting
kube-state-metrics
node_exporter
prometheus-adapter
kubernetes-metrics-server only on TechPreviewNoUpgrade cluster

The deployed Prometheus instance (prometheus-k8s) is responsible for monitoring and alerting on cluster and OpenShift components; it should not be extended to monitor user applications. Users interested in leveraging Prometheus for application monitoring on OpenShift should consider enabling User Workload Monitoring to easily setup new Prometheus instances to monitor and alert on their applications.

Alertmanager is a cluster-global component for handling alerts generated by all Prometheus instances deployed in that cluster.

Adding new metrics to be sent via telemetry

To add new metrics to be sent via telemetry, simply add a selector that matches the time-series to be sent in manifests/0000_50_cluster-monitoring-operator_04-config.yaml.

Documentation on the data sent can be found in the data collection documentation.

Contributing

Please refer to the CONTRIBUTING.md document for information.

Release

Release checklist is available when creating new "Release Checklist" issue.

cluster-monitoring-operator's People

Contributors

Stargazers

Watchers

Forkers

brancz ironcladlou philips zbwright markjacksonfishing mxinden lukas-vlcek jhadvig dwdraju kyoto chancez jeremyeder simonpasquier chrisricci sosiouxme mrsiano tamalsaha sichvoge squat b43646 pillai-ashwin s-urbaniak fahlmant bparees aveshagarwal metalmatze openshift-cherrypick-robot pabrahamsson abhinavdahiya aditya-konarde deads2k florinpeter kierranm pgier domino7 smarterclayton umangachapagain rogbas ahadas adesso-as-a-service suicidesin maxdiorio keerthivel28 sjenning bostrt spawn1978 yaacov allen13 wking bysnupy healthpartners henrylv206 jim-minter spaparaju alecmerdler sogrand ghaliba3 derekwaynecarr spagno tschuy rafael-azevedo oip-rnd tsungming stlaz darthlukan paulfantom sabreoss madchr1st suneetb soltysh ravisantoshgudimetla runyontr mfojtik squeed restoreall srbrcs wanghaoran1988 mariadeanton rphillips charlesakalugwu ss75710541 jpabegg christopherl91 xujintao1996 jaormx lokesh0421 abyrne55 desaintmartin michalbolek lilic enxebre wmuha nationminu chambridge p0lyn0mial thpham cshinn benjaminapetersen danielbelenky cblecker

cluster-monitoring-operator's Issues

Use specific template for altertmanager

Hello, is there any way to use custom template for alertmanager?

.../opsgenie.tmpl:/etc/alertmanager/templates/opsgenie.tmpl:ro

and then using it

  - name: opsgenie
    opsgenie_configs:
      - api_key: ...
        send_resolved: true
        teams: SuperTeam
        tags: '{{ template "opsgenie.default.tags" .  }}'
        message: '{{ template "opsgenie.default.message" . }}'
        source: '{{ template "opsgenie.default.source" . }}'
        description: '{{ template "opsgenie.default.description" . }}'
        priority: '{{ template "opsgenie.default.priority_mapper" . }}'

Add Prometheus instance to monitor applications

Is there any documentation showing how to properly add a Prometheus instance dedicated to application while using the same alterting component used by the cluster Monitoring stack

how to change kube-rbac-proxy version for node-exporter

Hello Team,

We would like to know how to change the version for kube-rbac-proxy used by the node-exporter daemonset, as the current 0.3.1 version with OSO 3.11 allows SSLv3 and Tlsv1. kube-rbac-proxy 0.4.1 seems to have the fix for it. I tried changing the version in DaemonSet but it reverts back to older version.

Thanks,
Sreekanth

Change Version of image

Hi,
I try to change the version of my Prometheus image in a test environment.
Therefore I use the cluster-monitoring-config, but the version parameter has no effect.

...
prometheusK8s:
  baseImage: openshift/prometheus
  nodeSelector:
    node-role.kubernetes.io/infra: "true"
  externalLabels:
    cluster: s-openshift.mycompany.com
  version: v2.5.0
...

Is the version hard coded in cluster-monitoring-operator?

Thanks for your help in advance

Readme clarification

In the readme file it is stated that :
Users interested in leveraging Prometheus for application monitoring on OpenShift should consider using OLM to easily deploy a Prometheus Operator and setup new Prometheus instances to monitor and alert on their applications.
This sounds a bit confusing. Does it mean that in order to monitor application we have to create a new instance of Prometheus with the help of the « Prometheus Operator » provided by the « OpenShift cluster monitoring operator ». Or do we have to install a stand-alone « Prometheus operator » ?
If we can use Prometheus operator from cluster monitoring operator for application monitoring does it have to be in the same project (openshift-monitoring) or do we have to provision the app-prometheus in same namespace of the application ?

openshift/cluster-monitoring-operator on OKD 3.9

Hello,

 I would like to konw if anyone have an experience to install CMO on 3.9? I have a cluster OKD 3.11 using CMO but, i don if it works well on 3.9?

 Anyone Have already installed CMO on 3.9?

Thanks

Timezone problem with kube-state-metrics

Hi
I updated my cluster yesterday with openshift-ansible with this commit openshift/openshift-ansible@8c77207
This commit changed the timezone in api, controller and etcd.
kube-state-metrics pod is still in UTC timezone and I get exactly this issue: kubernetes/kube-state-metrics#500

What can I do? It is possible to set the timezone also in kube-state-metric pod?

Openshift-Version:

oc version
oc v3.11.0+b6db8e6-107
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-cp-lb-01.cloud.mycompany.de:443
openshift v3.11.0+d0c29df-98
kubernetes v1.11.0+d4cacc0

If you need more information let me know.

node-exporter does not come up on openshift e2e runs

I switched our prometheus e2e tests to use the cluster monitoring operator and I'm seeing some failures in about 1/4 runs. The most noticeable is that one run didn't have the node exporter installed (no pods created).

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/

/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/prometheus/prometheus.go:49
Expected
    <[]error | len:1, cap:1>: [
        {
            s: "no match for map[job:node-exporter] with health up and scrape URL ^https://.*/metrics$",
        },
    ]
to be empty
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/prometheus/prometheus.go:123

In this run the e2e tests start at 13:37, but the prometheus test isn't run until 13:45, which should be more than enough time for node-exporter to come up. I see no pods created, which implies either the daemonset wasn't created, or the daemonset failed massively. I see no events for the daemonset in
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/events.json which implies it didn't get created.

I see the following in the logs for prometheus operator (which seems bad) but nothing in cluster monitoring operator that is excessive.

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_cluster-monitoring-operator-5cf8fccc6-mdc92_cluster-monitoring-operator.log.gz

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/20830/pull-ci-origin-e2e-gcp/3161/artifacts/e2e-gcp/pods/openshift-monitoring_prometheus-operator-6c9fddd47f-mb4br_prometheus-operator.log.gz

W0903 14:01:42.210505       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:01:43.178075004Z caller=operator.go:732 component=prometheusoperator msg="sync prometheus" key=openshift-monitoring/k8s
W0903 14:01:43.178176       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:01:43.178307       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:01:43.196450       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:01:43.222385       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:01:43.222448       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:01:43.222295876Z caller=operator.go:732 component=prometheusoperator msg="sync prometheus" key=openshift-monitoring/k8s
W0903 14:01:43.240970       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:02:03.033696       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:02:03.033607297Z caller=operator.go:732 component=prometheusoperator msg="sync prometheus" key=openshift-monitoring/k8s
W0903 14:02:03.033767       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:02:03.048325       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:02:19.767518749Z caller=operator.go:396 component=alertmanageroperator msg="sync alertmanager" key=openshift-monitoring/main
W0903 14:02:45.489186       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
W0903 14:02:45.489268       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
level=info ts=2018-09-03T14:02:45.489057156Z caller=operator.go:732 component=prometheusoperator msg="sync prometheus" key=openshift-monitoring/k8s
W0903 14:02:45.504357       1 listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist

Grafana image fails with `ImageInspectError`

openshift-install v0.3.0-273-gc620a9bbe00e21c77a9b2047af5ae01c2c95acc5-dirty

cluster-monitoring-operator image registry.svc.ci.openshift.org/openshift/origin-v4.0-20181122090449@sha256:b3b006f44267099142b1b98d841aa1f14e5bf7001f65df58dda606b263aa905c

# oc get pods -n openshift-monitoring 
NAME                                           READY     STATUS              RESTARTS   AGE
cluster-monitoring-operator-5b98c6bff9-p7xlv   1/1       Running             0          9m
grafana-84d8fdb777-97h2h                       0/2       ImageInspectError   0          8m
prometheus-operator-5bfd54f894-x2924           1/1       Running             0          9m

# oc get pod -o yaml grafana-84d8fdb777-97h2h -n openshift-monitoring
...
    state:
      waiting:
        message: 'Failed to inspect image "grafana/grafana:5.2.4": rpc error: code
          = Unknown desc = no registries configured while trying to pull an unqualified
          image'
        reason: ImageInspectError
...
    state:
      waiting:
        message: 'Failed to inspect image "openshift/oauth-proxy:v1.1.0": rpc error:
          code = Unknown desc = no registries configured while trying to pull an unqualified
          image'
        reason: ImageInspectError

Prometheus CR not created by cluster-monitoring-operator

Noticed on a 4.0 cluster today

$ oc project
Using project "openshift-monitoring" on server ...

$ oc get pod
NAME                                         READY     STATUS    RESTARTS   AGE
cluster-monitoring-operator-9477c48c-cl4rn   1/1       Running   0          10m
grafana-84d8fdb777-8nnvk                     1/2       Running   0          25m
prometheus-operator-5bfd54f894-tm9ff         1/1       Running   0          14m

$ oc get prometheuses
No resources found.

@pgier 

fyi @RobertKrawitz @rphillips

How to get pod labels or annontations using servicemonitor?

I have a Prometheus Operator working fine and scrape service label using targetLabels. I would like to know, how can i get labels or annotations from my pods.

Thanks

Improvement for Alert KubeletTooManyPods?

Currently a fix value of 250 is used for the maximum number of pods per node.
But Kubernetes allows to set it per node.

This alert rule could be made a bit more cluster specific via :

max (kubelet_running_pod_count{job="kubelet"}) by (instance)   < on () group_left() max(kube_node_status_capacity_pods{job="kube-state-metrics"}) * 0.9

where kube_node_status_capacity_pods is the number of allowed pods per node from job kube-state-metrics.
Note: the value of kube_node_status_capacity_pods is typically lower for master and infra nodes.

This rule unfortunately cannot be defined on the exact limit per node as the label "node" from the "kube-state-metrics" is not available for metrics from other jobs.
I only found that the node IP and node name can be matched by using relabel_configs from kubernetes_sd_configs with role "node" for kubelet.
Then the node label can be added from the node label "__meta_kubernetes_node_name". The node IP can be extracted from label "address" or label "instance".

Kind Regards,
Ulrike

Reconfigure default setup of Cluster Monitoring Operator

Is it possible to reconfigure default setup of Cluster Monitoring Operator? How to disable resetting setup to default state?
Documentation is a little bit misleading in this, I already described details here: openshift/openshift-docs#12500

Example: I would like to install Cluster Monitoring Operator with OpenShift-Ansible, turn off resetting mechanism and patch configuration (e.g. ServiceMonitorSelector, ServiceMonitorNamespaceSelector) as additional, post-installation tasks to provide applications monitoring in scope of Openshift-Monitoring stack with Ansible playbook.

Any documentation describing customised setups, or any other correct way of providing applications monitoring with the usage of openshift-monitoring playbook will be greatly appreciated as well.

node-exporter can't be configured to tolerate a taint

CMO deployed on OpenShift 3.11:

I have some dedicated nodes that I have tainted for certain pods, but there is no way to add a toleration for the node-exporter (that I can find at least) and the pods are now miss-scheduled on the tainted nodes, and I assume would not be placed back on the nodes should they be deleted, and I'm guessing the operator would revert any changes I make to the daemonset.

Ensure rslave mount propagation in node-exporter

Kubernetes has reverted HostToContainer as the default mount propagation strategy, we now need to make sure we explicitly use this strategy in the node-exporter, otherwise filesystem metrics will be off.

/cc @derekwaynecarr @ironcladlou @elad661 @mxinden

Future Release Branches Frozen For Merging | branch:release-4.5 branch:release-4.4

The following branches are being fast-forwarded from the current development branch (master) as placeholders for future releases. No merging is allowed into these release branches until they are unfrozen for production release.

release-4.5
release-4.4

Contact the Test Platform or Automated Release teams for more information.

Update Kubelet label to use GA label

I took a stab at this in #370 but it looks like the generator is producing the old label (beta.kubernetes.io/os: linux). The new label should be used kubernetes.io/os: linux.

ref: https://jira.coreos.com/browse/POD-114

Unable to run `make generate`

When running make generate fails with following issue:

$ make generate
docker build -t tpo-generate -f Dockerfile.generate .
Sending build context to Docker daemon 37.55 MB
Step 1/2 : FROM golang:1.9.2
---> 138bd936fa29
Step 2/2 : RUN apt-get update && apt-get install -y python-yaml
---> Using cache
---> e0f3c4ea011e
Successfully built e0f3c4ea011e
docker run --rm -v pwd:/go/src/github.com/openshift/cluster-monitoring-operator -w /go/src/github.com/openshift/cluster-monitoring-operator tpo-generate make merge-cluster-roles assets docs
make: stat: Makefile: Permission denied
make: *** No rule to make target 'merge-cluster-roles'. Stop.
make: *** [Makefile:48: generate] Error 2

Not really sure how to solve it.

Heketi / GlusterFS metrics in Prometheus

Hi!,

We are in the progress to migrating our clusters to the 3.11 release and we are wondering if it is possible to enable the Heketi / GlusterFS metrics in the Prometheus set-up? We tried the annotate method as mentioned in a blog article on the Red Hat Storage blog[1] but this is not working

[1] https://redhatstorage.redhat.com/2018/09/14/improved-volume-management-for-red-hat-openshift-container-storage-3-10/

Roles in Grafana

By default, it seems even a user with cluster-admin is not an admin in Grafana, and can not set up things like Alerting. In digging around, it seems there's no way to grant users these roles. Is this intentional or am I perhaps missing a setting somewhere in the operator?

Dead Man's Snitch documentation needs work

The documentation on how to connect Prometheus to a Dead Man's Snitch, with and without Pagerduty, is incomplete:

https://github.com/openshift/cluster-monitoring-operator/blob/9c97908591f4b4fc3b12251c4b6ab0d6289fdbc0/Documentation/user-guides/configuring-prometheus-alertmanager.md

What it should do is mention that you need to set up a generic webhook receiver, direct deadmansswitch alerts to that, and create an integration from Dead Man's Snitch to PagerDuty. Instead, they indicate that you should fire all events to PagerDuty directly.

Maybe I'm wrong, though.

haproxy router monitoring

The default HAProxy OpenShift router exposes prometheus metrics that should be scraped by the prometheus cluster monitoring.

Why are there 3 alertmanager pods?

Is there a technical reason that we require three alertmanager pods?

cluster-monitoring-operator/assets/alertmanager/alertmanager.yaml

Line 62 in dc3bfa1

replicas: 3

jeder@desktop: ~ $ oc get pods -n openshift-monitoring|egrep -i 'alert|name'
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            3/3     Running   0          46h
alertmanager-main-1                            3/3     Running   0          46h
alertmanager-main-2                            3/3     Running   0          46h

jeder@desktop: ~ $ oc version -o yaml                                                                                                                                                         
clientVersion:
  buildDate: "2019-04-02T17:55:35Z"
  compiler: ""
  gitCommit: acd551fb5
  gitTreeState: ""
  gitVersion: v4.0.22
  goVersion: ""
  major: "4"
  minor: 0+
  platform: ""
serverVersion:
  buildDate: "2019-04-02T23:08:07Z"
  compiler: gc
  gitCommit: 6d43744
  gitTreeState: clean
  gitVersion: v1.13.4+6d43744
  goVersion: go1.10.3
  major: "1"
  minor: 13+
  platform: linux/amd64

jeder@desktop: ~ $ oc get clusteroperators|egrep -i 'progress|monitor'                                                                                                                        
NAME                                 VERSION                           AVAILABLE   PROGRESSING   FAILING   SINCE                                                                              
monitoring                           4.0.0-0.alpha-2019-04-03-121923   True        False         False     2d3h

Update CMO to newer release

Because of company policy I need to run kube-rbac-proxy container in node-exporter with stronger ciphers suite. As the CMO is immutable we are not able to add args in node-exporter daemonset but I saw that newer release (3.11) has already these ciphers in args
--tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256

My question, is there any options to update CMO to release-3.11?

EtcdHighNumberOfFailedGRPCRequests

I enabled "cluster monitoring operator" on several OKD 3.11 clusters. Everything is working fine except for ETCD monitoring. I followed the documentation to enable etcd monitoring. It's seems to work : most of the checks are green, except the "EtcdHighNumberOfFailedGRPCRequests" which is always triggered (etcd cluster is working correctly). Do I miss something or is there any know issue while enabling etcd cluster monitoring ?

Add a "too many cluster LIST calls" alert

A common naive integration pattern is to LIST all of a type of resource (especially CRDs), process them, and then sleep in a loop. This works for low cardinality resources but eventually could bring down an API server.

https://bugzilla.redhat.com/show_bug.cgi?id=1609862 is an example of a case. Prow has an inefficient LIST poller (will be switched to informer soon). As number of resources grew, the CPU use grew until it caused the server to crash out. Symptoms were CPU saturation (which we have an alert for), but root cause is rate of resources listed.

We should consider alerting when the rate of objects or bytes returned by a list call reaches a threshold proportional to the use of the cluster. I would suggest doing it per resource - if rate LIST per resource * number of objects in resource (or bytes of response as a proxy) > threshold.

operator status transitions to Progressing every 5 minutes on idle cluster

@smarterclayton

Allow -skip-auth-regex to be customized

In our environment, we scrape the data from each prometheus server running inside an OpenShift cluster and federate to an external data store. This is done using prometheus federation.

In my use case, i want to change skip auth regex to ^/(metrics|federate)

In this operator, I would like to be able to customize the -skip-auth-regex in the configmap file. Is this something that would be supported by this operator? Would you be open to a PR to resolve this?

scrape OCP router end point?

Till 3.10 with Prometheus, I was able to scrape the OpenShift router endpoints. Is there any way to achieve this with OCP 3.11?

ability to specify a custom fqdn for routes

The routes do not allow you to put a custom FQDN for prometheus-k8s, alertmanager, or grafana.

I don't know enough go to know how to do this, plus I've never seen routes created from inside a pod. Maybe if it was a template and used an environmental variable it would work? Maybe it can be set in the configmap somehow? If the feature is added I can write the ansible code so it can be set in openshift-ansibile the same way you can set openshift_logging_kibana_hostname and openshift_metrics_hawkular_hostname .

Why do my changes to the prometheus rules only work in a short time?

I edited the prometheus-k8s-rules prometheus rule,then I save it.
But after tens of seconds,the prometheus-k8s-rules prometheus rule became what it was before.

before my change

# oc get prometheusrules prometheus-k8s-rules -o yaml | grep -A 3 -B 5 kubelet_running_pod
    - alert: KubeletTooManyPods
      annotations:
        message: Kubelet {{$labels.instance}} is running {{$value}} pods, close to
          the limit of 110.
      expr: |
        kubelet_running_pod_count{job="kubelet"} > 100
      for: 15m
      labels:
        severity: warning

after my change

# oc get prometheusrules prometheus-k8s-rules -o yaml | grep -A 3 -B 5 kubelet_running_pod
    - alert: KubeletTooManyPods
      annotations:
        message: Kubelet {{$labels.instance}} is running {{$value}} pods, close to
          the limit of 110.
      expr: |
        kubelet_running_pod_count{job="kubelet"} > 90
      for: 15m
      labels:
        severity: warning

# oc get cm prometheus-k8s-rulefiles-0 -o yaml | grep -A 3 -B 5 kubelet_running_pod
      - alert: KubeletTooManyPods
        annotations:
          message: Kubelet {{$labels.instance}} is running {{$value}} pods, close to the
            limit of 110.
        expr: |
          kubelet_running_pod_count{job="kubelet"} > 90
        for: 15m
        labels:
          severity: warning

after tens of seconds

# oc get prometheusrules prometheus-k8s-rules -o yaml | grep -A 3 -B 5 kubelet_running_pod
    - alert: KubeletTooManyPods
      annotations:
        message: Kubelet {{$labels.instance}} is running {{$value}} pods, close to
          the limit of 110.
      expr: |
        kubelet_running_pod_count{job="kubelet"} > 100
      for: 15m
      labels:
        severity: warning

Environment

# oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://lb.example.com:8443
openshift v3.11.0+84fe0ad-23
kubernetes v1.11.0+d4cacc0

Prometheus Operator version:

# docker images | grep prometheus-operator
quay.io/coreos/prometheus-operator             v0.23.2             835a7e260b35        2 months ago        47 MB

Restrict kube-prometheus `namespaceSelector`

The current namespaceSelector in main.jsonnet (openshift.*|kube.*|default|logging) allows e.g. kubeless.

Origin: https://bugzilla.redhat.com/show_bug.cgi?id=1634303

Master disappearance is not detected

When a master disappear completely, no alert is triggered.

It seems that Prometheus loose tracks of the metrics related to the deleted master ( up{job="apiserver"}, up{job="etcd"}, up{job="kube-controllers"},... no longer list the deleted instance). All remaining instances are up and thus no alert is triggered.

"kube_cronjob_next_schedule_time" is no data

I've found the following alert rules from prometheus-k8s-rulefiles-0 ConfigMap.

      - alert: KubeCronJobRunning
        annotations:
          message: CronJob {{ $labels.namespaces }}/{{ $labels.cronjob }} is taking more
            than 1h to complete.
        expr: |
          time() - kube_cronjob_next_schedule_time{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics"} > 3600
        for: 1h
        labels:
          severity: warning

My issues is kube_cronjob_net_scedule_time I queried is no data , yeah this metrics can not find from Prometheus dashboard. Even though I defined and run some CronJob in my cluster. kube_cronjob_next_schedule_time is provided by kube-state-metrics, but it seems the metric could not get the value. If any requirement for getting the metrics, let me know that please.

Thanks.

scrape an app via ServiceMonitor?

I would like to scrape an app sinker which provides metrics already by svc and endpoints.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-10-182914   True        False         166m    Cluster version is 4.0.0-0.nightly-2019-04-10-182914

$ oc get svc -n ci sinker -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2019-04-14T18:47:51Z"
  labels:
    app: prow
    component: sinker
  name: sinker
  namespace: ci
  resourceVersion: "81553"
  selfLink: /api/v1/namespaces/ci/services/sinker
  uid: cea189e9-5ee5-11e9-94da-06630a503198
spec:
  clusterIP: 172.30.138.111
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    name: sinker
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

$ oc get endpoints -n ci sinker -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2019-04-14T18:47:51Z"
  labels:
    app: prow
    component: sinker
  name: sinker
  namespace: ci
  resourceVersion: "81700"
  selfLink: /api/v1/namespaces/ci/endpoints/sinker
  uid: cea41cd8-5ee5-11e9-9889-02823e474622
subsets:
- addresses:
  - ip: 10.131.0.92
    nodeName: ip-10-0-134-23.us-east-2.compute.internal
    targetRef:
      kind: Pod
      name: sinker-1-6x8qk
      namespace: ci
      resourceVersion: "81699"
      uid: d3e73684-5ee5-11e9-9889-02823e474622
  ports:
  - name: http
    port: 8080
    protocol: TCP

$ oc get servicemonitors.monitoring.coreos.com sinker -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2019-04-14T18:53:14Z"
  generation: 1
  labels:
    app: prow
    component: sinker
  name: sinker
  namespace: openshift-monitoring
  resourceVersion: "84091"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/servicemonitors/sinker
  uid: 8eb3db6f-5ee6-11e9-94da-06630a503198
spec:
  endpoints:
  - interval: 30s
    port: http
    scheme: http
  namespaceSelector:
    matchNames:
    - ci
  selector:
    matchLabels:
      app: prow
      component: sinker

After adding the above ServiceMonitor sinker, i do not see any new target on the prometheus UI.

The existing ServiceMonitor cluster-version-operator has a similar setup and it works fine.
I must have missed some steps. Can anyone help me point them out?
Thanks.

$ oc get servicemonitors.monitoring.coreos.com cluster-version-operator -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2019-04-14T16:29:22Z"
  generation: 1
  labels:
    k8s-app: cluster-version-operator
  name: cluster-version-operator
  namespace: openshift-monitoring
  resourceVersion: "14145"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/openshift-monitoring/servicemonitors/cluster-version-operator
  uid: 75af2801-5ed2-11e9-a49e-02823e474622
spec:
  endpoints:
  - interval: 30s
    port: metrics
    scheme: http
  namespaceSelector:
    matchNames:
    - openshift-cluster-version
  selector:
    matchLabels:
      k8s-app: cluster-version-operator

How to monitor custom metrics okd 3.11 / Problem to add another target

Hello All!

I have a okd 3.11 installed with full stack of monitoring running perfect.

I would like to monitor custom metrics from my pods. I have a pod that exposes /prometheus path with some metrics and i would like to collect them to use to monitor and alarm.

I have configured a ServiceMonitor but without success. If i go to "Configuration" options on Prometheus web interface, i can see my custom configurations from my service monitor, but on " Targets" options, i cannot see my target there.

Can anyone help me? Thanks!

If you need some more details about the problem, please let me know.

OSX mapfile command not found

./hack/build-jsonnet.sh

set -o pipefail
prefix=assets
rm -rf assets
mkdir assets
rm -rf tmp
mkdir tmp
jsonnet -J jsonnet/vendor jsonnet/main.jsonnet
mapfile -t files
++ jq -r 'keys[]' tmp/main.json
./hack/build-jsonnet.sh: line 16: mapfile: command not found

how it build in mac ox ?

Prometheus not able to scrape Kubelet

I have the cluster monitoring operator running in Minishift (minishift v1.24.0+8a904d0), but Prometheus fails to scrape the Kubelet with the following errors:

Get https://10.0.2.15:10250/metrics: x509: certificate is valid for 127.0.0.1, not 10.0.2.15
Get https://10.0.2.15:10250/metrics/cadvisor: x509: certificate is valid for 127.0.0.1, not 10.0.2.15

Is this a Minishift problem?

Unable to get prometheus Data From Grafana

I'm unable to get any data from Grafana. I have deployed the manifest files for promtheus operator and grafana from your repository. However it seems that the prometheus datasource created on grafana does not allow to connect on prometheus which is protected by an oauth-proxy.

etcd monitoring

In OpenShift 3.10 etcd is a static pod, so we can use the means available to discover it like any other workload through the Kubernetes API.

The difficult thing is to authenticate and authorize against etcd, currently etcd is only accessible via certificate, which only the apiserver has.

I propose a similar solution as we were going to ship with UT2 of Tectonic. A separate instance of Prometheus entirely dedicated to monitoring etcd. To secure it, etcd authorization is enabled and a certificate is supplied to apiserver that has access to key-space and the etcd-Prometheus with one that does not - still leaving it with access to the /metrics endpoint. Additionally in 3.11 we may be able to ship etcd 3.3 which allows segregating this network-wise as well as etcd allows exposing a port purely for exposing metrics.

/cc @smarterclayton @mxinden @elad661 @ironcladlou

oc 3.10 / Mac OS X, prometheus-operator won't come up

I'm trying to get the cluster-monitoring-operator working on openshift 3.10 - I'm on Mac OS X and I'm using

lilguylaptop:~ cunningt$ ~/bin/v3.10/oc version
oc v3.10.0+dd10d17
kubernetes v1.10.0+b81c8f8
features: Basic-Auth

Server https://127.0.0.1:8443
openshift v3.10.0+7eee6f8-2
kubernetes v1.10.0+b81c8f8

When I use scripts/deploy-on-openshift.sh to install the cluster-monitoring-operator, it looks like the cluster-monitoring-operator comes up okay, but the prometheus-operator does not :

lilguylaptop:~ cunningt$ ~/bin/v3.10/oc get pods
NAME READY STATUS RESTARTS AGE
cluster-monitoring-operator-5c7cb9d65-vbqwv 1/1 Running 0 10m
prometheus-operator-558f555f45-d6wj4 0/1 CrashLoopBackOff 6 10m

Attached is the log for the prometheus-operator.

prometheus-operator-558f555f45-d6wj4.log

KubePodNotReady is being fired for pods with restartPolicy=RestartNever

Seeing this on api.ci but a bunch of job pods are firing this alert but that is not valid. Most job pods never go ready.

Pods without RestartAlways restartPolicy shouldn't be part of KubePodNotReady.

Duplicate RBAC rules in cluster-monitoring-operator-role.yaml

I'm guessing something is up with the manifest generation, but there are duplicate RBAC rules in https://github.com/openshift/cluster-monitoring-operator/blob/9a3ff15b5784580f11c9fe04e077e2524da2dc56/manifests/cluster-monitoring-operator-role.yaml#L40:15

binary assets for testing

Do we really want to keep hardcoded binaries for testing?
It may be better to have some other approach for assertion like make sure the yaml includes certain objects like templating raws etc ?!

currently, I've change some things at the file travis fails..

Telemeter deployment references missing key `id` in secret

The deployment under assets/telemeter-client/deployment.yaml references the key id in the secret assets/telemeter-client/secret.yaml which does not exist, leading into a non-functional deployment.

Customizing container args

Is there currently an easier way to add custom arguments to the monitoring stack's components? For example, I'm currently using oc patch as a hacky workaround to add the -request-logging=true argument to the oauth-proxy container, but if there are multiple arguments to update, across multiple containers, that quickly becomes difficult to manage. In addition, due to the reconciliation process, I'm not sure how I can modify other fields that aren't directly exposed (e.g. replicas). It looks like some configuration options are available (https://github.com/openshift/cluster-monitoring-operator/blob/master/Documentation/user-guides/configuring-cluster-monitoring.md#reference), but unfortunately that isn't sufficient for my use case.

DiskRunningFull is imprecise

https://prometheus-k8s-openshift-monitoring.svc.ci.openshift.org/graph?g0.expr=ALERTS%7Balertname%3D%22NodeDiskRunningFull%22%7D&g0.tab=1

produces results that don't look correct:

ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="/dev/sda1",endpoint="https",fstype="xfs",instance="10.142.0.13:9100",job="node-exporter",mountpoint="/run/secrets",namespace="openshift-monitoring",pod="node-exporter-6xzlm",service="node-exporter",severity="warning"} | 1
-- | --
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="/dev/sda1",endpoint="https",fstype="xfs",instance="10.142.0.15:9100",job="node-exporter",mountpoint="/run/secrets",namespace="openshift-monitoring",pod="node-exporter-lgqz2",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="/dev/sda1",endpoint="https",fstype="xfs",instance="10.142.0.16:9100",job="node-exporter",mountpoint="/run/secrets",namespace="openshift-monitoring",pod="node-exporter-5c55w",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="/dev/sda1",endpoint="https",fstype="xfs",instance="10.142.0.17:9100",job="node-exporter",mountpoint="/run/secrets",namespace="openshift-monitoring",pod="node-exporter-2722t",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="overlay",endpoint="https",fstype="overlay",instance="10.142.0.13:9100",job="node-exporter",mountpoint="/",namespace="openshift-monitoring",pod="node-exporter-6xzlm",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="overlay",endpoint="https",fstype="overlay",instance="10.142.0.15:9100",job="node-exporter",mountpoint="/",namespace="openshift-monitoring",pod="node-exporter-lgqz2",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="overlay",endpoint="https",fstype="overlay",instance="10.142.0.16:9100",job="node-exporter",mountpoint="/",namespace="openshift-monitoring",pod="node-exporter-5c55w",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="overlay",endpoint="https",fstype="overlay",instance="10.142.0.17:9100",job="node-exporter",mountpoint="/",namespace="openshift-monitoring",pod="node-exporter-2722t",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="rootfs",endpoint="https",fstype="rootfs",instance="10.142.0.13:9100",job="node-exporter",mountpoint="/",namespace="openshift-monitoring",pod="node-exporter-6xzlm",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="rootfs",endpoint="https",fstype="rootfs",instance="10.142.0.15:9100",job="node-exporter",mountpoint="/",namespace="openshift-monitoring",pod="node-exporter-lgqz2",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="rootfs",endpoint="https",fstype="rootfs",instance="10.142.0.16:9100",job="node-exporter",mountpoint="/",namespace="openshift-monitoring",pod="node-exporter-5c55w",service="node-exporter",severity="warning"} | 1
ALERTS{alertname="NodeDiskRunningFull",alertstate="firing",device="rootfs",endpoint="https",fstype="rootfs",instance="10.142.0.17:9100",job="node-exporter",mountpoint="/",namespace="openshift-monitoring",pod="node-exporter-2722t",service="node-exporter",severity="warning"} | 1

/run/secrets may be the RHEL container runtime injection of the mount secrets, but probably shouldn't be. @mrunalp I would not have expected the magic RHEL mount to be using the regular disk - is there anything cheap and easy we could do here to screen that out (by using a tmpfs read only mount, maybe)?

Seeing both the overlay and rootfs warnings is a bit wierd as well - I can understand why that may happen, but preferably it would not.

Finally, in all these cases our disk use is below the GC threshold, so the alerts are firing because the nodes are under active use (i.e. we're downloading and cleaning up images constantly). The warning is not accurate in a case like this because they'll always be firing for these nodes. It's useful to know that we're on track to run out minus the image storage... but not sure how to represent that easily.

Disable Alerting rules on prometheus-k8s instance

Is there any way to disable some rules on the "prometheus-k8s" instance. I already tried to modify the prmetheusrule "prometheus-k8s-rules", which work for some seconds but it is overwrited with the former configuration probably by the prometheus operator.

Where are the web interfaces exposed?

I have the operator up and running in minishift after using the deploy-on-openshift script, but I am having trouble understanding where I can reach the prometheus, alertmanager and grafana dashboards. According to the docs, they should be available at the $CLUSTER_DNS, but where exactly is this?

can we create a default serviceaccount with view permissions on the openshift-monitoring namespace?

For use with grafana in our lab. Does it make sense to have this created by the installer?