stackdriver-prometheus-sidecar's People

Contributors

bianpengyuan, bmoyles0117, couragej, davidbtucker, fabxc, fifiug, googkamaljit, igorpeshansky, jkohen, jkschulz, knyar, mans0954, markmandel, mju, msukruth, n-oden, nmdayton, pamir, polycaster, qingling128, stackdriver-instrumentation-release, stevenycchou

stackdriver-prometheus-sidecar's Issues

Recorded metrics are not sent to Stackdriver

I have the following recording rule:

- name: custom_group.rules
  rules:
  - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY (instance, job)
    record: instance:node_cpu_custom:rate:sum

The rule is evaluated and recorded correctly:

Element	Value
instance:node_cpu_custom:rate:sum{instance="172.17.189.157:9100",job="node-exporter"}	0.6291333333333333
instance:node_cpu_custom:rate:sum{instance="172.17.189.178:9100",job="node-exporter"}	1.153220537845439

I have also the following config file:

static_metadata:
     - metric: instance:node_cpu_custom:rate:sum
       type: gauge
       help: node cpu ratio

No metric appears in Stackdriver, and there are no errors in the stackdriver-prometheus-sidecar logs.
This is just one example; all other recorded metrics are also not sent to Stackdriver, while raw metrics are exported to Stackdriver as they should be.

Uploading GB of metrics in short time

Before we begin, it is important to note that I made this attempt without being aware that what I had come to know as Stackdriver Monitoring had become Legacy Stackdriver.

In retrospect, I realize I may have too quickly skimmed the instructions that clearly said:

You cannot configure clusters using Legacy Stackdriver with Prometheus.

I assumed, in error, that I did have Stackdriver Kubernetes Engine Monitoring enabled in my GKE cluster when I went about following the instructions here: https://cloud.google.com/monitoring/kubernetes-engine/prometheus

Additionally, I took a chance to see if I could get stackdriver-prometheus-sidecar working with our deployment of prometheus-operator version 0.29.0. Specifically, I deployed Helm chart version 5.7.0, which deploys Prometheus version 2.9.1, a version that isn't listed in the compatibility matrix.

This was the container template:

      containers:
      - name: sidecar
        image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:{{ .Values.sidecar.tag }}
        args:
        - "--stackdriver.project-id={{ .Values.cluster.project }}"
        - "--prometheus.wal-directory=/prometheus/wal"
        - "--stackdriver.kubernetes.location={{ .Values.cluster.region }}"
        - "--stackdriver.kubernetes.cluster-name={{ .Values.cluster.name }}"
        ports:
        - name: sidecar
          containerPort: 9091
        volumeMounts:
        - name: prometheus-prometheus-operator-prometheus-db
          mountPath: /prometheus

Despite all of this, I was able to get prometheus-operator configured to deploy the stackdriver-prometheus-sidecar alongside its Prometheus deployments. These deployments were sending collected metrics to Stackdriver as expected; they were clearly showing up in the Stackdriver Metrics Explorer, and I assumed all was well.

It is important to note that this is a small cluster which reports very little in the way of metrics and only a small group of custom metrics.

(attached screenshot: metrics)

When I went to check the billing, I was surprised to find that it had uploaded almost 10 GB of metrics in such a short time.

I was hoping it might be clear what the root cause is.

  • Is it because it was running on Stackdriver Legacy?
  • Is it because of an incompatible version of Prometheus?
  • Is it because of the Prometheus configuration?

Thanks for your time.

Sidecar stopped submitting stats to StackDriver after prometheus eviction on GKE

We had stackdriver-prometheus-sidecar working on our GKE cluster without issues for a few days, until the Prometheus pod where the server and this sidecar resided was evicted. The pod was relocated automatically by the scheduler and Prometheus continued collecting data without issues. However, the sidecar stopped submitting data to Stackdriver. We can manually query the data from Prometheus without issues; there's no discernible missing data.

Versions:
prom/prometheus:v2.6.1
gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:release-0.3.1

Prometheus Logs:

level=info ts=2019-01-26T15:28:19.524238548Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.1, branch=HEAD, revision=b639fe140c1f71b2cbad3fc322b17efe60839e7e)"
level=info ts=2019-01-26T15:28:19.524313035Z caller=main.go:244 build_context="(go=go1.11.4, user=root@4c0e286fe2b3, date=20190115-19:12:04)"
level=info ts=2019-01-26T15:28:19.524350038Z caller=main.go:245 host_details="(Linux 4.15.0-1019-gcp #20-Ubuntu SMP Wed Aug 29 09:24:47 UTC 2018 x86_64 prometheus-server-c64fd76c-msmv5 (none))"
level=info ts=2019-01-26T15:28:19.524380622Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-01-26T15:28:19.524485977Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-01-26T15:28:19.525637533Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2019-01-26T15:28:19.527003933Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-01-26T15:28:19.574606943Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548244800000 maxt=1548266400000 ulid=01D1Y8722N7PQF05HFZWCSFMK2
level=info ts=2019-01-26T15:28:19.599054445Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548266400000 maxt=1548331200000 ulid=01D2060M2RPVQJBJ53YRAQNRV6
level=info ts=2019-01-26T15:28:19.601827225Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548331200000 maxt=1548396000000 ulid=01D223SQDT2MRJR8E8XMKXJ3BE
level=info ts=2019-01-26T15:28:19.621762297Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548396000000 maxt=1548460800000 ulid=01D241K5Z5ZM6GHAJ6QFZX8PJY
level=info ts=2019-01-26T15:28:19.622993701Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548460800000 maxt=1548482400000 ulid=01D24P696MFWW21CJZJGNEC14Q
level=info ts=2019-01-26T15:28:19.635275773Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548504000000 maxt=1548511200000 ulid=01D25ASCGQ3N7HAHP75Q15SRVX
level=info ts=2019-01-26T15:28:19.636098417Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1548482400000 maxt=1548504000000 ulid=01D25ASDY6Y1S5K0AJZSCYB6GR
level=info ts=2019-01-26T15:28:39.425314438Z caller=main.go:571 msg="TSDB started"
level=info ts=2019-01-26T15:28:39.425436903Z caller=main.go:631 msg="Loading configuration file" filename=/etc/config/prometheus.yml
level=info ts=2019-01-26T15:28:39.683670989Z caller=kubernetes.go:201 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-26T15:28:39.734034519Z caller=kubernetes.go:201 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-26T15:28:39.735718748Z caller=main.go:657 msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml
level=info ts=2019-01-26T15:28:39.735774879Z caller=main.go:530 msg="Server is ready to receive web requests."

Sidecar Logs

level=info ts=2019-01-26T15:28:18.633697598Z caller=main.go:256 msg="Starting Stackdriver Prometheus sidecar" version="(version=0.3.1, branch=release-0.3.1, revision=12aa811802effd7a93b6b7d83d72bf381f1179b9)"
level=info ts=2019-01-26T15:28:18.633825024Z caller=main.go:257 build_context="(go=go1.11, [email protected], date=20190103-20:48:32)"
level=info ts=2019-01-26T15:28:18.633857706Z caller=main.go:258 host_details="(Linux 4.15.0-1019-gcp #20-Ubuntu SMP Wed Aug 29 09:24:47 UTC 2018 x86_64 prometheus-server-c64fd76c-msmv5 (none))"
level=info ts=2019-01-26T15:28:18.633878294Z caller=main.go:259 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-01-26T15:28:18.646053105Z caller=main.go:463 msg="Web server started"
level=info ts=2019-01-26T15:28:18.646116703Z caller=main.go:444 msg="Stackdriver client started"
level=warn ts=2019-01-26T15:28:21.653132839Z caller=main.go:525 msg="Prometheus not ready" status="503 Service Unavailable"
level=warn ts=2019-01-26T15:28:24.652898547Z caller=main.go:525 msg="Prometheus not ready" status="503 Service Unavailable"
level=warn ts=2019-01-26T15:28:27.651712449Z caller=main.go:525 msg="Prometheus not ready" status="503 Service Unavailable"
level=warn ts=2019-01-26T15:28:30.652042511Z caller=main.go:525 msg="Prometheus not ready" status="503 Service Unavailable"
level=warn ts=2019-01-26T15:28:33.652120435Z caller=main.go:525 msg="Prometheus not ready" status="503 Service Unavailable"
level=warn ts=2019-01-26T15:28:36.652012708Z caller=main.go:525 msg="Prometheus not ready" status="503 Service Unavailable"
level=warn ts=2019-01-26T15:28:39.652032741Z caller=main.go:525 msg="Prometheus not ready" status="503 Service Unavailable"
level=info ts=2019-01-26T15:29:42.652116635Z caller=manager.go:150 component="Prometheus reader" msg="Starting Prometheus reader..."

We can see through docker stats that the CPU activity on prometheus and the sidecar is around 800m each. If we remove the sidecar, Prometheus drops to 5-10m.

I'm guessing that this may be related to the repair.go messages in Prometheus' logs, so we let the sidecar keep working. However, after a few hours there's still no progress and no new logs. Our data dir is around 600 MB; it shouldn't take that long, so it seems to be stuck.

We have added --log.level=debug to the sidecar, but there's no difference except during the shutdown phase, which shows a "Stopped resharding":

level=warn ts=2019-01-26T16:39:23.977263853Z caller=main.go:392 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2019-01-26T16:39:23.977387876Z caller=main.go:432 msg="Stopping Prometheus reader..."
level=info ts=2019-01-26T16:39:23.97741396Z caller=queue_manager.go:233 component=queue_manager msg="Stopping remote storage..."
level=debug ts=2019-01-26T16:39:23.977513514Z caller=queue_manager.go:441 component=queue_manager msg="Stopped resharding"
level=info ts=2019-01-26T16:39:23.977599116Z caller=queue_manager.go:241 component=queue_manager msg="Remote storage stopped."

[suggestion] specify necessary IAM roles in documentation

I had an issue where metrics weren't showing up in Monitoring and I saw the following error in the sidecar logs being printed continuously:

level=warn ts=2019-07-30T05:13:17.12753024Z caller=queue_manager.go:546 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist)."

It was fixed by creating an IAM policy binding from the service account associated with the cluster to the role roles/monitoring.metricWriter.

It might be helpful to explicitly state somewhere in the documentation that this permission is required.
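
For reference, the binding can be created with something along these lines (the project ID and service-account email are placeholders):

# Grant the cluster's service account permission to write time series.
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-cluster-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/monitoring.metricWriter"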

Filter not working

I'm setting the filter as follows:

        - '--filter=__name__="consumer_group_backlog_avg_10m"'

However, when I trace the sidecar logs, it stays here and doesn't seem to forward any metrics:

level=info ts=2019-02-22T12:52:07.654292427Z caller=main.go:256 msg="Starting Stackdriver Prometheus sidecar" version="(version=0.4.0, branch=master, revision=e246041acf99c8487e1ac73552fb8625339c64a1)"
level=info ts=2019-02-22T12:52:07.654367128Z caller=main.go:257 build_context="(go=go1.11.4, user=kbuilder@kokoro-gcp-ubuntu-prod-217445279, date=20190221-15:24:24)"
level=info ts=2019-02-22T12:52:07.654414564Z caller=main.go:258 host_details="(Linux 4.14.65+ #1 SMP Thu Oct 25 10:42:50 PDT 2018 x86_64 prometheus-84b8bdf44-6kcw8 (none))"
level=info ts=2019-02-22T12:52:07.654645769Z caller=main.go:259 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-02-22T12:52:07.658270228Z caller=main.go:463 msg="Web server started"
level=info ts=2019-02-22T12:52:07.658797109Z caller=main.go:444 msg="Stackdriver client started"
level=info ts=2019-02-22T12:53:10.664382837Z caller=manager.go:150 component="Prometheus reader" msg="Starting Prometheus reader..."
level=info ts=2019-02-22T12:53:10.668043076Z caller=manager.go:211 component="Prometheus reader" msg="reached first record after start offset" start_offset=0 skipped_records=0

When I curl the Prometheus server it should be querying, it does seem to have the metric I'm trying to filter for:

root@myserver:/# curl prometheus:9090/api/v1/query?query=consumer_group_backlog_avg_10m | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   313  100   313    0     0  43745      0 --:--:-- --:--:-- --:--:-- 44714
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "consumer_group_backlog_avg_10m",
          "consumer_group": "media-extractor"
        },
        "value": [
          1550838129.789,
          "892269.2"
        ]
      },
      {
        "metric": {
          "__name__": "consumer_group_backlog_avg_10m",
          "consumer_group": "summarizer"
        },
        "value": [
          1550838129.789,
          "548159.4"
        ]
      }
    ]
  }
}

Does anything seem clearly wrong? What could be the issue? Thanks!

Move monitored resource labels to explicit flags

Currently we still have the stackdriver.global-label flag to inject parameters such as cluster name and location for assembling the monitored resource.
That's essentially a leftover from when the mapping via relabeling was exposed to the user. Currently it seems unclear to users why those values are "global labels".

I think it would make sense to move these parameters to explicit flags, such as stackdriver.k8s.cluster-name and stackdriver.k8s.location. That makes it easier to document and validate. We potentially end up with quite a few flags for each environment we add support for but I don't see that as a downside in general.

If we are confident that location will always be a relevant notion for all monitored resources, it could just become stackdriver.location.
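
For illustration, the proposal amounts to invoking the sidecar roughly like this (a sketch; the flag names are the ones proposed above and the values are placeholders):

# Proposed explicit monitored-resource flags, replacing stackdriver.global-label.
stackdriver-prometheus-sidecar \
  --stackdriver.project-id=my-project \
  --stackdriver.k8s.location=us-central1 \
  --stackdriver.k8s.cluster-name=my-cluster \
  --prometheus.wal-directory=/prometheus/wal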

Federation metrics are not pushed to Stackdriver

I scrape several exporters (for example node-exporter) with Prometheus, as well as the /federate endpoint of another Prometheus instance. I can see all metrics being scraped and collected. When I try to push the metrics to Stackdriver, I can see the metrics originating from the exporters being pushed. However, the metrics from the /federate endpoint are never pushed to Stackdriver. There are no errors in the sidecar logs.

Support newer versions of Prometheus

According to the README, stackdriver-prometheus-sidecar supports Prometheus 2.4.3 - 2.6.x. There have been several releases of Prometheus since 2.6.x, most recently 2.12.0. Could newer versions please be supported?

Thanks.

Unable to use filter with sidecar

I have a Prometheus deployment and am trying to run stackdriver-prometheus-sidecar alongside it.
This is the error I am seeing in the container logs:
labels: {…}
logName: "projects/qa-setup/logs/sidecar"
receiveTimestamp: "2019-05-11T09:53:05.755076285Z"
resource: {…}
severity: "ERROR"
textPayload: "level=error ts=2019-05-11T09:52:30.840948761Z caller=main.go:291 msg="Error parsing filters" err="invalid filter "{__name__~io_harness_custom_metric_learning_engine_clustering_task_queued_time_in_seconds}""
"

Here is the snippet of the YAML file containing the sidecar-related details:
spec:
  containers:
  - args:
    - --stackdriver.project-id=qa-setup
    - --prometheus.wal-directory=/prometheus/wal
    - --prometheus.api-address=http://prometheus-service:8080
    - --stackdriver.kubernetes.location=us-west1-a
    - --stackdriver.kubernetes.cluster-name=qa-setup
    - --stackdriver.generic.location=harness
    - --stackdriver.generic.namespace=harness
    - --filter={__name__~io_harness_custom_metric_learning_engine_clustering_task_queued_time_in_seconds}
    image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar@sha256:8559091f69726faf34d249fafbb0faf7c30bf655af24207d03b787de8940b000
    imagePullPolicy: Always
    name: sidecar
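
For reference, the parse error points at the filter value itself: the matcher is missing an operator and quotes around the value. Assuming the flag expects a Prometheus label matcher, a form like the following should at least parse (shown as a bare flag; quote it as needed when embedding it in the YAML args):

# Exact match on the metric name; use =~ with a quoted regular expression for pattern matching.
--filter=__name__="io_harness_custom_metric_learning_engine_clustering_task_queued_time_in_seconds"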

Count values shoot off to infinity

Some of the metrics from our Kubernetes app sent to Stackdriver via stackdriver-prometheus-sidecar result in Stackdriver graphs that shoot off to infinity. Our alerts are triggered when this happens. Plots for the same time series look normal in Prometheus. I was wondering if anyone else is experiencing this, or could suggest what the cause might be.

(attached screenshots: Selection_065, Selection_067, Selection_069)

Thanks for any assistance you can give.

Confusing VERSION file on master branch

The VERSION file on the master branch is confusing. For example,
commit fc1c9e2 is included in release 0.4.2, but the VERSION file still shows 0.4.0.

#81 pointed out that the VERSION file is used to build the binary. Here is an example of what the binary reports as its version:

$ ./stackdriver-prometheus-sidecar --version
prometheus, version 0.4.0 (branch: master, revision: b9b9b5facf6dc90e233ce93927960d5aee4bf360)
  build user:       [REDACTED]
  build date:       20190517-14:47:05
  go version:       go1.12

I propose that we use latest in VERSION on master, so that anyone who builds the binary from master knows it's not a specific named version.

target refresh failed: unexpected response status: 404 Not Found

Hi,

I am trying to use stackdriver-prometheus-sidecar with my prometheus-server, which was installed using the stable/prometheus Helm chart.

I configured my stackdriver-prometheus-sidecar with --prometheus.api-address=http://127.0.0.1:9090/prometheus/, because the Prometheus deployment uses prefixURL.

When the pod with stackdriver-prometheus-sidecar starts, I see errors like this and zero metrics in Stackdriver:

level=error ts=2018-12-31T07:41:16.764201212Z caller=cache.go:76 msg="refresh failed" err="unexpected response status: 404 Not Found"
level=warn ts=2018-12-31T07:42:13.044253069Z caller=manager.go:243 component="Prometheus reader" msg="Failed to build sample" err="get series information: retrieving target failed: target refresh failed: unexpected response status: 404 Not Found"

As I understand it, this is a problem with targetsURL, because the service tries to get data from /api/v1/targets/metadata without using the prefixURL specified in prometheus.api-address:

targetsURL, err := cfg.prometheusURL.Parse(targets.DefaultAPIEndpoint)
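
A quick way to check that from inside the pod is to request the target-metadata endpoint with and without the prefix (a sketch; the paths follow the prefixURL used above):

# With the prefix the endpoint should answer 200; without it, Prometheus returns the 404 seen in the logs.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9090/prometheus/api/v1/targets/metadata
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9090/api/v1/targets/metadata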

Shard count gets stuck

One problem I ran into when playing with latencies as part of #20: the remote queue increases or decreases shards based on the rate of incoming and outgoing samples.
The problem is that in stackdriver-prometheus{-sidecar} we removed the code that drops incoming samples if the queue is full. The shard computation heuristic was modified in general as well, but I believe the root cause of this issue would have hit either way.

The issue is that we now block until new slots in the queue are freed and we can insert new samples. If the queue ever runs full, that means the inflow rate equals the outflow rate. That messes with the sharding heuristic, which now thinks we don't need more shards.
If the queue fills up before the shard number gets bumped to a number where it can sustain incoming throughput, this means we are forever stuck with too few shards to send our data. This affects stackdriver-prometheus as well.

We have a few options now:

  1. Bring back dropping of samples if the queue is full. We will lose some data at startup for busy servers until the shard count has ramped up.
  2. Drop in-memory queues and find a new heuristic to compute the required number of shards/outgoing connections. For example, based on our WAL read rate (which receives backpressure if we cannot send fast enough) vs its growth rate.
  3. Extend the resharding heuristic to simply bump the shard count by 1 as long as the queue is full.

Probably 3 is simplest and best for the sidecar use case. 2 seems more elegant but is more complex to implement and means more changes in general.
For stackdriver-prometheus we should seriously consider 1 since it can affect current production users.

@jkohen could you provide some insight as well on why we dropped the outflow rate from the heuristic and instead use inflow + timeout? That likely influences what the right approach will be.

Handle metric typing errors gracefully

This is a feature request related to #86 and other reports where users find that the metric type has changed.

This can happen for instance:

  • If one Prometheus client (your application) is exposing the metrics as a GAUGE and another as a CUMULATIVE.
  • If the metric type changes over time.

We could apply best-effort transformations to cast the type to the stored one, but there is no ideal solution and we probably want to emit a warning.
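
Until then, one possible workaround is to declare a single type for the affected metric via static_metadata and pass it with --config-file; a sketch, assuming that declaration takes precedence (the file name and metric name are hypothetical):

cat > sidecar-config.yaml <<'EOF'
static_metadata:
  - metric: my_app_requests_total   # hypothetical metric whose type keeps changing
    type: counter                   # single declared type for the sidecar to use
    help: total requests handled
EOF

stackdriver-prometheus-sidecar \
  --stackdriver.project-id=my-project \
  --prometheus.wal-directory=/prometheus/wal \
  --config-file=sidecar-config.yaml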

TimeSeries could not be written

Hello,

First time setting this up, and I'm getting several warnings of the kind:

level=warn ts=2019-08-27T12:11:47.182Z caller=queue_manager.go:546 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Unrecognized region or location.: timeSeries[0-13]"

And no metrics are written to Stackdriver.

Any ideas what this might be?

Thank you
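
For what it's worth, the 'location' label of the monitored resource comes from the sidecar's --stackdriver.kubernetes.* flags, so this error may simply mean their values are missing or are not a valid GCP region or zone; a sketch (values are placeholders):

# The location and cluster-name flags populate the monitored-resource labels.
stackdriver-prometheus-sidecar \
  --stackdriver.project-id=my-project \
  --stackdriver.kubernetes.location=us-central1 \
  --stackdriver.kubernetes.cluster-name=my-cluster \
  --prometheus.wal-directory=/prometheus/wal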

prometheus is crashing after sidecar injection

Error message: Tailing WAL failed: retrieve last checkpoint: open /data/wal: no such file or directory

export KUBE_NAMESPACE=monitoring
export GCP_PROJECT=<project_name>
export GCP_REGION=us-central1
export KUBE_CLUSTER=standard-cluster-1
export SIDECAR_IMAGE_TAG=release-0.4.0
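
The sidecar reads the WAL from the path given by --prometheus.wal-directory (the default is data/wal, per the help text further down), so it has to point at the wal directory inside the volume the Prometheus container actually writes to. A sketch, assuming that volume is mounted at /prometheus:

# Adjust /prometheus to match the mountPath of the Prometheus data volume.
stackdriver-prometheus-sidecar \
  --stackdriver.project-id="${GCP_PROJECT}" \
  --stackdriver.kubernetes.location="${GCP_REGION}" \
  --stackdriver.kubernetes.cluster-name="${KUBE_CLUSTER}" \
  --prometheus.wal-directory=/prometheus/wal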

prometheus operator values.yaml

# Default values for prometheus-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

## Provide a name in place of prometheus-operator for `app:` labels
##
nameOverride: ""

## Provide a name to substitute for the full names of resources
##
fullnameOverride: ""

## Labels to apply to all resources
##
commonLabels: {}
# scmhash: abc123
# myLabel: aakkmd

## Create default rules for monitoring the cluster
##
defaultRules:
  create: true
  ## Labels for default rules
  labels: {}
  ## Annotations for default rules
  annotations: {}

##
global:
  rbac:
    create: true
    pspEnabled: true

  ## Reference to one or more secrets to be used when pulling images
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  imagePullSecrets: []
  # - name: "image-pull-secret"

## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
##
alertmanager:

  ## Deploy alertmanager
  ##
  enabled: true

  ## Service account for Alertmanager to use.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
  ##
  serviceAccount:
    create: true
    name: ""

  ## Configure pod disruption budgets for Alertmanager
  ## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
  ## This configuration is immutable once created and will require the PDB to be deleted to be changed
  ## https://github.com/kubernetes/kubernetes/issues/45398
  ##
  podDisruptionBudget:
    enabled: false
    minAvailable: 1
    maxUnavailable: ""

  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  ##
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
      - match:
          alertname: DeadMansSwitch
        receiver: 'null'
    receivers:
    - name: 'null'

  ## Alertmanager template files to format alerts
  ## ref: https://prometheus.io/docs/alerting/notifications/
  ##      https://prometheus.io/docs/alerting/notification_examples/
  ##
  templateFiles: {}
  #
  # An example template:
  #   template_1.tmpl: |-
  #       {{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}
  #
  #       {{ define "slack.myorg.text" }}
  #       {{- $root := . -}}
  #       {{ range .Alerts }}
  #         *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
  #         *Cluster:*  {{ template "cluster" $root }}
  #         *Description:* {{ .Annotations.description }}
  #         *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:>
  #         *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
  #         *Details:*
  #           {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
  #           {{ end }}

  ingress:
    enabled: false

    annotations: {}

    labels: {}

    ## Hosts must be provided if Ingress is enabled.
    ##
    hosts: []
      # - alertmanager.domain.com

    ## TLS configuration for Alertmanager Ingress
    ## Secret must be manually created in the namespace
    ##
    tls: []
    # - secretName: alertmanager-general-tls
    #   hosts:
    #   - alertmanager.example.com

  ## Configuration for Alertmanager service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

  ## Port to expose on each node
  ## Only used if service.type is 'NodePort'
  ##
    nodePort: 30903
  ## List of IP addresses at which the Prometheus server service is available
  ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
  ##
    externalIPs: []
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    ## Service type
    ##
    type: ClusterIP

  ## If true, create a serviceMonitor for alertmanager
  ##
  serviceMonitor:
    selfMonitor: true

  ## Settings affecting alertmanagerSpec
  ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#alertmanagerspec
  ##
  alertmanagerSpec:
    ## Standard object’s metadata. More info: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata
    ## Metadata Labels and Annotations gets propagated to the Alertmanager pods.
    ##
    podMetadata: {}

    ## Image of Alertmanager
    ##
    image:
      repository: quay.io/prometheus/alertmanager
      tag: v0.15.3

    ## Secrets is a list of Secrets in the same namespace as the Alertmanager object, which shall be mounted into the
    ## Alertmanager Pods. The Secrets are mounted into /etc/alertmanager/secrets/.
    ##
    secrets: []

    ## ConfigMaps is a list of ConfigMaps in the same namespace as the Alertmanager object, which shall be mounted into the Alertmanager Pods.
    ## The ConfigMaps are mounted into /etc/alertmanager/configmaps/.
    ##
    configMaps: []

    ## Log level for Alertmanager to be configured with.
    ##
    logLevel: info

    ## Size is the expected size of the alertmanager cluster. The controller will eventually make the size of the
    ## running cluster equal to the expected size.
    replicas: 1

    ## Time duration Alertmanager shall retain data for. Default is '120h', and must match the regular expression
    ## [0-9]+(ms|s|m|h) (milliseconds seconds minutes hours).
    ##
    retention: 120h

    ## Storage is the definition of how storage will be used by the Alertmanager instances.
    ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/storage.md
    ##
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    


    ## 	The external URL the Alertmanager instances will be available under. This is necessary to generate correct URLs. This is necessary if Alertmanager is not served from root of a DNS name.	string	false
    ##
    externalUrl:

    ## 	The route prefix Alertmanager registers HTTP handlers for. This is useful, if using ExternalURL and a proxy is rewriting HTTP routes of a request, and the actual ExternalURL is still true,
    ## but the server serves requests under a different route prefix. For example for use with kubectl proxy.
    ##
    routePrefix: /

    ## If set to true all actions on the underlying managed objects are not going to be performed, except for delete actions.
    ##
    paused: false

    ## Define which Nodes the Pods are scheduled on.
    ## ref: https://kubernetes.io/docs/user-guide/node-selection/
    ##
    nodeSelector: {}

    ## Define resources requests and limits for single Pods.
    ## ref: https://kubernetes.io/docs/user-guide/compute-resources/
    ##
    resources: {}
    # requests:
    #   memory: 400Mi

    ## Pod anti-affinity can prevent the scheduler from placing Prometheus replicas on the same node.
    ## The default value "soft" means that the scheduler should *prefer* to not schedule two replica pods onto the same node but no guarantee is provided.
    ## The value "hard" means that the scheduler is *required* to not schedule two replica pods onto the same node.
    ## The value "" will disable pod anti-affinity so that no anti-affinity rules will be configured.
    ##
    podAntiAffinity: ""

    ## If anti-affinity is enabled sets the topologyKey to use for anti-affinity.
    ## This can be changed to, for example, failure-domain.beta.kubernetes.io/zone
    ##
    podAntiAffinityTopologyKey: kubernetes.io/hostname

    ## If specified, the pod's tolerations.
    ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    ##
    tolerations: []
    # - key: "key"
    #   operator: "Equal"
    #   value: "value"
    #   effect: "NoSchedule"

    ## SecurityContext holds pod-level security attributes and common container settings.
    ## This defaults to non root user with uid 1000 and gid 2000.	*v1.PodSecurityContext	false
    ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
    ##
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      fsGroup: 2000

    ## ListenLocal makes the Alertmanager server listen on loopback, so that it does not bind against the Pod IP.
    ## Note this is only for the Alertmanager UI, not the gossip communication.
    ##
    listenLocal: false

    ## Containers allows injecting additional containers. This is meant to allow adding an authentication proxy to an Alertmanager pod.
    ##
    containers: []

    ## Priority class assigned to the Pods
    ##
    priorityClassName: ""

    ## AdditionalPeers allows injecting a set of additional Alertmanagers to peer with to form a highly available cluster.
    ##
    additionalPeers: []

## Using default values from https://github.com/helm/charts/blob/master/stable/grafana/values.yaml
##
grafana:
  enabled: true

  ## Deploy default dashboards.
  ##
  defaultDashboardsEnabled: true

  adminPassword: prom-operator

  ingress:
    ## If true, Prometheus Ingress will be created
    ##
    enabled: false

    ## Annotations for Prometheus Ingress
    ##
    annotations: {}
      # kubernetes.io/ingress.class: nginx
      # kubernetes.io/tls-acme: "true"

    ## Labels to be added to the Ingress
    ##
    labels: {}

    ## Hostnames.
    ## Must be provided if Ingress is enable.
    ##
    # hosts:
    #   - prometheus.domain.com
    hosts: []

    ## TLS configuration for prometheus Ingress
    ## Secret must be manually created in the namespace
    ##
    tls: []
    # - secretName: prometheus-general-tls
    #   hosts:
    #   - prometheus.example.com

  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
    datasources:
      enabled: true
      label: grafana_datasource

  extraConfigmapMounts: []
  # - name: certs-configmap
  #   mountPath: /etc/grafana/ssl/
  #   configMap: certs-configmap
  #   readOnly: true


## Component scraping the kube api server
##
kubeApiServer:
  enabled: true
  tlsConfig:
    serverName: kubernetes
    insecureSkipVerify: false

  serviceMonitor:
    jobLabel: component
    selector:
      matchLabels:
        component: apiserver
        provider: kubernetes

## Component scraping the kubelet and kubelet-hosted cAdvisor
##
kubelet:
  enabled: true
  namespace: kube-system

  serviceMonitor:
    ## Enable scraping the kubelet over https. For requirements to enable this see
    ## https://github.com/coreos/prometheus-operator/issues/926
    ##
    https: true

## Component scraping the kube controller manager
##
kubeControllerManager:
  enabled: true

  ## If your kube controller manager is not deployed as a pod, specify IPs it can be found on
  ##
  endpoints: []
  # - 10.141.4.22
  # - 10.141.4.23
  # - 10.141.4.24

  ## If using kubeControllerManager.endpoints only the port and targetPort are used
  ##
  service:
    port: 10252
    targetPort: 10252
    selector:
      k8s-app: kube-controller-manager
## Component scraping coreDns. Use either this or kubeDns
##
coreDns:
  enabled: true
  service:
    port: 9153
    targetPort: 9153
    selector:
      k8s-app: coredns

## Component scraping kubeDns. Use either this or coreDns
##
kubeDns:
  enabled: false
  service:
    selector:
      k8s-app: kube-dns
## Component scraping etcd
##
kubeEtcd:
  enabled: true

  ## If your etcd is not deployed as a pod, specify IPs it can be found on
  ##
  endpoints: []
  # - 10.141.4.22
  # - 10.141.4.23
  # - 10.141.4.24

  ## Etcd service. If using kubeEtcd.endpoints only the port and targetPort are used
  ##
  service:
    port: 4001
    targetPort: 4001
    selector:
      k8s-app: etcd-server

  ## Configure secure access to the etcd cluster by loading a secret into prometheus and
  ## specifying security configuration below. For example, with a secret named etcd-client-cert
  ##
  ## serviceMonitor:
  ##   scheme: https
  ##   insecureSkipVerify: false
  ##   serverName: localhost
  ##   caFile: /etc/prometheus/secrets/etcd-client-cert/etcd-ca
  ##   certFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client
  ##   keyFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client-key
  ##
  serviceMonitor:
    scheme: http
    insecureSkipVerify: false
    serverName: ""
    caFile: ""
    certFile: ""
    keyFile: ""


## Component scraping kube scheduler
##
kubeScheduler:
  enabled: true

  ## If your kube scheduler is not deployed as a pod, specify IPs it can be found on
  ##
  endpoints: []
  # - 10.141.4.22
  # - 10.141.4.23
  # - 10.141.4.24

  ## If using kubeScheduler.endpoints only the port and targetPort are used
  ##
  service:
    port: 10251
    targetPort: 10251
    selector:
      k8s-app: kube-scheduler

## Component scraping kube state metrics
##
kubeStateMetrics:
  enabled: true

## Configuration for kube-state-metrics subchart
##
kube-state-metrics:
  rbac:
    create: true
  podSecurityPolicy:
    enabled: true

## Deploy node exporter as a daemonset to all nodes
##
nodeExporter:
  enabled: true

  ## Use the value configured in prometheus-node-exporter.podLabels
  ##
  jobLabel: jobLabel

## Configuration for prometheus-node-exporter subchart
##
prometheus-node-exporter:
  podLabels:
    ## Add the 'node-exporter' label to be used by serviceMonitor to match standard common usage in rules and grafana dashboards
    ##
    jobLabel: node-exporter
  extraArgs:
    - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
    - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$

## Manages Prometheus and Alertmanager components
##
prometheusOperator:
  enabled: true

  ## Service account for Alertmanager to use.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
  ##
  serviceAccount:
    create: true
    name: ""

  ## Configuration for Prometheus operator service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

  ## Port to expose on each node
  ## Only used if service.type is 'NodePort'
  ##
    nodePort: 38080


  ## Loadbalancer IP
  ## Only use if service.type is "loadbalancer"
  ##
    loadBalancerIP: ""
    loadBalancerSourceRanges: []

  ## Service type
  ## NodepPort, ClusterIP, loadbalancer
  ##
    type: ClusterIP

    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []

  ## Deploy CRDs used by Prometheus Operator.
  ##
  createCustomResource: true

  ## Customize CRDs API Group
  crdApiGroup: monitoring.coreos.com

  ## Attempt to clean up CRDs created by Prometheus Operator.
  ##
  cleanupCustomResource: false

  ## Labels to add to the operator pod
  ##
  podLabels: {}

  ## Assign a PriorityClassName to pods if set
  # priorityClassName: ""

  ## If true, the operator will create and maintain a service for scraping kubelets
  ## ref: https://github.com/coreos/prometheus-operator/blob/master/helm/prometheus-operator/README.md
  ##
  kubeletService:
    enabled: true
    namespace: kube-system

  ## Create a servicemonitor for the operator
  ##
  serviceMonitor:
    selfMonitor: true

  ## Resource limits & requests
  ##
  resources: {}
  # limits:
  #   cpu: 200m
  #   memory: 200Mi
  # requests:
  #   cpu: 100m
  #   memory: 100Mi

  ## Define which Nodes the Pods are scheduled on.
  ## ref: https://kubernetes.io/docs/user-guide/node-selection/
  ##
  nodeSelector: {}

  ## Tolerations for use with node taints
  ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  ##
  tolerations: []
  # - key: "key"
  #   operator: "Equal"
  #   value: "value"
  #   effect: "NoSchedule"

  ## Assign the prometheus operator to run on specific nodes
  ## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
  ##
  affinity: {}
  # requiredDuringSchedulingIgnoredDuringExecution:
  #   nodeSelectorTerms:
  #   - matchExpressions:
  #     - key: kubernetes.io/e2e-az-name
  #       operator: In
  #       values:
  #       - e2e-az1
  #       - e2e-az2

  securityContext:
    runAsNonRoot: true
    runAsUser: 65534

  ## Prometheus-operator image
  ##
  image:
    repository: quay.io/coreos/prometheus-operator
    tag: v0.26.0
    pullPolicy: IfNotPresent

  ## Configmap-reload image to use for reloading configmaps
  ##
  configmapReloadImage:
    repository: quay.io/coreos/configmap-reload
    tag: v0.0.1

  ## Prometheus-config-reloader image to use for config and rule reloading
  ##
  prometheusConfigReloaderImage:
    repository: quay.io/coreos/prometheus-config-reloader
    tag: v0.26.0

  ## Hyperkube image to use when cleaning up
  ##
  hyperkubeImage:
    repository: k8s.gcr.io/hyperkube
    tag: v1.12.1
    pullPolicy: IfNotPresent

## Deploy a Prometheus instance
##
prometheus:

  enabled: true

  ## Service account for Prometheuses to use.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
  ##
  serviceAccount:
    create: true
    name: ""

  ## Configuration for Prometheus service
  ##
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    ##
    externalIPs: []

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 39090

    ## Loadbalancer IP
    ## Only use if service.type is "loadbalancer"
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    ## Service type
    ##
    type: ClusterIP

  rbac:
    ## Create role bindings in the specified namespaces, to allow Prometheus monitoring
    ## a role binding in the release namespace will always be created.
    ##
    roleNamespaces:
      - kube-system

  ## Configure pod disruption budgets for Prometheus
  ## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
  ## This configuration is immutable once created and will require the PDB to be deleted to be changed
  ## https://github.com/kubernetes/kubernetes/issues/45398
  ##
  podDisruptionBudget:
    enabled: false
    minAvailable: 1
    maxUnavailable: ""

  ingress:
    enabled: false
    annotations: {}
    labels: {}

    ## Hostnames.
    ## Must be provided if Ingress is enabled.
    ##
    # hosts:
    #   - prometheus.domain.com
    hosts: []

    ## TLS configuration for Prometheus Ingress
    ## Secret must be manually created in the namespace
    ##
    tls: []
      # - secretName: prometheus-general-tls
      #   hosts:
      #     - prometheus.example.com

  serviceMonitor:
    selfMonitor: true

  ## Settings affecting prometheusSpec
  ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#prometheusspec
  ##
  prometheusSpec:

    ## Interval between consecutive scrapes.
    ##
    scrapeInterval: ""

    ## Interval between consecutive evaluations.
    ##
    evaluationInterval: ""

    ## ListenLocal makes the Prometheus server listen on loopback, so that it does not bind against the Pod IP.
    ##
    listenLocal: false

    ## Image of Prometheus.
    ##
    image:
      repository: quay.io/prometheus/prometheus
      tag: v2.5.0

    #  repository: quay.io/coreos/prometheus
    #  tag: v2.5.0

    ## Tolerations for use with node taints
    ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    ##
    tolerations: []
    #  - key: "key"
    #    operator: "Equal"
    #    value: "value"
    #    effect: "NoSchedule"

    ## Alertmanagers to which alerts will be sent
    ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#alertmanagerendpoints
    ##
    ## Default configuration will connect to the alertmanager deployed as part of this release
    ##
    alertingEndpoints: []
    # - name: ""
    #   namespace: ""
    #   port: http
    #   scheme: http

    ## External labels to add to any time series or alerts when communicating with external systems
    ##
    externalLabels: {}

    ## External URL at which Prometheus will be reachable.
    ##
    externalUrl: ""

    ## Define which Nodes the Pods are scheduled on.
    ## ref: https://kubernetes.io/docs/user-guide/node-selection/
    ##
    nodeSelector: {}

    ## Secrets is a list of Secrets in the same namespace as the Prometheus object, which shall be mounted into the Prometheus Pods.
    ## The Secrets are mounted into /etc/prometheus/secrets/. Secrets changes after initial creation of a Prometheus object are not
    ## reflected in the running Pods. To change the secrets mounted into the Prometheus Pods, the object must be deleted and recreated
    ## with the new list of secrets.
    ##
    secrets: []

    ## ConfigMaps is a list of ConfigMaps in the same namespace as the Prometheus object, which shall be mounted into the Prometheus Pods.
    ## The ConfigMaps are mounted into /etc/prometheus/configmaps/.
    ##
    configMaps: []

    ## Namespaces to be selected for PrometheusRules discovery.
    ## If unspecified, only the same namespace as the Prometheus object is in is used.
    ##
    ruleNamespaceSelector: {}

    ## If true, a nil or {} value for prometheus.prometheusSpec.ruleSelector will cause the
    ## prometheus resource to be created with selectors based on values in the helm deployment,
    ## which will also match the PrometheusRule resources created
    ##
    ruleSelectorNilUsesHelmValues: true

    ## Rules CRD selector
    ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/design.md
    ## If unspecified the release `app` and `release` will be used as the label selector
    ## to load rules
    ##
    ruleSelector: {}
    ## Example which select all prometheusrules resources
    ## with label "prometheus" with values any of "example-rules" or "example-rules-2"
    # ruleSelector:
    #   matchExpressions:
    #     - key: prometheus
    #       operator: In
    #       values:
    #         - example-rules
    #         - example-rules-2
    #
    ## Example which select all prometheusrules resources with label "role" set to "example-rules"
    # ruleSelector:
    #   matchLabels:
    #     role: example-rules

    ## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
    ## prometheus resource to be created with selectors based on values in the helm deployment,
    ## which will also match the servicemonitors created
    ##
    serviceMonitorSelectorNilUsesHelmValues: true

    ## serviceMonitorSelector will limit which servicemonitors are used to create scrape
    ## configs in Prometheus. See serviceMonitorSelectorUseHelmLabels
    ##
    serviceMonitorSelector: {}

    # serviceMonitorSelector: {}
    #   matchLabels:
    #     prometheus: somelabel

    ## serviceMonitorNamespaceSelector will limit namespaces from which serviceMonitors are used to create scrape
    ## configs in Prometheus. By default all namespaces will be used
    ##
    serviceMonitorNamespaceSelector: {}

    ## How long to retain metrics
    ##
    retention: 10d

    ## If true, the Operator won't process any Prometheus configuration changes
    ##
    paused: false

    ## Number of Prometheus replicas desired
    ##
    replicas: 1

    ## Log level for Prometheus be configured in
    ##
    logLevel: info

    ## Prefix used to register routes, overriding externalUrl route.
    ## Useful for proxies that rewrite URLs.
    ##
    routePrefix: /

    ## Standard object’s metadata. More info: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata
    ## Metadata Labels and Annotations gets propagated to the prometheus pods.
    ##
    podMetadata: {}
    # labels:
    #   app: prometheus
    #   k8s-app: prometheus

    ## Pod anti-affinity can prevent the scheduler from placing Prometheus replicas on the same node.
    ## The default value "soft" means that the scheduler should *prefer* to not schedule two replica pods onto the same node but no guarantee is provided.
    ## The value "hard" means that the scheduler is *required* to not schedule two replica pods onto the same node.
    ## The value "" will disable pod anti-affinity so that no anti-affinity rules will be configured.
    podAntiAffinity: ""

    ## If anti-affinity is enabled sets the topologyKey to use for anti-affinity.
    ## This can be changed to, for example, failure-domain.beta.kubernetes.io/zone
    ##
    podAntiAffinityTopologyKey: kubernetes.io/hostname

    ## The remote_read spec configuration for Prometheus.
    ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#remotereadspec
    remoteRead: {}
    # - url: http://remote1/read

    ## The remote_write spec configuration for Prometheus.
    ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#remotewritespec
    remoteWrite: {}
      # remoteWrite:
      #   - url: http://remote1/push

    ## Resource limits & requests
    ##
    resources: {}
    # requests:
    #   memory: 400Mi

    ## Prometheus StorageSpec for persistent data
    ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/storage.md
    ##
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    

    ## AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
    ## are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
    ## as specified in the official Prometheus documentation:
    ## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#<scrape_config>. As scrape configs are
    ## appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
    ## to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
    ## scrape configs are going to break Prometheus after the upgrade.
    ##
    ## The scrape configuraiton example below will find master nodes, provided they have the name .*mst.*, relabel the
    ## port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
    ##
    additionalScrapeConfigs: []
    # - job_name: kube-etcd
    #   kubernetes_sd_configs:
    #     - role: node
    #   scheme: https
    #   tls_config:
    #     ca_file:   /etc/prometheus/secrets/etcd-client-cert/etcd-ca
    #     cert_file: /etc/prometheus/secrets/etcd-client-cert/etcd-client
    #     key_file:  /etc/prometheus/secrets/etcd-client-cert/etcd-client-key
    #   relabel_configs:
    #   - action: labelmap
    #     regex: __meta_kubernetes_node_label_(.+)
    #   - source_labels: [__address__]
    #     action: replace
    #     target_label: __address__
    #     regex: ([^:;]+):(\d+)
    #     replacement: ${1}:2379
    #   - source_labels: [__meta_kubernetes_node_name]
    #     action: keep
    #     regex: .*mst.*
    #   - source_labels: [__meta_kubernetes_node_name]
    #     action: replace
    #     target_label: node
    #     regex: (.*)
    #     replacement: ${1}
    #   metric_relabel_configs:
    #   - regex: (kubernetes_io_hostname|failure_domain_beta_kubernetes_io_region|beta_kubernetes_io_os|beta_kubernetes_io_arch|beta_kubernetes_io_instance_type|failure_domain_beta_kubernetes_io_zone)
    #     action: labeldrop


    ## AdditionalAlertManagerConfigs allows for manual configuration of alertmanager jobs in the form as specified
    ## in the official Prometheus documentation https://prometheus.io/docs/prometheus/latest/configuration/configuration/#<alertmanager_config>.
    ## AlertManager configurations specified are appended to the configurations generated by the Prometheus Operator.
    ## As AlertManager configs are appended, the user is responsible to make sure it is valid. Note that using this
    ## feature may expose the possibility to break upgrades of Prometheus. It is advised to review Prometheus release
    ## notes to ensure that no incompatible AlertManager configs are going to break Prometheus after the upgrade.
    ##
    additionalAlertManagerConfigs: []
    # - consul_sd_configs:
    #   - server: consul.dev.test:8500
    #     scheme: http
    #     datacenter: dev
    #     tag_separator: ','
    #     services:
    #       - metrics-prometheus-alertmanager

    ## AdditionalAlertRelabelConfigs allows specifying Prometheus alert relabel configurations. Alert relabel configurations specified are appended
    ## to the configurations generated by the Prometheus Operator. Alert relabel configurations specified must have the form as specified in the
    ## official Prometheus documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alert_relabel_configs.
    ## As alert relabel configs are appended, the user is responsible to make sure it is valid. Note that using this feature may expose the
    ## possibility to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible alert relabel
    ## configs are going to break Prometheus after the upgrade.
    ##
    additionalAlertRelabelConfigs: []
    # - separator: ;
    #   regex: prometheus_replica
    #   replacement: $1
    #   action: labeldrop

    ## SecurityContext holds pod-level security attributes and common container settings.
    ## This defaults to non root user with uid 1000 and gid 2000.
    ## https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md
    ##
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      fsGroup: 2000

    ## 	Priority class assigned to the Pods
    ##
    priorityClassName: ""

    ## Thanos configuration allows configuring various aspects of a Prometheus server in a Thanos environment.
    ## This section is experimental, it may change significantly without deprecation notice in any release.
    ## This is experimental and may change significantly without backward compatibility in any release.
    ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#thanosspec
    ##
    thanos: {}

    ## Containers allows injecting additional containers. This is meant to allow adding an authentication proxy to a Prometheus pod.
    ##
    containers: []

    ## Enable additional scrape configs that are managed externally to this chart. Note that the prometheus
    ## will fail to provision if the correct secret does not exist.
    ##
    additionalScrapeConfigsExternal: false

  additionalServiceMonitors: []
  ## Name of the ServiceMonitor to create
  ##
  # - name: ""

    ## Additional labels to set used for the ServiceMonitorSelector. Together with standard labels from
    ## the chart
    ##
    # additionalLabels: {}

    ## Service label for use in assembling a job name of the form <label value>-<port>
    ## If no label is specified, the service name is used.
    ##
    # jobLabel: ""

    ## Label selector for services to which this ServiceMonitor applies
    ##
    # selector: {}

    ## Namespaces from which services are selected
    ##
    # namespaceSelector:
      ## Match any namespace
      ##
      # any: false

      ## Explicit list of namespace names to select
      ##
      # matchNames: []

    ## Endpoints of the selected service to be monitored
    ##
    # endpoints: []
      ## Name of the endpoint's service port
      ## Mutually exclusive with targetPort
      # - port: ""

      ## Name or number of the endpoint's target port
      ## Mutually exclusive with port
      # - targetPort: ""

      ## File containing bearer token to be used when scraping targets
      ##
      #   bearerTokenFile: ""

      ## Interval at which metrics should be scraped
      ##
      #   interval: 30s

      ## HTTP path to scrape for metrics
      ##
      #   path: /metrics

      ## HTTP scheme to use for scraping
      ##
      #   scheme: http

      ## TLS configuration to use when scraping the endpoint
      ##
      #   tlsConfig:

          ## Path to the CA file
          ##
          # caFile: ""

          ## Path to client certificate file
          ##
          # certFile: ""

          ## Skip certificate verification
          ##
          # insecureSkipVerify: false

          ## Path to client key file
          ##
          # keyFile: ""

          ## Server name used to verify host name
          ##
          # serverName: ""

permission denied despite creating service account

I followed these steps to set up my Prometheus + Stackdriver stack.

level=info ts=2019-08-04T04:21:28.50604042Z caller=main.go:296 msg="Starting Stackdriver Prometheus sidecar" version="(version=HEAD, branch=master, revision=453838cff46ee8a17f7675696a97256475bb39e7)"
level=info ts=2019-08-04T04:21:28.506422485Z caller=main.go:297 build_context="(go=go1.12, user=kbuilder@kokoro-gcp-ubuntu-prod-1535194210, date=20190520-14:47:15)"
level=info ts=2019-08-04T04:21:28.506537834Z caller=main.go:298 host_details="(Linux 4.14.127+ #1 SMP Tue Jun 18 23:08:40 PDT 2019 x86_64 prometheus-prometheus-0 (none))"
level=info ts=2019-08-04T04:21:28.506674208Z caller=main.go:299 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-08-04T04:21:28.512748856Z caller=main.go:564 msg="Web server started"
level=info ts=2019-08-04T04:21:28.516468444Z caller=main.go:545 msg="Stackdriver client started"
level=info ts=2019-08-04T04:22:31.518073511Z caller=manager.go:153 component="Prometheus reader" msg="Starting Prometheus reader..."
level=info ts=2019-08-04T04:22:31.531530003Z caller=manager.go:215 component="Prometheus reader" msg="reached first record after start offset" start_offset=0 skipped_records=0
level=warn ts=2019-08-04T04:22:31.631923445Z caller=queue_manager.go:546 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = PermissionDenied desc = Permission monitoring.timeSeries.create denied (or the resource may not exist)."
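For what it's worth, the permission named in the error (monitoring.timeSeries.create) is included in the roles/monitoring.metricWriter role. A minimal sketch of granting it, assuming the sidecar authenticates as the GKE node service account (the project ID and service-account email are placeholders):

# Grant the metric-writer role to the service account the sidecar runs as.
# PROJECT_ID and SA_EMAIL are placeholders; on GKE without Workload Identity
# or a mounted key, SA_EMAIL is typically the node pool's service account.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/monitoring.metricWriter"

If the node pool was created with restricted OAuth scopes, the monitoring write scope may also need to be enabled on the nodes; that part is an assumption about this particular setup.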

Document how to use --filter to send specific metrics

#45 implements the filtering mechanism: by passing --filter=... repeatedly, we can elect to send only specific metrics to Stackdriver.

This doesn't seem to be documented anywhere (except in the help text). It would be nice to have a few examples in place so that someone can easily set this up.
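A hedged example of what such documentation could show, based on the help text quoted below and on the example in the "Add filtering mechanism" issue further down (the project ID, job name, and metric pattern are placeholders; all matchers must pass for a series to be forwarded):

# Forward only series from the kubelet job whose metric names start with "container_".
./stackdriver-prometheus-sidecar \
  --stackdriver.project-id=PROJECT_ID \
  --prometheus.wal-directory=/prometheus/wal \
  --filter='job="kubelet"' \
  --filter='__name__=~"container_.*"'

Note that when the flags are placed directly in a Kubernetes args: list, the surrounding single quotes should be dropped, as discussed in the DNS service discovery issue below.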

--include flag deprecated?

Running go get github.com/Stackdriver/stackdriver-prometheus-sidecar/... fetches an unknown version of the binary. This is indicated by running stackdriver-prometheus-sidecar --version, which returns:

prometheus, version  (branch: , revision: )
  build user:
  build date:
  go version:       go1.12.5

When running --help we have the following output:

stackdriver-prometheus-sidecar --help
usage: stackdriver-prometheus-sidecar --stackdriver.project-id=STACKDRIVER.PROJECT-ID [<flags>]

The Prometheus monitoring server

Flags:
  -h, --help                     Show context-sensitive help (also try --help-long and --help-man).
      --version                  Show application version.
      --config-file=CONFIG-FILE  A configuration file.
      --stackdriver.project-id=STACKDRIVER.PROJECT-ID
                                 The Google project ID where Stackdriver will store the metrics.
      --stackdriver.api-address=https://monitoring.googleapis.com:443/
                                 Address of the Stackdriver Monitoring API.
      --stackdriver.kubernetes.location=STACKDRIVER.KUBERNETES.LOCATION
                                 Value of the 'location' label in the Kubernetes Stackdriver MonitoredResources.
      --stackdriver.kubernetes.cluster-name=STACKDRIVER.KUBERNETES.CLUSTER-NAME
                                 Value of the 'cluster_name' label in the Kubernetes Stackdriver MonitoredResources.
      --stackdriver.generic.location=STACKDRIVER.GENERIC.LOCATION
                                 Location for metrics written with the generic resource, e.g. a cluster or data center name.
      --stackdriver.generic.namespace=STACKDRIVER.GENERIC.NAMESPACE
                                 Namespace for metrics written with the generic resource, e.g. a cluster or data center name.
      --stackdriver.metrics-prefix=STACKDRIVER.METRICS-PREFIX
                                 Customized prefix for Stackdriver metrics. If not set, external.googleapis.com/prometheus will be used
      --stackdriver.use-gke-resource
                                 Whether to use the legacy gke_container MonitoredResource type instead of k8s_container
      --prometheus.wal-directory="data/wal"
                                 Directory from where to read the Prometheus TSDB WAL.
      --prometheus.api-address=http://127.0.0.1:9090/
                                 Address to listen on for UI, API, and telemetry.
      --web.listen-address="0.0.0.0:9091"
                                 Address to listen on for UI, API, and telemetry.
      --filter=FILTER ...        PromQL-style label matcher which must pass for a series to be forwarded to Stackdriver. May be repeated.
      --log.level=info           Only log messages with the given severity or above. One of: [debug, info, warn, error]

Note that there's no flag for --include, which is still referenced in the main README.md for the repo. Has this capability been deprecated?

Maintain series cache

I've been thinking about how we can sanely clean up cached series (and their metadata) that we hold in memory.

  1. Naively, we could just drop all state every few hours and re-read the WAL from the beginning. We could store the last timestamp t we send to Stackdriver and ignore all samples before t - 5m (or similar margin). That's fuzzy but would be relatively safe. More accurately, we could extend the tailer to report the current (segment, offset) (tricky with buffered reader) and ignore samples before the last offset.

  2. We do not interrupt the current tailing. Instead, we start a background process that does a single scan over all current segments (w/o tailing) and only looks at series records. We then drop all series in the cache that we didn't see in that scan. To be on the safe side, we probably have to do a check (synchronized with the tailer) of the currently highest series ID to ensure we don't garbage collect series inserted after we completed our scan. We may also need some handling in reading the last segment without the last page falsely signaling a corruption.

The second option seems simpler and less disruptive.

Log when the metadata lookup fails

This is useful for debugging, especially when the lookup fails because the job and instance labels can't be found. This is a common mistake when setting up recording rules and can also be caused by relabeling. We need to be careful not to spam the log files.

Tailing WAL failed

This is specifically not working on the 0.5.1 release (the only one tested so far); 0.4.1 works.

I believe there's an issue relating to how the path for the checkpoint file is being resolved. The culprit is the following:

t.cur, err = wal.NewSegmentsReader(filepath.Join(dir, cpdir))

Why are we joining dir and cpdir?

We set something like --prometheus.wal-directory=/prometheus/wal, which is used as dir, while cpdir is returned from tsdb.LastCheckpoint(dir), which returns the full path of the checkpoint found. So we end up concatenating the two into the following:

/prometheus/wal + /prometheus/wal/checkpoint.002016

I'm getting the following errors in this scenario:

level=info ts=2019-09-02T21:52:23.197Z caller=main.go:303 msg="Starting Stackdriver Prometheus sidecar" version="(version=, branch=, revision=)"
level=info ts=2019-09-02T21:52:23.197Z caller=main.go:304 build_context="(go=go1.12, user=, date=)"
level=info ts=2019-09-02T21:52:23.197Z caller=main.go:305 host_details="(Linux 4.14.127+ #1 SMP Tue Jun 18 23:08:40 PDT 2019 x86_64 prometheus-777fd6c946-hrchs (none))"
level=info ts=2019-09-02T21:52:23.197Z caller=main.go:306 fd_limits="(soft=1048576, hard=1048576)"
level=error ts=2019-09-02T21:52:23.211Z caller=main.go:394 msg="Tailing WAL failed" err="open checkpoint: list segment in dir:/prometheus/wal/prometheus/wal/checkpoint.002016: open /prometheus/wal/prometheus/wal/checkpoint.002016: no such file or directory"

Can someone advise if I'm just configuring this incorrectly or not?

The configuration I'm using for the sidecar is as follows:

- name: sidecar
    image: gcr.io/xxx-xxx-xxx/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.5.1
    imagePullPolicy: Always
    args:
    - "--stackdriver.project-id=xxx-xxx-xxx"
    - "--prometheus.wal-directory=/prometheus/wal"
    - "--stackdriver.kubernetes.location=x-xxx1"
    - "--stackdriver.kubernetes.cluster-name=xxx-xxx-xxx"
    - "--stackdriver.generic.location=xxx-xxx-xxx"
    ports:
    - name: sidecar
      containerPort: 9091
    volumeMounts:
    - name: storage-volume
      mountPath: /prometheus

Parsing order

Hi all,

I've been implementing the sidecar across multiple environments.
On my tests I've come across an issue that i'd like to bring up.

level=debug ts=2019-01-18T18:20:44.295817108Z caller=client.go:162 component=storage msg="Partial failure calling CreateTimeSeries" err="rpc error: code = InvalidArgument desc = Field timeSeries[0].points[0].interval.end had an invalid value of \"2019-01-14T23:14:16.842-08:00\": Data points cannot be written more than 24h in the past."

I understand that there is a time limit of 24 hours, because you wouldn't want to store historical metric data. However, after trying to parse ~150 entries that are older than the 24-hour limit, the container seems to stop attempting to parse anything younger than that.
You'll find attached the logs that the sidecar outputted.
logs.txt

I guess my question here is, why doesn't the parsing start from the most recent entry, rather than looking at historical data first?
If there is a hard limit of 24 hours, it would make sense to work your way towards that limit and once you reach it, cut off, rather than trying to parse entries that have the potential of being rejected.

Could you clarify this design decision for me and advise me what to do regarding the problem I currently have?

Thank you very much for your time.
Miguel

Getting Counters into SD

Hi,
Is it possible to get Counter-type metrics into SD, high-cardinality or not? I see the Cumulative aggregator, and I could manually create one for each of my individual counters, but that's not exactly ideal.

Prometheus has a Counter:
https://prometheus.io/docs/concepts/metric_types/

Stackdriver has the same concept, only it calls it Cumulative:
https://cloud.google.com/monitoring/api/v3/metrics-details#metric-kinds

I'd like to just map them 1-to-1 without having to update the stackdriver-prometheus-sidecar config every time I add a new one. I've been using Gauges for everything instead, but that feels a little dirty.
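If an explicit per-metric mapping is acceptable, the sidecar's --config-file (listed in the help text above) accepts static_metadata entries with type: counter. A minimal sketch, assuming a hypothetical metric name and config path:

# Write a sidecar config file declaring a counter.
# The metric name, help string, and path are placeholders.
cat > /etc/sidecar/config.yaml <<'EOF'
static_metadata:
  - metric: my_requests_total
    type: counter
    help: total requests served
EOF

# Then start the sidecar with an extra flag:
#   --config-file=/etc/sidecar/config.yaml

This does not remove the need to touch the config for each new counter, so it only partially addresses the question above.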

Thanks!

Test fails caused by github.com/apache/thrift and git.apache.org/thrift.git

Seeing a compilation error when running tests locally with make test:

ycchou@ycchou44:~/src/stackdriver-prometheus-sidecar$ make test
>> running all tests
GO111MODULE=on go test  -mod=vendor ./...
?       github.com/Stackdriver/stackdriver-prometheus-sidecar/bench     [no test files]
compilation error :go: github.com/apache/[email protected] used for two different module paths (git.apache.org/thrift.git and github.com/apache/thrift)
 
FAIL    github.com/Stackdriver/stackdriver-prometheus-sidecar/cmd/stackdriver-prometheus-sidecar        0.252s
ok      github.com/Stackdriver/stackdriver-prometheus-sidecar/metadata  (cached)
ok      github.com/Stackdriver/stackdriver-prometheus-sidecar/retrieval (cached)
ok      github.com/Stackdriver/stackdriver-prometheus-sidecar/stackdriver       (cached)
ok      github.com/Stackdriver/stackdriver-prometheus-sidecar/tail      (cached)
ok      github.com/Stackdriver/stackdriver-prometheus-sidecar/targets   (cached)
FAIL
make: *** [Makefile:74: test] Error 1

It seems like some other code will still use git.apache.org/thrift.git even if we merge #170 to replace git.apache.org with github.com/apache.

Cannot build within GOPATH: build flag -mod=vendor only valid when using modules

Running the build from within a GOPATH fails with:

:stackdriver-prometheus-sidecar (master)$ make docker
>> fetching promu
curl -s -L https://github.com/prometheus/promu/releases/download/v0.5.0/promu-0.5.0.linux-amd64.tar.gz | tar -xvzf - -C /tmp/tmp.zU4GpffJ6u
promu-0.5.0.linux-amd64/
promu-0.5.0.linux-amd64/promu
promu-0.5.0.linux-amd64/NOTICE
promu-0.5.0.linux-amd64/LICENSE
mkdir -p /home/rye/go/bin
cp /tmp/tmp.zU4GpffJ6u/promu-0.5.0.linux-amd64/promu /home/rye/go/bin/promu
rm -r /tmp/tmp.zU4GpffJ6u
>> building linux amd64 binaries
 >   stackdriver-prometheus-sidecar
build flag -mod=vendor only valid when using modules
!! command failed: build -o /home/rye/go/src/github.com/Stackdriver/stackdriver-prometheus-sidecar/stackdriver-prometheus-sidecar -ldflags -X github.com/Stackdriver/stackdriver-prometheus-sidecar/vendor/github.com/prometheus/common/version.Version=HEAD -X github.com/Stackdriver/stackdriver-prometheus-sidecar/vendor/github.com/prometheus/common/version.Revision=872c1b90aa8b2172f7aee766eeba5a3e921849b3 -X github.com/Stackdriver/stackdriver-prometheus-sidecar/vendor/github.com/prometheus/common/version.Branch=master -X github.com/Stackdriver/stackdriver-prometheus-sidecar/vendor/github.com/prometheus/common/version.BuildUser=rye@localhost -X github.com/Stackdriver/stackdriver-prometheus-sidecar/vendor/github.com/prometheus/common/version.BuildDate=20190827-22:07:23  -extldflags '-static' -mod=vendor -a -tags netgo github.com/Stackdriver/stackdriver-prometheus-sidecar/cmd/stackdriver-prometheus-sidecar: exit status 1
make: *** [Makefile:109: build-linux-amd64] Error 1

This appears to happen because GO111MODULE, while set in the Makefile and used for the deps target, is not propagated to the other build steps.
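Until the Makefile exports the variable for every target, one possible workaround (a guess based on the diagnosis above, not a verified fix) is to force module mode for the whole invocation:

# Force module mode for all build steps, not just the deps target.
GO111MODULE=on make docker

# Or export it for the whole shell session:
export GO111MODULE=on
make docker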

Prometheus metrics are shown on Stackdriver graphs with a delay which increases over time

Prometheus version: v2.11.1
Prometheus scrapeInterval: 60s
Sidecar image: 0.4.3

I am using the following script to inject the sidecar into Prometheus:

#!/bin/sh

# ./patch.sh deployment prometheus

export GCP_PROJECT="myproject"
export DATA_DIR="/data"
export GCP_REGION="australia-southeast1-a"
export KUBE_CLUSTER="mycluster"
export DATA_VOLUME="data-volume"
export KUBE_NAMESPACE="mynamespace"
export SIDECAR_IMAGE_TAG="0.4.3"
export API_ADDRESS="http://127.0.0.1:9090"

set -e
set -u

usage() {
  echo -e "Usage: $0 <deployment|statefulset> <name>\n"
}

if [  $# -le 1 ]; then
  usage
  exit 1
fi

# Override to use a different Docker image name for the sidecar.
export SIDECAR_IMAGE_NAME=${SIDECAR_IMAGE_NAME:-'gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar'}

kubectl -n "${KUBE_NAMESPACE}" patch "$1" "$2" --type strategic --patch "
spec:
  template:
    spec:
      containers:
      - name: stackdriver-prometheus-sidecar
        image: ${SIDECAR_IMAGE_NAME}:${SIDECAR_IMAGE_TAG}
        imagePullPolicy: Always
        args:
        - \"--stackdriver.project-id=${GCP_PROJECT}\"
        - \"--prometheus.wal-directory=${DATA_DIR}/wal\"
        - \"--prometheus.api-address=$API_ADDRESS\"
        - \"--stackdriver.kubernetes.location=${GCP_REGION}\"
        - \"--stackdriver.kubernetes.cluster-name=${KUBE_CLUSTER}\"
        - \"--stackdriver.generic.location=${GCP_REGION}\"
        - \"--stackdriver.generic.namespace=${KUBE_NAMESPACE}\"
        - \"--log.level=debug\"
        ports:
        - name: sidecar
          containerPort: 9091
        volumeMounts:
        - name: ${DATA_VOLUME}
          mountPath: ${DATA_DIR}  
"

I have a shared volume between Prometheus and the sidecar, and I can also see the metrics in Stackdriver (image attached below).

[screenshot: metrics visible in Stackdriver]

The metrics start to show up as soon as the sidecar is started; however, as time passes, they arrive with an increasing delay (image attached).

I tried changing the Prometheus scrape interval (both increasing and reducing it), but it didn't solve the problem.
I can see the metrics in the corresponding Grafana, so Prometheus is definitely working as expected.

With debug-level logging enabled, I couldn't see any relevant errors in the sidecar logs:

level=debug ts=2019-07-31T06:27:35.185540233Z caller=client.go:98 component=storage msg="is auth enabled" auth=true url=https://monitoring.googleapis.com:443/
level=debug ts=2019-07-31T06:27:35.186037989Z caller=client.go:98 component=storage msg="is auth enabled" auth=true url=https://monitoring.googleapis.com:443/
level=debug ts=2019-07-31T06:27:35.186576582Z caller=client.go:98 component=storage msg="is auth enabled" auth=true url=https://monitoring.googleapis.com:443/
level=debug ts=2019-07-31T06:27:35.186680101Z caller=client.go:98 component=storage msg="is auth enabled" auth=true url=https://monitoring.googleapis.com:443/
level=debug ts=2019-07-31T06:27:40.058755991Z caller=queue_manager.go:318 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=148.0967934133026 samplesOut=125.71924386807879 samplesOutDuration=1.139403333028691e+09 timePerSample=9.063078157106189e+06 sizeRate=18171.723407587924 offsetRate=253.1304316466789 desiredShards=144.53212821608827
level=debug ts=2019-07-31T06:27:40.058873253Z caller=queue_manager.go:329 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=16.099999999999998 desiredShards=144.53212821608827 upperBound=25.3
level=debug ts=2019-07-31T06:27:40.058951793Z caller=queue_manager.go:349 component=queue_manager msg="Remote storage resharding" from=23 to=145
level=debug ts=2019-07-31T06:27:55.058719537Z caller=queue_manager.go:318 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=118.47743473064209 samplesOut=119.2020617611297 samplesOutDuration=1.1689497355696194e+09 timePerSample=9.806455679534221e+06 sizeRate=22098.072059403672 offsetRate=202.5043453173431 desiredShards=190.17744566184993
level=debug ts=2019-07-31T06:27:55.058836748Z caller=queue_manager.go:329 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=101.5 desiredShards=190.17744566184993 upperBound=159.5
level=debug ts=2019-07-31T06:27:55.058859951Z caller=queue_manager.go:352 component=queue_manager msg="Currently resharding, skipping" to=191

Did I miss something in the configuration?

sidecar restarting with corruption message

Hi, my sidecar is restarting pretty much every day (61 times in 50 days). The logs before the last restart are as follows:

level=info ts=2019-04-22T17:24:42.631925269Z caller=main.go:256 msg="Starting Stackdriver Prometheus sidecar" version="(version=0.4.0, branch=master, revision=e246041acf99c8487e1ac73552fb8625339c64a1)"
level=info ts=2019-04-22T17:24:42.632034632Z caller=main.go:257 build_context="(go=go1.11.4, user=kbuilder@kokoro-gcp-ubuntu-prod-217445279, date=20190221-15:24:24)"
level=info ts=2019-04-22T17:24:42.632058208Z caller=main.go:258 host_details="(Linux 4.14.65+ #1 SMP Thu Oct 25 10:42:50 PDT 2018 x86_64 prometheus-749988b6c5-pv6pm (none))"
level=info ts=2019-04-22T17:24:42.632391616Z caller=main.go:259 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-04-22T17:24:42.641737097Z caller=main.go:463 msg="Web server started"
level=info ts=2019-04-22T17:24:42.643462804Z caller=main.go:444 msg="Stackdriver client started"
level=info ts=2019-04-22T17:25:45.648279157Z caller=manager.go:150 component="Prometheus reader" msg="Starting Prometheus reader..."
level=info ts=2019-04-22T17:27:12.986540252Z caller=manager.go:211 component="Prometheus reader" msg="reached first record after start offset" start_offset=575790334617 skipped_records=239572
level=info ts=2019-04-23T15:08:50.277597295Z caller=manager.go:258 component="Prometheus reader" msg="Done processing WAL."
level=info ts=2019-04-23T15:08:50.410896868Z caller=main.go:426 msg="Prometheus reader stopped"
level=info ts=2019-04-23T15:08:50.425953269Z caller=main.go:432 msg="Stopping Prometheus reader..."
level=info ts=2019-04-23T15:08:50.426098956Z caller=queue_manager.go:233 component=queue_manager msg="Stopping remote storage..."
level=info ts=2019-04-23T15:08:50.456688366Z caller=queue_manager.go:241 component=queue_manager msg="Remote storage stopped."
level=error ts=2019-04-23T15:08:50.500115234Z caller=main.go:480 err="corruption after 16809492480 bytes: unexpected non-zero byte in padded page"
level=info ts=2019-04-23T15:08:50.501049227Z caller=main.go:482 msg="See you next time!"

Any idea why that might be happening? Thanks!

Add filtering mechanism

We know of users of the existing integration that filter collected metrics for cost control.
The GA version should allow for the same kind of controls. Users could filter the data stored by the main Prometheus server, but mutating their existing setup is not a good experience in this case.

The simplest approach is probably a repeated flag that allows setting Prometheus-style label matchers, which must all pass, e.g.

./stackdriver-prometheus-sidecar \
    --filter='job="kubelet"' \
    --filter='__name__!~"cadvisor.*"' \
    ...

Investigate garbage collection failed error, "find last checkpoint: not found"

I'm running the sidecar against Prometheus head and seeing this error. Otherwise it seems to work. @fabxc, if you're busy I can investigate. Do you have any leads?

level=info ts=2018-08-10T22:03:39.372102581Z caller=manager.go:190 component="Prometheus reader" msg="reached first record after start offset" start_offset=0 skipped_records=0
level=error ts=2018-08-10T22:04:39.345471209Z caller=series_cache.go:150 component="Prometheus reader" msg="garbage collection failed" err="find last checkpoint: not found"
[... repeat a few times ...]
level=info ts=2018-08-10T22:25:12.991434466Z caller=compact.go:398 component=tsdb msg="write block" mint=1533938400000 maxt=1533939300000 ulid=01CMJZ34758D06WSPQYAT93VQZ
level=info ts=2018-08-10T22:25:13.005999664Z caller=head.go:446 component=tsdb msg="head GC completed" duration=1.954377ms
level=error ts=2018-08-10T22:25:39.345561556Z caller=series_cache.go:150 component="Prometheus reader" msg="garbage collection failed" err="find last checkpoint: not found"

Error sending metrics with DNS service discovery

Issue

After updating a Prometheus stateful set manifest to support this sidecar, with debug logging enabled and a filter that sends metrics to Stackdriver for a Cassandra cluster only:

  • No data has been sent to Stackdriver yet
  • Sidecar debug logs shows:
level=debug ts=2019-04-04T20:08:08.329734376Z caller=series_cache.go:361 component="Prometheus reader" msg="unknown resource" labels="{instance=\"cassandra-0.cassandra-svc.staging.svc.cluster.local:7090\", job=\"cassandra\"}"
level=debug ts=2019-04-04T20:08:08.329734376Z caller=series_cache.go:361 component="Prometheus reader" msg="unknown resource" labels="{instance=\"cassandra-1.cassandra-svc.staging.svc.cluster.local:7090\", job=\"cassandra\"}"
level=debug ts=2019-04-04T20:08:08.329734376Z caller=series_cache.go:361 component="Prometheus reader" msg="unknown resource" labels="{instance=\"cassandra-2.cassandra-svc.staging.svc.cluster.local:7090\", job=\"cassandra\"}"
...
level=info ts=2019-04-04T22:16:58.428704742Z caller=manager.go:211 component="Prometheus reader" msg="reached first record after start offset" start_offset=101889806 skipped_records=228926

Environment, Images and Names

  • Environment: GKE
  • Prometheus image: 2.6.1
  • stackdriver-prometheus-sidecar image: 0.4.1
  • Prometheus stateful set name: prometheus-dev (one replica only)
  • Pod name: prometheus-dev-0

The Kubernetes cluster has Stackdriver Kubernetes Monitoring (beta) enabled.

Prometheus Stateful Set Manifest

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: prometheus-dev
  name: prometheus-dev
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-dev
  serviceName: prometheus-dev-lb
  template:
    metadata:
      labels:
        app: prometheus-dev
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - prometheus-dev
              topologyKey: failure-domain.beta.kubernetes.io/zone
            weight: 100
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - prometheus-dev
              topologyKey: kubernetes.io/hostname
            weight: 50
      containers:
      - args:
        - --stackdriver.project-id=<redacted>
        - --prometheus.wal-directory=/prometheus/wal
        - --prometheus.api-address=http://127.0.0.1:9090/
        - --stackdriver.kubernetes.location=us-central1
        - --stackdriver.kubernetes.cluster-name=<redacted>
        - --filter=job="cassandra"
        - --log.level=debug
        image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.4.1
        imagePullPolicy: Always
        name: sidecar
        ports:
        - containerPort: 9091
          name: sidecar
          protocol: TCP
        volumeMounts:
        - mountPath: /prometheus
          name: datadir
      - image: prom/prometheus:v2.6.1
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 35
          periodSeconds: 15
          successThreshold: 1
          tcpSocket:
            port: 9090
          timeoutSeconds: 5
        name: prometheus
        ports:
        - containerPort: 9090
          name: web-port
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 9090
          timeoutSeconds: 5
        volumeMounts:
        - mountPath: /prometheus
          name: datadir
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-config-file
          subPath: prometheus.yml
      nodeSelector:
        pool: mon
      restartPolicy: Always
      securityContext:
        fsGroup: 99
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: pool
        operator: Equal
        value: mon
      volumes:
      - configMap:
          defaultMode: 420
          name: prometheus-config
        name: prometheus-config-file
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
      storageClassName: ssd

Prometheus scrape config

      - job_name: cassandra

        scrape_interval: 15s
        scrape_timeout: 10s

        dns_sd_configs:
        - names:
          - _prometheus-jmx._tcp.cassandra-svc

Logs

Looking at the sidecar log output, it looks good initially:

level=info ts=2019-04-04T18:17:33.260259902Z caller=main.go:256 msg="Starting Stackdriver Prometheus sidecar" version="(version=0.4.0, branch=master, revision=3c176b3f5c58e85645e598de5b82f95dca814497)"
level=info ts=2019-04-04T18:17:33.260731139Z caller=main.go:257 build_context="(go=go1.12, user=kbuilder@kokoro-gcp-ubuntu-prod-283836442, date=20190325-14:24:29)"
level=info ts=2019-04-04T18:17:33.260857444Z caller=main.go:258 host_details="(Linux 4.14.65+ #1 SMP Sun Sep 9 02:18:33 PDT 2018 x86_64 prometheus-dev-0 (none))"
level=info ts=2019-04-04T18:17:33.260995428Z caller=main.go:259 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-04-04T18:17:33.265969971Z caller=main.go:463 msg="Web server started"
level=info ts=2019-04-04T18:17:33.277646916Z caller=main.go:444 msg="Stackdriver client started"

but then once Prometheus scrapes and the sidecar attempts to operate (as noted above), there are multiple instances of this unknown resource message (for each cassandra node):

level=debug ts=2019-04-04T20:43:05.287691058Z caller=series_cache.go:361 component="Prometheus reader" msg="unknown resource" labels="{instance=\"cassandra-2.cassandra-svc.staging.svc.cluster.local:7090\", job=\"cassandra\"}"

Notes

  • The cassandra job is good and all Cassandra pods are UP in Prometheus
  • Expected metrics being exported from Cassandra are present in Prometheus
  • In Prometheus UI, as an example Cassandra Label:
instance="cassandra-2.cassandra-svc.staging.svc.cluster.local:7090" job="cassandra"

Questions

(1) Filtering documentation

Per the readme, I originally attempted to use:

--filter='job="cassandra"'

But that caused:

level=error ts=2019-04-04T18:48:53.374995353Z caller=main.go:291 msg="Error parsing filters" err="invalid filter \"'job=\\\"cassandra\\\"'\""

I now have:

--filter=job="cassandra"

which appears to be ok - is this a documentation bug in your readme?

(2) Shared volume

We store Prometheus data in a persistent volume:

        volumeMounts:
        - mountPath: /prometheus
          name: datadir
 volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
      storageClassName: ssd

And so for the sidecar, added:

       volumeMounts:
        - mountPath: /prometheus
          name: datadir

Does this setup look right? I assume so, as the logs didn't complain, but I wanted to make sure this meets the sidecar's requirements for sharing the Prometheus data volume, as noted here.

(3) Getting data into Stackdriver

I'm guessing the unknown resource log messages noted above are the reason that no data is getting sent by the sidecar to Stackdriver. Any help is much appreciated.

Can't use stackdriver container

Hi,

I have been trying to make this work for a while now and I've come to a few conclusions.

I can't run any of the containers present in your public registry. Both locally and inside the clusters, attempting to run your container results in this:
standard_init_linux.go:190: exec user process caused "permission denied"

If I go and make the image myself, using make docker in the root directory, the generated container no longer has the error and I can start it up.

However, running your container from the gcloud console works, and I'm able to get that one up straight away.

Docker info for local:

Containers: 56
 Running: 0
 Paused: 0
 Stopped: 56
Images: 1173
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version:  (expected: 468a545b9edcd5932818eb9de8e72413e616e86e)
runc version: N/A (expected: 69663f0bd4b60df09991c08812a60108003fa340)
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-43-generic
Operating System: Ubuntu 18.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 31.25GiB
Name: xxx
ID: HMKY:IVL6:OCB6:3OU5:PDDK:2XB7:2DJX:Q35N:D6DL:Z4ZB:JZBW:6DLF
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Docker info for gcp console:

Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 1
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.14.74+
Operating System: Debian GNU/Linux 9 (stretch) (containerized)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.656GiB
Name: cs-6000-devshell-vm-382718db-b6dd-4808-aaec-daaec04bd062
ID: 74SW:2GLN:6MQ2:IKIY:IGYQ:ZRRP:BSX3:JHHR:ALJO:LD3H:DLKH:SGI5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 https://eu-mirror.gcr.io/
Live Restore Enabled: false

The ideal scenario would be being able to simply pull your container and run it.
Have you seen this issue before?

Thanks.

some metrics not being updated

Hi, I have configured stackdriver-prometheus sidecar with the following args:

args:
- --stackdriver.project-id=<redacted>
- --prometheus.wal-directory=/prometheus/wal
- --stackdriver.kubernetes.location=us-central1
- --stackdriver.kubernetes.cluster-name=<redacted>
- --filter=stackdriver_export="true" 

For the purposes of testing, I've labelled two metrics with 'stackdriver_export="true"'.

external.googleapis.com/prometheus/kube_pod_container_status_waiting_reason
external.googleapis.com/prometheus/node_load1

If I query Prometheus, I can see time series data like:

kube_pod_container_status_waiting_reason{container="example-app",endpoint="http",instance="10.20.3.12:8080",job="kube-state-metrics",namespace="default",pod="example-app-b7fbf9fd9-8fnm8",reason="ImagePullBackOff",service="prom-kube-state-metrics",stackdriver_export="true"}	1

^ there are metrics like these for every pod.

But when I query Stackdriver directly, even though the metric type itself has been correctly created, I only see metrics for the kube-state-metrics "pods", and the labels seem to indicate a "pod_name" that doesn't even exist. There are also only a few of them, and none of them indicate pods that are in a waiting state, even though I have forced some into such a state and can see the metrics in Prometheus.
example:

Metric: external.googleapis.com/prometheus/kube_pod_container_status_waiting_reason
Label: container=kube-state-metrics
Label: reason=ImagePullBackOff
Label: stackdriver_export=true
Resource: k8s_container
Label: container_name=kube-state-metrics
Label: namespace_name=monitor
Label: location=us-central1
Label: project_id=<redacted>
Label: pod_name=prom-op-kube-state-metrics-76786cc9b4-dgph9
Label: cluster_name=<redacted>
Point: [1554410721-1554410721] = 0
Point: [1554410661-1554410661] = 0
Point: [1554410601-1554410601] = 0
Point: [1554410541-1554410541] = 0
Point: [1554410481-1554410481] = 0
Point: [1554410421-1554410421] = 0
Point: [1554410361-1554410361] = 0
Point: [1554410301-1554410301] = 0
Point: [1554410241-1554410241] = 0
Point: [1554410181-1554410181] = 0
Point: [1554410121-1554410121] = 0
Point: [1554410061-1554410061] = 0
Point: [1554410001-1554410001] = 0
Point: [1554409941-1554409941] = 0
Point: [1554409881-1554409881] = 0
Point: [1554409821-1554409821] = 0
Point: [1554409761-1554409761] = 0
Point: [1554409701-1554409701] = 0
Point: [1554409641-1554409641] = 0
Point: [1554409581-1554409581] = 0
Point: [1554409521-1554409521] = 0
Point: [1554409461-1554409461] = 0
Point: [1554409401-1554409401] = 0
Point: [1554409341-1554409341] = 0
Point: [1554409281-1554409281] = 0
Point: [1554409221-1554409221] = 0
Point: [1554409161-1554409161] = 0
Point: [1554409101-1554409101] = 0
Point: [1554409041-1554409041] = 0
Point: [1554408981-1554408981] = 0
Point: [1554408921-1554408921] = 0
Point: [1554408861-1554408861] = 0
Point: [1554408801-1554408801] = 0
Point: [1554408741-1554408741] = 0
Point: [1554408681-1554408681] = 0
Point: [1554408621-1554408621] = 0
Point: [1554408561-1554408561] = 0
Point: [1554408501-1554408501] = 0
Point: [1554408441-1554408441] = 0
Point: [1554408381-1554408381] = 0
Point: [1554408321-1554408321] = 0
Point: [1554408261-1554408261] = 0
Point: [1554408201-1554408201] = 0
Point: [1554408141-1554408141] = 0
Point: [1554408081-1554408081] = 0
Point: [1554408021-1554408021] = 0
Point: [1554407961-1554407961] = 0
Point: [1554407901-1554407901] = 0
Point: [1554407841-1554407841] = 0
Point: [1554407781-1554407781] = 0
Point: [1554407721-1554407721] = 0
Point: [1554407661-1554407661] = 0
Point: [1554407601-1554407601] = 0
Point: [1554407541-1554407541] = 0
Point: [1554407481-1554407481] = 0
Point: [1554407421-1554407421] = 0
Point: [1554407361-1554407361] = 0
Point: [1554407301-1554407301] = 0
Point: [1554407241-1554407241] = 0
Point: [1554407181-1554407181] = 0

I feel like I'm running into some kind of label-rewrite issue, maybe? Wondering if anyone can shed some light on why the Prometheus metrics are either not being properly exported or not being properly ingested. There are no log messages from the sidecar itself to indicate a problem.

Project fails to build due to https://git.apache.org/ being down

make
>> formatting code
GO111MODULE=on go fmt ./...
go: finding git.apache.org/thrift.git v0.0.0-20180902110319-2566ecd5d999
go: git.apache.org/[email protected]: git fetch -f https://git.apache.org/thrift.git refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /home/mans0954/go/pkg/mod/cache/vcs/83dba939f95a790e497d565fc4418400145a1a514f955fa052f662d56e920c3e: exit status 128:
	fatal: unable to access 'https://git.apache.org/thrift.git/': Failed to connect to git.apache.org port 443: Connection timed out
go: error loading module requirements
Makefile:81: recipe for target 'format' failed
make: *** [format] Error 1

I appreciate that the unavailability of git.apache.org is outside of this project's control, but Apache seems to be moving towards using GitHub for VCS [1], and the installation instructions for thrift now refer to the GitHub URL [2], so it may be advisable to add a replace statement to go.mod (see the sketch after the references below).

[1] https://blogs.apache.org/foundation/entry/the-apache-software-foundation-expands
[2] https://github.com/apache/thrift/tree/master/lib/go#using-thrift-with-go
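A sketch of such a replace directive, added via go mod edit; the thrift version is an assumption and should match whatever go.mod currently requires:

# Redirect the dead VCS host to the GitHub mirror, then refresh the vendor tree.
go mod edit -replace git.apache.org/thrift.git=github.com/apache/[email protected]
go mod tidy
go mod vendor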

Warning: time series could not be written

I have installed prometheus-operator and have injected the sidecar. I can see metrics being written to Stackdriver, but I am also seeing this warning:

level=warn ts=2019-02-03T21:34:23.166006593Z caller=queue_manager.go:546 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Metric kind for metric external.googleapis.com/prometheus/go_gc_duration_seconds_sum must be GAUGE, but is CUMULATIVE.: timeSeries[10]"

Is this an issue with the application generating the metric?
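The error implies that a metric descriptor for external.googleapis.com/prometheus/go_gc_duration_seconds_sum already exists with kind GAUGE (possibly created by an earlier collector), while the sidecar now writes it as CUMULATIVE, so it is not necessarily the application's fault. One way out, assuming losing the data already stored under that descriptor is acceptable, is to delete the descriptor so it gets re-created with the new kind. A sketch against the Monitoring API (PROJECT_ID is a placeholder):

# Delete the mismatched descriptor so it can be re-created as CUMULATIVE.
# WARNING: this also deletes the time series data stored under that metric type.
curl -X DELETE \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors/external.googleapis.com/prometheus/go_gc_duration_seconds_sum"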

Add Google Application Default Credentials to kube setup

The "kube" test environment runs Prometheus and the sidecar in Kubernetes. Prometheus server will need ADC to scrape targets in GCE (we will need something similar for AWS). The sidecar will need ADC for sending data to Stackdriver if running in Kubernetes outside GKE, but this is out of scope for this issue, as we currently only run the sidecar in GKE for testing.

Instructions:

Example for GCE: b75cbaf
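For reference, a common way to provide ADC to pods running outside GKE (and to a Prometheus server that needs to scrape GCE targets) is to mount a service-account key and point GOOGLE_APPLICATION_CREDENTIALS at it. A rough sketch; the account email, namespace, and secret name are placeholders:

# Create a key for an existing service account and store it as a Kubernetes secret.
gcloud iam service-accounts keys create key.json --iam-account=SA_EMAIL
kubectl -n monitoring create secret generic google-application-credentials \
  --from-file=key.json

# In the pod spec, mount the secret (e.g. at /etc/gcp) and set
#   GOOGLE_APPLICATION_CREDENTIALS=/etc/gcp/key.json
# on the container that needs the credentials.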

Sidecar API retries

Hi friends,

I've tested the sidecar successfully for 24 hours.
There are a couple of takeaways from this experience.

The implementation I was attempting was at a fairly large scale: there are 4 instances of Prometheus per cluster, with 6 clusters in total. This gives us a total of 24 active sidecars.

In 24 hours, these sidecars were responsible for a whopping 82,099,610 API calls.
This is an absolutely unacceptable amount.
Assuming the calls are spread evenly between instances, this translates into roughly 3,420,817 calls per sidecar per day.

The aggregated cost of these calls (and only the API calls) was £820.99.

One could argue that reducing the amount of unused incoming metrics would be an efficient way to reduce cost (and one would be right); however, some of the retrying done by the sidecar is absolutely unreasonable.

Example 1:
Metric name checks are made by sending calls to the API and then checking the error response.
The regex for valid names is available in the Google docs, so there is absolutely no reason why the checks can't be done locally, saving everyone using your software some cash.

Example 2:
Tremendously aggressive retrying. If we weren't able to upload a metric, the result is not likely to change suddenly; my suggestion here would be to add exponential backoff: 1 sec before the 1st retry, 2 sec before the 2nd, 4 sec before the 3rd, 8 sec before the 4th, etc. (up to a certain point where it would cap).

Example 3:
GCP imposes a restriction of one write per minute for a given time series, which means that if exporters are scraped 3 times per minute (20 seconds between scrapes is not that uncommon) and report 25 metrics, Stackdriver will receive 25 valid API calls and 50 invalid ones. If you extrapolate to the full magnitude of metrics, this becomes absolutely unbearable and a waste of resources.

As an example of what I'm talking about here, here are the API statistics for the 24 hours:

API: Stackdriver Monitoring API
Number of requests: 82,099,610
% of Error API CALLS: 89

This means that out of the 82 million calls in 1 day, 9 million were actually successful.

I intend to help fix some of these problems when I get a chance to contribute to your project; I just thought I'd raise them so that you are aware, or at least so that you can put a warning in this project's documentation so that people can do their research on pricing and on the metrics they have before deploying it.

Thank you for your time and effort.
Miguel
