
xm-labs-prometheus's Introduction

Prometheus AlertManager

Prometheus is a powerful, open-source monitoring solution. This integration with xMatters extends the alerting capabilities of AlertManager to notify the right people at the right time.


An updated version of this integration is available. You can install the new one-way version right from the Workflow Templates directory within your xMatters instance. Learn more.


Prerequisites

  • Prometheus with AlertManager set up and running.
  • An application to monitor.
  • An xMatters account. If you don't have one, get one!
  • An xMatters Agent, or an open port to AlertManager that xMatters can access from the cloud.

Files

  • Prometheus.zip - Workflow for the integration builder script and notification form templates.

How it works

Alerting rules are defined in Prometheus and sent to AlertManager for further processing. The AlertManager config file defines what happens after the alerts are sent to AlertManager. A webhook points to an HTTP trigger in xMatters. Once the alert reaches xMatters, the integration builder script transforms the content, sets the recipient to the receiver name, and creates the event.
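
The transformation step can be pictured roughly as follows. This is a Python sketch for illustration only, not the script that ships in Prometheus.zip: the function name build_event and the output field names (recipients, properties, firing_count) are assumptions, while the input fields follow the AlertManager webhook payload shown in the issues further down.

```python
def build_event(payload):
    """Turn an AlertManager webhook payload into a simplified event dict.

    Hypothetical helper: the real integration script may name and shape
    its output differently.
    """
    alerts = payload.get("alerts", [])
    firing = [a for a in alerts if a.get("status") == "firing"]
    labels = payload.get("commonLabels", {})
    return {
        # The receiver name from alertmanager.yml becomes the recipient.
        "recipients": [payload["receiver"]],
        "properties": {
            "status": payload.get("status", ""),
            "alertname": labels.get("alertname", ""),
            "firing_count": len(firing),
            "annotations": payload.get("commonAnnotations", {}),
        },
    }


sample = {
    "receiver": "Database",
    "status": "firing",
    "alerts": [{"status": "firing", "labels": {"alertname": "octo_alert"}}],
    "commonLabels": {"alertname": "octo_alert", "service": "octoapp"},
    "commonAnnotations": {"summary": "The summary goes here"},
}
event = build_event(sample)
```

Here event["recipients"] holds the receiver name, which is why the receiver in alertmanager.yml has to match a group or user that exists in xMatters.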

Installation

xMatters set up

  1. Log in to the xMatters UI and navigate to the Workflows page.
  2. Click Import Workflow and select the Prometheus.zip file.
  3. Update the Alert Manager Endpoint to the address of your AlertManager, e.g. http://localhost:9093/api/v2/
  4. Edit the Run Location of the two silence steps to point to an agent, or to the open port for AlertManager.

Steps:

Run Location:

Prometheus set up

  1. Open the alertmanager.yml file and navigate to the receivers section. The location of the file and the section will depend on the details of the installation.
  2. Add a new receiver. The receiver name is used as the recipient of the event. The webhook URL is found in the Inbound from Alertmanager step in your xMatters workflow. For example, to target the Database group:
- name: 'Database'
  webhook_configs:
    - url: 'https://acme.xmatters.com/api/integration/1/functions/UUID/triggers?apiKey=KEY'

Note: Storing the API key in the URL makes it visible in the UI. To keep the API key out of the UI, use http_config with basic authentication to connect to xMatters.

  3. Edit the route that should target the new receiver. For example, to notify this Database receiver for the octoapp service:
  routes:
  - match_re:
      service: ^(octoapp)$
    receiver: Database
  4. Repeat as needed for new routes and new receivers.

  5. Edit any alert rules (referenced in the file(s) defined in the rule_files section of the prometheus.yml file) to include a priority annotation, or any additional fields required for processing. For example:

groups:
- name: alert.rules
  rules:
  - alert: octo_alert
    expr: some_gauge > 30
    for: 20s
    labels:
      service: octoapp
      severity: page_octo
    annotations:
      description: The description goes here
      summary: The summary goes here
      recipient: bob

The fields inside the annotations section are placed in the annotation_contents output.

Include an annotation called recipient so that xMatters knows who to alert.

Testing

Create or edit an alert rule in the alert rules file (defined in the prometheus.yml file) that is easy to fire. For example, to fire when some_gauge is greater than 30 for 20 seconds:

groups:
- name: alert.rules
  rules:
  - alert: octo_alert
    expr: some_gauge > 30
    for: 20s
    labels:
      service: octoapp
      severity: page_octo
    annotations:
      description: The description goes here
      summary: The summary goes here
      recipient: bob

Then, in the monitored application, keep the some_gauge value above 30 for at least 20 seconds. This triggers an alert in AlertManager, which is then sent to xMatters. Make sure you have a Database group with at least one user.

A notification will be sent out targeting the Database group.
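
To exercise the xMatters side without waiting for the rule to fire, one option is to post a hand-built payload to the HTTP trigger. The sketch below rests on several assumptions: the payload imitates the version 4 webhook shape that appears in the issues further down, the URL is the placeholder from the receiver example, and build_test_request is a made-up helper name. It only builds the request; sending it is left as a commented-out line, and whether the trigger accepts this exact body depends on your workflow.

```python
import json
import urllib.request

# Placeholder trigger URL copied from the receiver example; replace UUID and KEY.
TRIGGER_URL = (
    "https://acme.xmatters.com/api/integration/1/functions/UUID/triggers?apiKey=KEY"
)


def build_test_request(url, receiver="Database"):
    """Build (but do not send) a POST that imitates an AlertManager webhook call."""
    alert = {
        "status": "firing",
        "labels": {
            "alertname": "octo_alert",
            "service": "octoapp",
            "severity": "page_octo",
        },
        "annotations": {
            "description": "The description goes here",
            "summary": "The summary goes here",
            "recipient": "bob",
        },
    }
    payload = {
        "receiver": receiver,
        "status": "firing",
        "alerts": [alert],
        "commonLabels": alert["labels"],
        "commonAnnotations": alert["annotations"],
        "version": "4",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_request(TRIGGER_URL)
# To actually send it: urllib.request.urlopen(request)
```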

Troubleshooting

Check the AlertManager log (its location depends on your installation) for any errors making the call to xMatters. Then check the Activity Stream in the Inbound from Prometheus section for errors.

Make sure a recipient annotation is set in the alert rule that is triggered.

Check that the HTTP trigger in xMatters is associated with the receiver in your alertmanager.yml file.

Example

This is the example flow provided in the Prometheus.zip Workflow file.

xm-labs-prometheus's People

Contributors

castlexm, ipugh-xm, reidan, xmsteele, xmtinkerer


xm-labs-prometheus's Issues

Inbound from Prometheus issue

How can I update the inbound integration script for Prometheus so that it works well with "group_by" alerts?

My body:

{
    "receiver": "xmatter-alert",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "ProbeSSLCertExpiryCritical",
                "env": "prod-gp2-om",
                "environment": "prometheus-master",
                "instance": "https://api.net/v2/info",
                "service": "probe",
                "severity": "critical"
            },
            "annotations": {
                "description": "The SSL certificate at endpoint `https://api.net/v2/info` will expire in 11d 20h 33m 53s",
                "summary": "Endpoint `https://api.net/v2/info` SSL certificate will expire in 11d 20h 33m 53s"
            },
            "startsAt": "2019-10-23T15:26:06.441177639Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "https://prometheus.net/graph?g0.expr=avg+by%28instance%2C+env%29+%28probe_ssl_earliest_cert_expiry%29+-+time%28%29+%3C+1.8144e%2B06&g0.tab=1",
            "fingerprint": "a5485df3cf855aef"
        },
        {
            "status": "firing",
            "labels": {
                "alertname": "ProbeSSLCertExpiryCritical",
                "env": "prod-gp2-om",
                "environment": "prometheus-master",
                "instance": "https://login.net/login",
                "service": "probe",
                "severity": "critical"
            },
            "annotations": {
                "description": "The SSL certificate at endpoint `https://login.net/login` will expire in 11d 20h 33m 53s",
                "summary": "Endpoint `https://login.net/login` SSL certificate will expire in 11d 20h 33m 53s"
            },
            "startsAt": "2019-10-23T15:26:06.441177639Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "https://prometheus.net/graph?g0.expr=avg+by%28instance%2C+env%29+%28probe_ssl_earliest_cert_expiry%29+-+time%28%29+%3C+1.8144e%2B06&g0.tab=1",
            "fingerprint": "a3a03dc78143ad60"
        },
        {
            "status": "firing",
            "labels": {
                "alertname": "ProbeSSLCertExpiryCritical",
                "env": "prod-gp2-om",
                "environment": "prometheus-master",
                "instance": "https://uaa.net/login",
                "service": "probe",
                "severity": "critical"
            },
            "annotations": {
                "description": "The SSL certificate at endpoint `https://uaa.net/login` will expire in 11d 20h 33m 53s",
                "summary": "Endpoint `https://uaa.net/login` SSL certificate will expire in 11d 20h 33m 53s"
            },
            "startsAt": "2019-10-23T15:26:06.441177639Z",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "https://prometheus.net/graph?g0.expr=avg+by%28instance%2C+env%29+%28probe_ssl_earliest_cert_expiry%29+-+time%28%29+%3C+1.8144e%2B06&g0.tab=1",
            "fingerprint": "c01b393165680508"
        }
    ],
    "groupLabels": {
        "alertname": "ProbeSSLCertExpiryCritical"
    },
    "commonLabels": {
        "alertname": "ProbeSSLCertExpiryCritical",
        "env": "prod-gp2-om",
        "environment": "prometheus-master",
        "service": "probe",
        "severity": "critical"
    },
    "commonAnnotations": {},
    "externalURL": "https://alertmanager.net",
    "version": "4",
    "groupKey": "{}/{severity=\"critical\"}:{alertname=\"ProbeSSLCertExpiryCritical\"}"
}

What I get in the inbox:
[Image: Screenshot from 2019-10-23 18-50-58]

I want to get the annotations for all alerts, like:
[Image: 11111111111111111111]
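
One way to approach this is to loop over the alerts array instead of reading only commonAnnotations, which is empty when the grouped alerts carry different annotation values (as in the body above). The Python sketch below illustrates that logic only; it is not the integration's script, collect_annotations is a hypothetical helper, and the sample annotation texts are shortened stand-ins.

```python
def collect_annotations(payload):
    """Return one text line per annotation of every alert in the group."""
    lines = []
    for index, alert in enumerate(payload.get("alerts", []), start=1):
        instance = alert.get("labels", {}).get("instance", "unknown")
        # Sort the keys so the output order is stable between runs.
        for key, value in sorted(alert.get("annotations", {}).items()):
            lines.append(f"[{index}] {instance} {key}: {value}")
    return "\n".join(lines)


sample = {
    "alerts": [
        {
            "labels": {"instance": "https://api.net/v2/info"},
            "annotations": {"summary": "certificate expires soon"},
        },
        {
            "labels": {"instance": "https://login.net/login"},
            "annotations": {"summary": "certificate expires soon"},
        },
    ],
    "commonAnnotations": {},
}
text = collect_annotations(sample)
```

The resulting text could then be placed in a single event property so that the notification lists every endpoint in the group.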

How to make 2 events from one body

I got this body from Prometheus:

{
    "receiver": "xmatters-receiver",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "BOSHJobEphemeralDiskFull",
                "bosh_deployment": "concourse",
                "bosh_job_index": "0",
                "bosh_job_name": "worker-pcf-backup",
                "bosh_name": "gp2-dev-infraservices",
                "env": "dev-gp2",
                "environment": "dev-gp2",
                "service": "bosh-job",
                "severity": "critical"
            },
            "annotations": {
                "description": "BOSH Job `dev-gp2/gp2-dev-infraservices/concourse/worker-pcf-backup/0` has used more than 80% of its ephemeral disk for 30ms: 90%",
                "summary": "BOSH Job `dev-gp2/gp2-dev-infraservices/concourse/worker-pcf-backup/0` is running out of ephemeral disk"
            },
            "startsAt": "2018-11-16T09:02:16.538555134Z",
            "endsAt": "2018-11-17T09:31:16.538555134Z",
            "generatorURL": "https://prometheus-dev.gp2.axadmin.net/graph?g0.expr=avg+by%28environment%2C+bosh_name%2C+bosh_deployment%2C+bosh_job_name%2C+bosh_job_index%29+%28bosh_job_ephemeral_disk_percent%7Bbosh_deployment%21%3D%22bosh-health-check%22%2Cbosh_job_name%21~%22%5Ecompilation.%2A%22%7D%29+%3E+80\u0026g0.tab=1"
        },
        {
            "status": "firing",
            "labels": {
                "alertname": "BOSHJobEphemeralDiskFull",
                "bosh_deployment": "shield-v8",
                "bosh_job_index": "0",
                "bosh_job_name": "dedicated_backupnode",
                "bosh_name": "gp2-dev-infraservices",
                "env": "dev-gp2",
                "environment": "dev-gp2",
                "service": "bosh-job",
                "severity": "critical"
            },
            "annotations": {
                "description": "BOSH Job `dev-gp2/gp2-dev-infraservices/shield-v8/dedicated_backupnode/0` has used more than 80% of its ephemeral disk for 30ms: 96%",
                "summary": "BOSH Job `dev-gp2/gp2-dev-infraservices/shield-v8/dedicated_backupnode/0` is running out of ephemeral disk"
            },
            "startsAt": "2018-11-15T23:02:16.538555134Z",
            "endsAt": "2018-11-17T09:31:16.538555134Z",
            "generatorURL": "https://prometheus-dev.gp2.axadmin.net/graph?g0.expr=avg+by%28environment%2C+bosh_name%2C+bosh_deployment%2C+bosh_job_name%2C+bosh_job_index%29+%28bosh_job_ephemeral_disk_percent%7Bbosh_deployment%21%3D%22bosh-health-check%22%2Cbosh_job_name%21~%22%5Ecompilation.%2A%22%7D%29+%3E+80\u0026g0.tab=1"
        }
    ],
    "groupLabels": {},
    "commonLabels": {
        "alertname": "BOSHJobEphemeralDiskFull",
        "bosh_job_index": "0",
        "bosh_name": "gp2-dev-infraservices",
        "env": "dev-gp2",
        "environment": "dev-gp2",
        "service": "bosh-job",
        "severity": "critical"
    },
    "commonAnnotations": {},
    "externalURL": "https://alertmanager-dev.gp2.axadmin.net",
    "version": "4",
    "groupKey": "{}/{alertname=~\"^(?:BOSHJobEphemeralDiskFull)$\"}:{}"
}

and xMatters creates only one event for it, with:

Prometheus Alert: 2 firing for
[FIRING:2] BOSHJobEphemeralDiskFull bosh-job(alertname = BOSHJobEphemeralDiskFull bosh_deployment = concourse bosh_job_index = 0 bosh_job_name = worker-pcf-backup bosh_name = gp2-dev-infraservices env = dev-gp2 environment = dev-gp2 service = bosh-job severity = critical )

How can I create a separate event for each alert?
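
A possible approach, sketched in Python rather than in the integration's own script, is to split the grouped body into one single-alert body per entry in alerts and create an event from each. The helper name split_payload is hypothetical, and copying each alert's labels and annotations into commonLabels and commonAnnotations is an assumption about what the downstream script reads.

```python
def split_payload(payload):
    """Return one single-alert payload per alert in a grouped webhook body."""
    shared = {key: value for key, value in payload.items() if key != "alerts"}
    singles = []
    for alert in payload.get("alerts", []):
        single = dict(shared)
        single["alerts"] = [alert]
        single["status"] = alert.get("status", payload.get("status"))
        # With a single alert, its own labels and annotations are the common ones.
        single["commonLabels"] = dict(alert.get("labels", {}))
        single["commonAnnotations"] = dict(alert.get("annotations", {}))
        singles.append(single)
    return singles


grouped = {
    "receiver": "xmatters-receiver",
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"bosh_deployment": "concourse"}},
        {"status": "firing", "labels": {"bosh_deployment": "shield-v8"}},
    ],
    "commonLabels": {"alertname": "BOSHJobEphemeralDiskFull"},
}
parts = split_payload(grouped)
```

On the AlertManager side, grouping can also be made finer by adding more labels to group_by in the route, so that fewer alerts share one notification in the first place.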

Support auto-resolving of alerts

Currently, enabling the send_resolved flag just creates a new alert with RESOLVED in the title. It would be nice if this resolved an existing alert.

Deduplicate alerts

Hey,

Currently, each alert (or set of alerts) creates a new notification. It would be nice if duplicates didn't create a new alert every time. Maybe a UUID hash of the groupKey?
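
The groupKey idea can be sketched as follows, and the same key also covers the auto-resolve request above. This is a Python illustration under stated assumptions, not a feature of the integration: correlation_id and decide_action are hypothetical helpers, the open_events dictionary stands in for whatever store tracks active events, and the action names are made up.

```python
import uuid


def correlation_id(payload):
    """Derive a stable UUID from the AlertManager groupKey."""
    # uuid5 is deterministic: the same groupKey always yields the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, payload["groupKey"]))


def decide_action(payload, open_events):
    """Decide what to do with an incoming webhook, given the open events."""
    key = correlation_id(payload)
    if payload.get("status") == "resolved":
        if key in open_events:
            del open_events[key]
            return "terminate", key
        return "ignore", key
    if key in open_events:
        return "suppress", key
    open_events[key] = True
    return "create", key


open_events = {}
firing = {"status": "firing", "groupKey": '{}:{alertname="InstanceUp"}'}
resolved = {"status": "resolved", "groupKey": '{}:{alertname="InstanceUp"}'}
actions = [
    decide_action(firing, open_events)[0],
    decide_action(firing, open_events)[0],
    decide_action(resolved, open_events)[0],
    decide_action(resolved, open_events)[0],
]
```

Because the key depends only on groupKey, a repeated firing notification and the later resolved notification for the same group map to the same identifier.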

My version of AlertManager is sending far different JSON to xMatters than is presumed from the current instructions

Hi, I was wondering if anyone knows whether there was a major change in what gets sent by AlertManager. I followed the documentation and imported the Prometheus plan from this GitHub project, but what gets sent from AlertManager is vastly different (using Prom 2.12.0 and AlertMgr 0.19.0). The docs didn't suggest that any real additions to fields had to be made to at least get the base to work. Using the curl example from xMatters allows me to create alerts just fine.

This is what I'm seeing sent from AlertManager when intercepting it with a request bin. I can make this work by essentially rewriting the integration from scratch, but I'm curious whether I'm just missing something here.

Received from AlertManager:

{
    "receiver": "xmatters",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "InstanceUp",
                "instance": "localhost:9090",
                "job": "prometheus",
                "service": "Check Prometheus"
            },
            "annotations": {
                "description": "The Node exporter service on the Prometheus server is running... ",
                "summary": "The test node has come up"
            },
            "startsAt": "2019-09-11T17:08:45.277203128-05:00",
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://fermi:9090/graph?g0.expr=up+%3D%3D+1&g0.tab=1",
            "fingerprint": "ad34903b6ade0da2"
        }
    ],
    "groupLabels": {
        "alertname": "InstanceUp",
        "instance": "localhost:9090",
        "job": "prometheus",
        "service": "Check Prometheus"
    },
    "commonLabels": {
        "alertname": "InstanceUp",
        "instance": "localhost:9090",
        "job": "prometheus",
        "service": "Check Prometheus"
    },
    "commonAnnotations": {
        "description": "The Node exporter service on the Prometheus server is running... ",
        "summary": "The test node has come up"
    },
    "externalURL": "http://fermi:9093",
    "version": "4",
    "groupKey": "{}:{alertname=\"InstanceUp\", instance=\"localhost:9090\", job=\"prometheus\", service=\"Check Prometheus\"}"
}

When that gets sent to xMatters, I get a code 400 error and it does not show up as an alert.
