stefanprodan / swarmprom Goto Github PK

View Code? Open in Web Editor NEW

1.9K 66.0 734.0 5.79 MB

Docker Swarm instrumentation with Prometheus, Grafana, cAdvisor, Node Exporter and Alert Manager

License: MIT License

Shell 57.92% Dockerfile 42.08%

docker swarm prometheus grafana cadvisor traefik

swarmprom's Introduction

Stefan Prodan's Blog

Made with Material for MkDocs

swarmprom's People

Contributors

Stargazers

Watchers

Forkers

mewzherder cmendesce mattorb kharloss falmar mgd1981 bxtp4p ajeetraina rajivece patric69 miqui lincolnhedgehog koenraadm rajboruah longxuanho pmcpsantana ajaegle santhosh13nov ursforrer mittyok javiervivanco edmundkwok eeddaann satishsverma kozharsky kevin71020 ssl2017 loversama rms1000watt veritone igorkatz ahromis tarasinf mario21ic ehurmuzov ivanyinusa war-labs defcyy zhaokai021 tblazz digital-stoic dunguyenn masterxavierfox christiankniep abhisheks-cuelogic gnulux intergral eyolas derytim jvigneux yholkamp stretchcloud raymondmouthaan quynhdang-vt jmaitrehenry hastarin junoteam kuiche1982 jankatins mpetyx silverstory siso mikewolfd circleofnice mosunday opera443399 orubel shoptagr fernandobsb danielpalstra lirany1 wagnerm dariusj18 vbsinterestingstuff kiddo3 mishamx kraunikumar st2labs ycyr sangkyunyoon openbankingresearch zeroc0d3 ptsiampas swift1911 wwwaheb eagafonov deepsonune semantixinfinitepossibilities vvoloshchak tperelle danielleparisien terasaka jordiromancastells iamjagan durga61 develar zironycho norsig svendowideit onaci

swarmprom's Issues

How to get the domain name for instances at prometheus dns_sd_configs configuration

At the Part of Prometheus service discovery, the names configured for DNS discovery are formed as tasks.<servicename>:

scrape_configs:
  - job_name: 'node-exporter'
    dns_sd_configs:
    - names:
      - 'tasks.node-exporter'
      type: 'A'
      port: 9100

The names should be <domain_name>. I have no idea of how does the form of tasks.<servicename> come from. Is it comes from your DNS configuration or docker swarm mode discovery?

Swarm services dashboard is not showing services running on the manager node.

Hi,

My cluster has 2 nodes, 1 manager and 1 worker.

In the swarm node dashboard I can see details for all the nodes (except for CPU usage for both the nodes, is it normal?)

In the swarm services dashboard, I 'm only seeing details from my worker node. When I explicitly select the master node, I don't see anything. As if it's not reading anything from my master.

Instance down

Have you ever tried creating a rule like if the node went down then it will throw an alert?

Swarm service dashboard show only Manager node service

I have a 4 node swarm and the service dashboard show only the service from the manager.
Also, it say I have only 1 node.
But if I go to the node dashboard I can see all my 4 nodes.

node-exporter doesn't capture network traffic

In the current stack the node-exporter services cannot capture the network traffic stats since they aren't attached to the host network.

If one does switch to use the host network then it works fine again but Prometheus cannot discover the exporters anymore.

Is there a way to support both discover and host networking or i have to choose between the two features when using this stack?

templating error when I log in ...

when I log into :3000 at first I get a templating error ...

Templating init failed
[object Object]

api/datasources/proxy/1/api/v1/query_range?query=sum(irate(node_cpu%7Bmode%3D%22idle%22%7D%5B30s%5D)%20*%20on(instance)%20group_left(node_name)%20node_meta%7Bnode_id%3D~%22.%2B%22%7D)%20*%20100%20%2F%20count_scalar(node_cpu%7Bmode%3D%22user%22%7D%20*%20on(instance)%20group_left(node_name)%20node_meta%7Bnode_id%3D~%22.%2B%22%7D)%20&start=1512715040&end=1512715100&step=1

Docker stack deploy settings are ignored on Docker CE v17.12

All values in this command will be ignored:

ADMIN_USER=admin \
ADMIN_PASSWORD=admin \
SLACK_URL=https://hooks.slack.com/services/TOKEN \
SLACK_CHANNEL=devops-alerts \
SLACK_USER=alertmanager \
docker stack deploy -c docker-compose.yml mon

Ref. e.g. "The same effect occurs without the env_file: .env line, or with "$FOOVAR" in the actual command.

Tested on this docker:

Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built: Wed Dec 27 20:11:19 2017
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:09:53 2017
  OS/Arch:      linux/amd64
  Experimental: true

Does not work

grafana output

Thank you Stephan for this work. I have a question about grafana. I try to access it at "127.0.0.1:3000" but it gives me this page

I'm not able to access dashboards, or any other thing from grafana? I'm not sure what I did wrong?

One other question please, what should I do to access collected metric values programmatically in python? should I use a specific library? or should I forward collected metrics to a database, then access it from python?

Regards

Is there a way to exclude/disable series for services that are related to monitoring?

Is there a way to automatically exclude all the containers and services series that are related to monitoring itself? All that starts with mon_*

PS. Thanks for putting together this awesome dashboard.

Replace Caddy

Caddy is free only for personal projects https://caddyserver.com/products/licenses

Prometheus.yml needs to be pulled into Docker Configs

Generally speaking, SwarmProm is a great starting point. One issue we're running into implementing this solution, however, is that (at this point) there is no way to extend prometheus metrics to other things.

For instance, we would like to monitor Traefik (BTW as an Aside, you should look at replacing Caddy with Traefik in your stack...In my opinion, it's an easier to configure traffic router than Caddy, with less random config files...YMMV) with Prometheus.

However, when I go to pull prometheus.yml out (create a docker config file for it, add that config into the monitoring stack file) upon starting prometheus we're getting:

"mv: can't rename '/tmp/prometheus.yml': Device or resource busy"

Meaning prometheus appears to already be running by the time Docker attempts to mount the prometheus.yml file into /etc/prometheus.

The only way to add to the scrape configs at this point is to download your Dockerfile / prometheus.yml file and re-build the prometheus container...so the prometheus included in this stack cannot really be extended to monitor other things.

Help a guy out? There's got to be a way to externalize the prometheus.yml file so that it can come in from docker configs (like the rules files do).

Dockerd-exporters are always down

Good day. And thanks for the great project. I really admire this one.

I run your stack on cluster with 1 manager and 2 workers. Everything looks good, but in Prometheus dashboard I see the next one:

As you write here, I update /etc/docker/daemon.json and restart docker service:

{
  "experimental": true,
  "metrics-addr": "0.0.0.0:9323"
}

I check my DOCKER_GWBRIDGE_IP:

$ ip -o addr show docker_gwbridge

3: docker_gwbridge    inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge\       valid_lft forever preferred_lft forever

If I curl this endpoint with next IPs, everything works:

$ curl http://172.18.0.1:9323/metrics
$ curl http://0.0.0.0:9323/metrics
$ curl http://localhost:9323/metrics

But in Prometheus dockerd-exporter statuses are always down.

$ docker service logs mon_dockerd-exporter

mon_dockerd-exporter.0.ofok9t4isfk9@node-1    | Activating privacy features... done.
mon_dockerd-exporter.0.ofok9t4isfk9@node-1    | http://:9323
mon_dockerd-exporter.0.ofok9t4isfk9@node-1    | 03/Apr/2018:07:36:34 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.ofok9t4isfk9@node-1    | 03/Apr/2018:07:36:49 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.ofok9t4isfk9@node-1    | 03/Apr/2018:07:37:04 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.ofok9t4isfk9@node-1    | 03/Apr/2018:07:37:19 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.ofok9t4isfk9@node-1    | 03/Apr/2018:07:37:34 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.ofok9t4isfk9@node-1    | 03/Apr/2018:07:37:49 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.rfzakmu3h1ml@node-2    | Activating privacy features... done.
mon_dockerd-exporter.0.rfzakmu3h1ml@node-2    | http://:9323
mon_dockerd-exporter.0.rfzakmu3h1ml@node-2    | 03/Apr/2018:07:36:37 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.rfzakmu3h1ml@node-2    | 03/Apr/2018:07:36:52 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.rfzakmu3h1ml@node-2    | 03/Apr/2018:07:37:07 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.rfzakmu3h1ml@node-2    | 03/Apr/2018:07:37:22 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.rfzakmu3h1ml@node-2    | 03/Apr/2018:07:37:37 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.rfzakmu3h1ml@node-2    | 03/Apr/2018:07:37:52 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.z9ud5s9c2u2s@node-3    | Activating privacy features... done.
mon_dockerd-exporter.0.z9ud5s9c2u2s@node-3    | http://:9323
mon_dockerd-exporter.0.z9ud5s9c2u2s@node-3    | 03/Apr/2018:07:36:36 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.z9ud5s9c2u2s@node-3    | 03/Apr/2018:07:36:51 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.z9ud5s9c2u2s@node-3    | 03/Apr/2018:07:37:06 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.z9ud5s9c2u2s@node-3    | 03/Apr/2018:07:37:21 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.z9ud5s9c2u2s@node-3    | 03/Apr/2018:07:37:36 +0000 [ERROR 502 /metrics] context canceled
mon_dockerd-exporter.0.z9ud5s9c2u2s@node-3    | 03/Apr/2018:07:37:51 +0000 [ERROR 502 /metrics] context canceled

Grafana is not showing docker worker nodes on Windows

Swarmprom is sucessfully running on Ubuntu Machine.
Currentl it is not showing worker nodes

Kindly assist me

Alertmanager fails to start

Hi !

the alertmanager container is stuck in an endless loop of starting and exiting straight away. These are the logs that I can get from a container:

time="2017-10-02T15:13:34Z" level=info msg="Starting alertmanager (version=0.8.0, branch=HEAD, revision=74e7e48d24bddd2e2a80c7840af9b2de271cc74c)" source="main.go:109"
time="2017-10-02T15:13:34Z" level=info msg="Build context (go=go1.8.3, user=root@439065dc2905, date=20170720-14:14:06)" source="main.go:110"
time="2017-10-02T15:13:34Z" level=info msg="Loading configuration file" file="/etc/alertmanager/alertmanager.yml" source="main.go:234"
time="2017-10-02T15:13:34Z" level=error msg="Loading configuration file failed: no global Slack API URL set" file="/etc/alertmanager/alertmanager.yml" source="main.go:237"

I've set the env variables for each of these ADMIN_USER, ADMIN_PASSWORD, SLACK_URL, SLACK_CHANNEL, SLACK_USER and I don't know what else to do to make this to work properly.

Monitor Missing/Crashing containers

Hi Stefan,

Need you advise please to understand how to monitor if a container is not running (the reasons could be someone deleted the container, crashed, etc etc).

Eg. I 've a kafka cluster with 3 zookeeper nodes and 3 kafka nodes. I want to be altered if any of the kafka or zookeeper node goes down or is not responding.

Since your setup I can't put additional configs in Prometheus.yml, how can I create such rules with the rules file?

caddy not starting.

Hi,

the caddy server is not starting, when I do a docker service ls, all the services I see as started with caddy only having replica as 0/1.
I did inspect and its doesn't show any error or even no logs o/p too from the container. When I remove the stack and redeploy it, sometimes the health of the caaddy container is starting and sometimes it's unhealthy.

I 'm running this on a Ubuntu 16-04 node with latest docker version.

Progress status

Wow, I'm surprise to see this stack. I was think about migrating your dockprom project :)

As I see you are working actively on it. Do you consider it ready for other folks to try it?

node_meta metrics are messy on Prometheus console

I have tried to use Prometheus to monitor two docker swarms together refer to your swarmprom guide.
Since Prometheus is not in the same overlay network with the monitored nodes, I tried to use static_config instead of dns_sd_configs:

Deploy node-exporter, cadvisor, dockerd-exporter as global service on two docker swarm seperately.
Add all node-exporter, cadvisor, dockerd-exporter targets using static_config in prometheus.yml
eg.
scrape_configs:

job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
job_name: 'node-exporter'
static_configs:
- targets: ['infbjsrv35.cn.oracle.com:9100','infbjsrv36.cn.oracle.com:9100','infbjvm539.cn.oracle.com:9100','infbjvm223.cn.oracle.com:9100']

Start Prometheus, alertmanager and unsee on another host(which is not node of any swarm)
When check node_meta metrics on Prometheus console, I found the node_meta is messy.
In each swarm, the node_meta data from one node will mismach each node exporter instance to composed a node meta metric.
For eg. swarm “A” has two nodes: infbjsrv35.cn.oracle.com and infbjvm223.cn.oracle.com
node_meta from http://infbjsrv35.cn.oracle.com:9100/metrics is
node_meta{container_label_com_docker_swarm_node_id="n9x7iwqhqe51y80c00a5c16fd",node_id="n9x7iwqhqe51y80c00a5c16fd",node_name="infbjsrv35.cn.oracle.com"} 1

node_meta from http://infbjvm223.cn.oracle.com:9100/metrics is
node_meta{container_label_com_docker_swarm_node_id="wx86gspnvhgdli8kq0k93m392",node_id="wx86gspnvhgdli8kq0k93m392",node_name="infbjvm223.cn.oracle.com"} 1

But from Prometheus console, the result of executing node_meta will show 4 metrics, mismached the instances and the node meta data:
node_meta{container_label_com_docker_swarm_node_id="n9x7iwqhqe51y80c00a5c16fd",instance="infbjvm223.cn.oracle.com:9100",job="node-exporter",node_id="n9x7iwqhqe51y80c00a5c16fd",node_name="infbjsrv35.cn.oracle.com"} | 1
node_meta{container_label_com_docker_swarm_node_id="n9x7iwqhqe51y80c00a5c16fd",instance="infbjsrv35.cn.oracle.com:9100",job="node-exporter",node_id="n9x7iwqhqe51y80c00a5c16fd",node_name="infbjsrv35.cn.oracle.com"} | 1
node_meta{container_label_com_docker_swarm_node_id="wx86gspnvhgdli8kq0k93m392",instance="infbjvm223.cn.oracle.com:9100",job="node-exporter",node_id="wx86gspnvhgdli8kq0k93m392",node_name="infbjvm223.cn.oracle.com"} | 1
node_meta{container_label_com_docker_swarm_node_id="wx86gspnvhgdli8kq0k93m392",instance="infbjsrv35.cn.oracle.com:9100",job="node-exporter",node_id="wx86gspnvhgdli8kq0k93m392",node_name="infbjvm223.cn.oracle.com"} | 1

I can not understand why this happen, and why dns_sd_configs can collect the right node metadata.
Can you help me?

Adapt the Prometheus dashboard for v2

The current Prometheus dashboard doesn't work since it's made for Prom 1.x

Prometheus 502 Bad Gateway Error

Hello,

I'm new to Docker, Prometheus and Grafana. Trying to learn the basic stuff. I followed the steps that has been said in this repository.I have no problem reaching to Grafana, Alert Manager with <swarm_ip>:xxxx, but when I try to reach Prometheus, <swarm_ip>:9090 I get a 502 Bad Gateway error. Unfortunately I couldn't find a documentation on Prometheus errors.

PS: Thanks for the great tutorial.

Idea: Working around complicated hostname vs. container ip...

In my case it is possible to manually define the hosts to scrape (with hostnames) because they normally do not change.
Then I simply mapped the cAdvisor and node_exporter ports to the host machine so I can combine docker, cAdvisor and node_exporter metrics.
Is this a good, bad or ugly way?
Just an idea...

Task rules Slack notification appear wrong

Hello,
can you help me? Why following task_high_memory_usage_1g defined task rule (default):

  - alert: task_high_memory_usage_1g
    expr: |
      sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"})
      BY (container_label_com_docker_swarm_task_name, container_label_com_docker_swarm_node_id) > 1e+09
    for: 5m
    annotations:
      description: '{{ $labels.container_label_com_docker_swarm_task_name }} on ''{{ $labels.container_label_com_docker_swarm_node_id }}'' memory usage is {{ humanize $value }}.'
      summary: Memory alert for Swarm task '{{ $labels.container_label_com_docker_swarm_task_name }}' on '{{ $labels.container_label_com_docker_swarm_node_id }}'

Appers in Slack like below?

No description or other annotations.

task_high_cpu_usage_50 task rule appears correctly:

Thank you.

Grafana dashboard

Dashboard and datasource are no longer included after login. there was no changes to the repo, just doing regular docker stack deploy

Question about service discovery

Hi I d like use Prometheus in swarm . It is not clear for me if I need to add in the composer consul installation or if consul and registrator is already present inside this bundle . In the first case is there particular setting in Prometheus to add ?

Prometheus only getting metrics from manager node

I'm new to using Prometheus and I would really appreciate some help. I've been looking into this issue for quite a bit. I have a swarm of machines with 1 manager and 7 workers. The manager is on a digital ocean instance and the workers are physical machines on my local network.

The problem is when I go to the Grafana dashboard only 1 node is being detected. When I visit the prometheus targets url at port 9090, I see 8 endpoints but only 1 is up. The rest have an error that says "context_deadline_exceeded".

On each machine, I have set the metrics address to 0.0.0.0:9323 and experimental mode is set to true. I have also enabled port 2376 on the machines, 7946, and 4789.

Any suggestions to get metrics for the other nodes is much appreciated. Thank you!

No swarm manger and swarm nodes are visible in alert manager?

Please help to resolve this issue.

Would engine metrics be insecure?

Using experimental and 0.0.0.0:9323 pretty much export the port to the public is there other secure way to export this, and not show it to anyone?

Not able to monitor Swarm master in Grafana ?

In current setup, we have 3 nodes and 1 master.

All nodes are visible properly on Grafana but Master is not visible Grafana.

Please help me to resolve this issu.

Thanks in advance :-)

Only see 2 nodes out of the 3 masters

I have deployed swarmprom on a 3 nodes cluster on Docker for AWS. All nodes are masters and are running fine but and only 2 nodes are listed in Grafana, a couple of my app stacks are also missing.
All the swarmprom services seem to run fine though.
Any hints ?

btw, thanks a lot, really great project ! 👍

Grafana does not detect any of the docker swarm Dashboards

I've tried this a few times, and logged in an verified the docker swarm nodes and services dashboards are present in the /etc/grafana/dashboards directory, however it never sees them for import.

When I manually import the json files, they result in completely blank dashboards.

Disable Basic Authentication

How to do i disable the basic authentication that is now required for me to login. I understand that caddy service is responsible for authentication but i cant figure out how to disable it. Any idea?

store prometheus metrics in postgresql

I'm trying to store prometheus metrics in postgresql based on prometheus-postgresql-adapter. I modified the docker-compose.yml to the docker-compose-pg-old.yml.pdf (which includes 2 additional services corresponding to the first 2 containers in prometheus-postgresql-adapter, and comments out the local storage for prometheus). The prometheus.yml is modified as shown in the prometheus.yml.pdf to direct "read" and "write" to postgresql. I had to build the prometheus docker image to include the modified prometheus.yml.

The stack is deployed under the name "mon". The mon_prometheus should connect to the "mon_prometheus_postgresql_adapter", which in turn connects to mon_pg_prometheus (the postgresql database). The problem is that "mon_prometheus" service is unable to connect the "mon_prometheus_postgresql_adapter". The logs from "mon_prometheus" says:

level=error ts=2018-02-20T04:13:33.284782524Z caller=engine.go:544 component="query engine" msg="error selecting series set" err="error sending request: Post http://mon_prometheus_postgresql_adapter:9201/read: dial tcp: lookup mon_prometheus_postgresql_adapter on 127.0.0.11:53: no such host"

Regards

Disable Basic Authentication

How to Disable basic Authentication. I understand that caddy service is responsible for authentication. How do I bypass this basic authentication. Any help? Thanks

Prometheus container is continuously restarting "Received SIGTERM, exiting gracefully..."

I am using this repo to create monitoring stack for our production swarm environments.
Have made some changes in prometheus configuration
Can you please help me to fix this problem.

removed docker-enterypoint.sh
Attached herewith my prometheus.yaml file
Attached herewith prometheus dockerfile
Modified docker-compose.yml
Share whole code @ https://codeshare.io/5gb8My

I could deploy all services except getting below error on prometheus container

`deb795407a (none))"

level=info ts=2018-03-07T17:07:38.10631854Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-03-07T17:07:38.109652503Z caller=main.go:502 msg="Starting TSDB ..."
level=info ts=2018-03-07T17:07:38.127573843Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-03-07T17:07:38.574693038Z caller=main.go:512 msg="TSDB started"
level=info ts=2018-03-07T17:07:38.574933556Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-03-07T17:07:38.578334416Z caller=main.go:489 msg="Server is ready to receive web requests."
level=warn ts=2018-03-07T17:08:05.313728189Z caller=main.go:366 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2018-03-07T17:08:05.313788495Z caller=main.go:390 msg="Stopping scrape discovery manager..."
level=info ts=2018-03-07T17:08:05.3138142Z caller=main.go:403 msg="Stopping notify discovery manager..."
level=info ts=2018-03-07T17:08:05.313828264Z caller=main.go:427 msg="Stopping scrape manager..."
level=info ts=2018-03-07T17:08:05.313855348Z caller=main.go:386 msg="Scrape discovery manager stopped"
level=info ts=2018-03-07T17:08:05.313893078Z caller=main.go:399 msg="Notify discovery manager stopped"
level=info ts=2018-03-07T17:08:05.31401654Z caller=main.go:421 msg="Scrape manager stopped"
level=info ts=2018-03-07T17:08:05.317560586Z caller=manager.go:460 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-03-07T17:08:05.317627258Z caller=manager.go:466 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-03-07T17:08:05.31764061Z caller=notifier.go:493 component=notifier msg="Stopping notification manager..."
level=info ts=2018-03-07T17:08:05.317659353Z caller=main.go:573 msg="Notifier manager stopped"
level=info ts=2018-03-07T17:08:05.317714607Z caller=main.go:584 msg="See you next time!"`

`docker@manager:/Users/gaurav.goyal/gg/swarmprom/prometheus/conf$ cat prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s

external_labels:
monitor: 'promswarm'

rule_files:

"swarm_node.rules.yml"
"swarm_task.rules.yml"
alerting:
alertmanagers:

static_configs:
targets:
alertmanager:9093
scrape_configs:

job_name: 'prometheus'
static_configs:
targets: ['localhost:9090']

job_name: 'dockerd-exporter'
dns_sd_configs:
names:
'tasks.dockerd-exporter'
type: 'A'
port: 9323
job_name: 'cadvisor'
dns_sd_configs:

names:
'tasks.cadvisor'
type: 'A'
port: 8080
job_name: 'node-exporter'
dns_sd_configs:

names:
'tasks.node-exporter'
type: 'A'
port: 9100
job_name: 'grafana'
dns_sd_configs:

names:
'tasks.grafana'
type: 'A'
port: 3000 FROM prom/prometheus:v2.2.0-rc.0

COPY conf/ /etc/prometheus/

#ENTRYPOINT [ "/etc/prometheus/docker-entrypoint.sh" ]
CMD [ "--config.file=/etc/prometheus/prometheus.yml",
"--storage.tsdb.path=/prometheus",
"--web.console.libraries=/usr/share/prometheus/console_libraries",
"--web.console.templates=/usr/share/prometheus/consoles" ]`

Monitoring http code status

Hi,

Is possible with swarmprom monitoring HTTP status code of my web applications? I like to get a slack notification if my application don't return HTTP 200 code.

Thanks!

Unable to scrape outside the swarm.

I'm having trouble scraping data outside of the swarm. I do not get any errors but no data shows up. Here is my prometheus.yml. Its the default file with very minor changes. Any thoughts?

global:
scrape_interval: 15s
evaluation_interval: 15s

external_labels:
monitor: 'promswarm'

rule_files:

"swarm_node.rules.yml"
"swarm_task.rules.yml"

alerting:
alertmanagers:

static_configs:
- targets:
  - alertmanager:9093

scrape_configs:

job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
job_name: 'dockerd-exporter'
dns_sd_configs:
- names:
  - 'tasks.dockerd-exporter'
    type: 'A'
    port: 9323
job_name: 'cadvisor'
dns_sd_configs:
- names:
  - 'tasks.cadvisor'
    type: 'A'
    port: 8080
job_name: 'node-exporter'
dns_sd_configs:
- names:
  - 'tasks.node-exporter'
    type: 'A'
    port: 9100
job_name: 'perforce_node_exporter'
scrape_interval: 30s
static_configs:
- targets:
  - xxx.xxx.xxx.xxx
  - xxx.xxx.xxx.xxx
  - xxx.xxx.xxx.xxx
  - xxx.xxx.xxx.xxx
  - xxx.xxx.xxx.xxx

ADMIN vars ???

Been reading the README and it doesn't state WHERE we put the ADMIN vars. No file is given in the README. I see them referenced in the code but I see no place to set them and can only ASSUME we set those in bash as env variables.

But in looking at issue #2 (#2), it looks like we don't... but it doesn't state WHAT FILE to declare those in.

Can someone clarify this in documentation???

How to Monitor Nodes outside the swarm cluster.

I'm trying to monitor few nodes outside the swarm cluster but unable to reach those nodes from inside the prometheus container

relabing metrics

men if there is a better way to use this "sum(node_memory_MemAvailable * on(instance) group_left(node_id, node_name) node_meta) by (node_id, node_name)", i'll appreciate maybe some metrics_relabel thanks

Not able to monitor 3rd party exporters

Hi Stefan,

I tried to follow the "https://github.com/stefanprodan/swarmprom#monitoring-applications-and-backend-services" to monitor kafka and MySQL services using Prometheus provided exporters for these tools.

Eg. this one for MySQL
https://github.com/prometheus/mysqld_exporter

I configured this in docker-compose file

    environment:
      - JOBS=kafka-exporter:9308 mysql-exporter:9104

Now I can see the metrics from the web browser. But my Prometheus is not scraping any metrics from them.

So I have some confusion here.

I 've attached these exporter containers to my mon_net network. But I started them with the docker run command, do I need to start with them with stack?
If I want to use the blackbox exporter which needs much more arguments than the exporter name and port how do I pass them to the container? given that I can't edit the Prometheus.yml file.

Thanks for the help.

Regards,
Ashish

Prometheus 2.0

Prometheus 2 is already released. Is it going to be supported instead of current 1.8?

https://prometheus.io/docs/prometheus/2.0/migration/

swarm nodes

first thanks for this nice stack!
for some reason swarm node dashboard always shows wrong number of nodes, it is correct in services dashboard but not in swarm nodes, any idea what it could be?

Change default web access port 3000 by 80 (443)

Hi , First of all, thank you for the job you're doing. When the stack is deployed the access port to the web interface is 3000. How can it be changed to 80 (443 eventually)?

Thanks in advance for helping :)

Having an issue with adding service monitoring.

When I try to monitor an application, for example Redis, I'm having issues.
My config:

*docker-compose.yml:
prometheus:
image: stefanprodan/swarmprom-prometheus
environment:

JOBS=redis-exporter:9121

*prometheus.yml:
job_name: 'redis-exporter'
dns_sd_configs:
names:
'tasks.redis-exporter'
type: 'A'
port: 9121
*compose-redis.yml:
version: '3'

networks:
mon_net:
external: true

services:
redis:
image: redis
networks:

mon_net
ports:
"6379:6379"
deploy:
mode: global

redis-exporter:
image: oliver006/redis_exporter
networks:

mon_net
ports:
"9121:9121"
deploy:
mode: global

When I run the monitoring stack and then compose-redis:

Prometheus goes up and down all the time.

Log shows:

level=error ts=2018-02-19T16:49:15.594740858Z caller=main.go:582 err="Error loading config couldn't load configuration (--config.file=/etc/prometheus/prometheus.yml): parsing YAML file /etc/prometheus/prometheus.yml: unknown fields in alertmanager config: job_name"

I have no idea how to fix this or what I did wrong.
Any help would be appreciated.

Sorry for posting in the wrong place at first.

Thanks

Grafana reporting 171% available disk space

Bug Report

What did you do?
Deployed swarmprom in my Swarm cluster, logged into Grafana, and noticed that the available disk space exceeds 100%

What did you expect to see?
A value lower or at most equal to 100%

What did you see instead? Under which circumstances?
171%, every time

Is it a bug in the node-exporter data?
The df -h of the first of the two nodes is:

[msadmin@MS-DSC1 ~]$ df -h
Filesystem                                    Size  Used Avail Use% Mounted on
/dev/sda2                                      30G  4.1G   26G  14% /
devtmpfs                                      3.9G     0  3.9G   0% /dev
tmpfs                                         3.9G     0  3.9G   0% /dev/shm
tmpfs                                         3.9G  377M  3.6G  10% /run
tmpfs                                         3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1                                     497M  105M  392M  22% /boot
/dev/sdb1                                      16G   45M   15G   1% /mnt/resource
//msshare.file.core.windows.net/msshare  5.0T  8.5M  5.0T   1% /mnt/msshare
tmpfs                                         797M     0  797M   0% /run/user/1000

and the second is virtually identical.
Thank you,
Roberto

alertmanager container mutates config file

It would be better IMO to just copy the alertmanager.yml file into the /tmp folder in the Dockerfile and have the entrypoint perform the file modifications as a part of the copy.

If I try and add a docker config file to the path /etc/alertmanager/alertmanager.yml i get the error

mv: can't rename '/tmp/alertmanager.yml': Device or resource busy

Support for Arm (raspberry Pi 3)

Hello Stefan,

Great project + blog explaining the whole thing!!

Is there any chance to have this project working on a docker swarm build upon 5 raspberry pi 3 nodes?

Greetz,
Raymond