This is an on-prem installation of 3 Scylla servers, with the monitoring solution on one of the clients, all connected to the same private network, 10.9.31.xx.
I followed the instructions on the Git page: on a fresh server I installed Docker and cloned the repository. Setting up the monitoring tool is not idiot-proof and takes extensive work to get running. Please help make the system easier to use.
- Add alerts when one of the Docker containers (Grafana or Prometheus) is not up; this will save a lot of hassle. Add a manual on how to look for the issue (journalctl -xe?).
- Report how many servers are up and how many are down in a way that makes sense. An N/A note on "dead" servers that are actually up (phantom) is not helpful.
- Are there any iptables/firewall requirements for the connection between the containers/host and the Scylla servers?
- Add an example line showing how to add servers to the prometheus.yml file. A single server, 127.0.0.1, does not explain how to add multiple servers to monitor.
- The exporter on the servers constantly crashes, and this is not reflected in the monitoring tool: the server is up, the exporter is down, and the server is shown as dead or N/A.
- If the Docker daemon is not running, either start it from the start script or exit immediately; do not try to start the rest of the tools. Currently a continuous string of dots is printed to the screen, with no information on what is going on or what the tool is trying to do.
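The last request above could be handled by a small pre-flight check at the top of the start script. The following is only a sketch, not the project's code: `require_docker` is a hypothetical helper name, and it relies on `docker info` returning a non-zero status when the daemon cannot be reached.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of the current start-all.sh.
require_docker() {
    # `docker info` has to talk to the daemon, so it fails when the daemon is down.
    if ! docker info >/dev/null 2>&1; then
        echo "ERROR: cannot reach the Docker daemon. Start it (for example: systemctl start docker) and retry." >&2
        return 1
    fi
}
```

A start script could then begin with `require_docker || exit 1`, so nothing else is attempted while the daemon is down.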
[root@localhost ~]# git clone https://github.com/scylladb/scylla-grafana-monitoring.git
Cloning into 'scylla-grafana-monitoring'...
remote: Counting objects: 280, done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 280 (delta 21), reused 0 (delta 0), pack-reused 236
Receiving objects: 100% (280/280), 64.42 KiB | 0 bytes/s, done.
Resolving deltas: 100% (140/140), done.
[root@localhost ~]# ls -ltr
total 12
-rw-------. 1 root root 1556 Sep 7 17:40 anaconda-ks.cfg
drwxr-xr-x. 2 root root 23 Sep 23 11:16 cassandra.logdir_IS_UNDEFINED
-rw-r--r--. 1 root root 3416 Sep 28 16:19 loadmlnx.yaml
drwxr-xr-x. 5 root root 4096 Oct 4 01:18 scylla-grafana-monitoring
[root@localhost ~]# service docker start
Redirecting to /bin/systemctl start docker.service
[root@localhost ~]# cd scylla-grafana-monitoring/
[root@localhost scylla-grafana-monitoring]# cd prometheus/
[root@localhost prometheus]# vi prometheus.yml
[root@localhost prometheus]# cd ../
[root@localhost scylla-grafana-monitoring]# ./start-all.sh
Unable to find image 'prom/prometheus:v1.0.0' locally
Trying to pull repository docker.io/prom/prometheus ...
v1.0.0: Pulling from docker.io/prom/prometheus
385e281300cc: Pull complete
a3ed95caeb02: Pull complete
e418e02f5f37: Pull complete
6c2c7730b5ef: Pull complete
bbc184d7f32a: Pull complete
17a6ebba0cea: Pull complete
d1b2d64d311e: Pull complete
356f67417ef1: Pull complete
Digest: sha256:13cca70de2522231af89f19fc246fad6bc594698ede40fc7712a74ce71f1068f
Status: Downloaded newer image for docker.io/prom/prometheus:v1.0.0
0c9ffbb5da10e333e2e702a4f1585c0ded7c0130efb6cf3584475aa8a5a09353
Unable to find image 'grafana/grafana:3.1.0' locally
Trying to pull repository docker.io/grafana/grafana ...
3.1.0: Pulling from docker.io/grafana/grafana
5c90d4a2d1a8: Pull complete
b1a9a0b6158e: Pull complete
acb23b0d58de: Pull complete
Digest: sha256:3476700a51ff136a507f9d09a6626964b6cfbc9352ed23e0063d8785d2b2c30f
Status: Downloaded newer image for docker.io/grafana/grafana:3.1.0
7b452d487663df60df543fe17c9e3a0396e01f8c6118d628d0c83f3025670d25
.HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Set-Cookie: grafana_sess=bea800eac5a7fac0; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 37
{"id":1,"message":"Datasource added"}HTTP/1.1 100 Continue
HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=cfe2c9b9168e0059; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 64
{"slug":"scylla-cluster-metrics","status":"success","version":0}HTTP/1.1 100 Continue
HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=405697f0dfdd1d35; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 67
{"slug":"scylla-per-server-metrics","status":"success","version":0}HTTP/1.1 100 Continue
HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: grafana_sess=0220efe08badfc1e; Path=/; HttpOnly
Date: Tue, 04 Oct 2016 08:21:05 GMT
Content-Length: 68
{"slug":"scylla-per-server-disk-i-o","status":"success","version":0}[root@localhost scylla-grafana-monitoring]#
Added the servers I am trying to read from to the prometheus.yml file:
cat prometheus/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'scylla-monitor'
scrape_configs:
- job_name: scylla
  honor_labels: true
  static_configs:
  - targets: ['10.9.31.182:9103','10.9.31.183:9103','10.9.31.184:9103']
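Since the shipped example lists only 127.0.0.1, a short helper that produces the `targets` line for several servers may make the format clearer. This is a sketch under assumptions: `make_targets` is a hypothetical name, and port 9103 is simply the exporter port used in the file above.

```shell
#!/bin/sh
# Hypothetical helper: print a Prometheus `targets` line for a list of hosts.
# Usage: make_targets PORT HOST [HOST...]
make_targets() {
    port=$1
    shift
    out=""
    for host in "$@"; do
        # Append a comma only when the list already has an entry.
        out="${out:+$out,}'$host:$port'"
    done
    printf "  - targets: [%s]\n" "$out"
}
```

For example, `make_targets 9103 10.9.31.182 10.9.31.183 10.9.31.184` prints a line in the same form as the last line of the file above, ready to paste under `static_configs:`.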
Going to the web browser and pointing it to 10.9.31.186, where my monitoring system is installed, no data appears:
Looking into the data sources in the Grafana setup, I see:
Tried to verify the installation, getting:
Tried pointing to the IP address of the setup (10.9.31.186) and got the same error message:
Well, it seems that the Prometheus server didn't come up.
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
70175d8a3609 grafana/grafana:3.1.0 "/run.sh" 11 seconds ago Up 9 seconds 0.0.0.0:3000->3000/tcp agraf
For some reason it didn't read the prometheus.yml file from the working directory.
Oct 04 01:37:04 localhost.localdomain avahi-daemon[13905]: Withdrawing workstation service for veth4b6c354.
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (veth4b6c354): failed to disable userspace IPv6LL address handling
Oct 04 01:37:04 localhost.localdomain kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth409be50: link becomes ready
Oct 04 01:37:04 localhost.localdomain kernel: docker0: port 1(veth409be50) entered forwarding state
Oct 04 01:37:04 localhost.localdomain kernel: docker0: port 1(veth409be50) entered forwarding state
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (veth409be50): link connected
Oct 04 01:37:04 localhost.localdomain NetworkManager[1388]: (docker0): link connected
Oct 04 01:37:04 localhost.localdomain sudo[31344]: root : TTY=pts/0 ; PWD=/root/scylla-grafana-monitoring ; USER=root ; COMMAND=/bin/docker run -d
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Starting prometheus (version=1.0.0, branch=mas
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Build context (go=go1.6.2, user=root@98d6f3664
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=info msg="Loading configuration file /etc/prometheus/pro
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T08:37:04Z" level=error msg="Couldn't load configuration (-config.file=/et
Oct 04 01:37:04 localhost.localdomain docker-current[28511]: time="2016-10-04T01:37:04.891507324-07:00" level=info msg="{Action=create, Username=root,
Oct 04 01:37:04 localhost.localdomain systemd[1]: Stopped docker container 20e621214a5c26105dfe5e076a0dd440aa8911ff44ff149920f63a6072a4788b.
-- Subject: Unit docker-20e621214a5c26105dfe5e076a0dd440aa8911ff44ff149920f63a6072a4788b.scope has finished shutting down
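The journal lines above are cut off at the terminal width, so the actual configuration error is not visible here; `docker logs aprom` should print the container's own output in full. A check along the following lines could also report a missing container right after startup. It is a sketch only: `check_containers` is a hypothetical helper, and the names `aprom` and `agraf` are the ones the start script assigns.

```shell
#!/bin/sh
# Hypothetical helper: verify that the named containers are running.
# Usage: check_containers NAME [NAME...]
check_containers() {
    # One running container name per line.
    running=$(docker ps --format '{{.Names}}')
    missing=0
    for name in "$@"; do
        if ! printf '%s\n' "$running" | grep -qxF "$name"; then
            echo "Container '$name' is not running; inspect it with: docker logs $name" >&2
            missing=1
        fi
    done
    return $missing
}
```

Calling `check_containers aprom agraf` at the end of the start script would have flagged the missing Prometheus container in the situation shown above.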
Changing the start script to force it to read the yml file got the Prometheus container up.
From the start script:
if [ -z $DATA_DIR ]
then
    sudo docker run -d -v /root/scylla-grafana-monitoring/prometheus/prometheus.yml -p 9090:9090 --name aprom prom/prometheus:v1.0.0
else
    echo "Loading prometheus data from $DATA_DIR"
    sudo docker run -d -v $DATA_DIR:/prometheus:Z -v $PWD/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:Z -p 9090:9090 --name aprom prom/prometheus:v1.0.0
fi
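A side note on the `-v` option in the first branch, offered as something to check rather than a diagnosis: Docker's bind-mount form is `host_path:container_path`, and a `-v` with a single path creates an anonymous volume at that path inside the container instead of mounting the host file. The second branch uses the full form. In addition, when the host path of a bind mount does not exist, Docker creates it as a directory. The sketch below builds the full argument and refuses a missing file; `prometheus_mount_arg` is a hypothetical helper, and the container path and `:Z` suffix are copied from the second branch.

```shell
#!/bin/sh
# Hypothetical helper: build a host:container bind-mount argument for prometheus.yml.
# Usage: prometheus_mount_arg PATH_TO_PROMETHEUS_YML
prometheus_mount_arg() {
    config=$1
    if [ ! -f "$config" ]; then
        echo "ERROR: $config does not exist or is not a regular file" >&2
        return 1
    fi
    # Bind mounts need an absolute host path.
    case $config in
        /*) ;;
        *) config="$PWD/$config" ;;
    esac
    printf '%s:/etc/prometheus/prometheus.yml:Z\n' "$config"
}
```

It could be used as `sudo docker run -d -v "$(prometheus_mount_arg prometheus/prometheus.yml)" -p 9090:9090 --name aprom prom/prometheus:v1.0.0`.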
Now the data source is active:
Still, with 3 servers listed in the yml file,
cat prometheus/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'scylla-monitor'
scrape_configs:
- job_name: scylla
  honor_labels: true
  static_configs:
  - targets: ['10.9.31.182:9103','10.9.31.183:9103','10.9.31.184:9103']
the monitoring shows only one server :(
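One way to narrow this down is to test, from the monitoring host, whether each exporter answers on its port; a firewall rule or a crashed exporter would show up as an unreachable target. The following is a sketch under assumptions: `check_exporters` is a hypothetical helper, `curl` is assumed to be installed, and the exporters are assumed to serve Prometheus' default `/metrics` path.

```shell
#!/bin/sh
# Hypothetical helper: probe each exporter target (HOST:PORT) over HTTP.
# Usage: check_exporters HOST:PORT [HOST:PORT...]
check_exporters() {
    failed=0
    for target in "$@"; do
        # -s silent, -f fail on HTTP errors, -m 5 give up after 5 seconds.
        if curl -s -f -m 5 -o /dev/null "http://$target/metrics"; then
            echo "$target: reachable"
        else
            echo "$target: NOT reachable"
            failed=1
        fi
    done
    return $failed
}
```

For this setup the call would be `check_exporters 10.9.31.182:9103 10.9.31.183:9103 10.9.31.184:9103`.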
When trying to start the monitoring system again, it hangs:
For example:
e6b803cbcc89d11c20f128808eeea7a18192447a94367615592b4b48d0d1071c
79949fcb5d3f4208106f7c71a78ab168a811682238a2dd780d31f025e979da1e
............................ (and it goes to infinity and beyond); the system stays in this state for minutes, until Ctrl-C.
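The dots look like a wait loop that polls a service until it answers, although that is an assumption about the start script. A bounded wait that says what it is waiting for and gives up after a timeout might look like the sketch below; `wait_for_url` is a hypothetical helper, and `http://localhost:3000` (the Grafana port published above) is only an example URL.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL once per second, giving up after a timeout.
# Usage: wait_for_url URL [TIMEOUT_SECONDS]
wait_for_url() {
    url=$1
    timeout=${2:-60}
    elapsed=0
    echo "Waiting up to ${timeout}s for $url ..." >&2
    until curl -s -o /dev/null "$url"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "ERROR: $url did not respond within ${timeout}s" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

With `wait_for_url http://localhost:3000 60 || exit 1`, the script would stop with a clear message instead of printing dots indefinitely.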
This is the output of journalctl -xe. What are the containers trying to do?
Oct 04 02:17:00 localhost.localdomain oci-systemd-hook[14799]: systemdhook : Skipping as container command is /run.sh, not init or systemd
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): failed to find device 17 'vethcd2d3e9' with udev
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): new Veth device (carrier: OFF, driver: 'veth', ifindex: 17)
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): link disconnected
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain avahi-daemon[1289]: Withdrawing workstation service for vethcd2d3e9.
Oct 04 02:17:00 localhost.localdomain avahi-daemon[1289]: Withdrawing workstation service for veth02da683.
Oct 04 02:17:00 localhost.localdomain kernel: device veth02da683 left promiscuous mode
Oct 04 02:17:00 localhost.localdomain kernel: docker0: port 2(veth02da683) entered disabled state
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (vethcd2d3e9): failed to disable userspace IPv6LL address handling
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (docker0): bridge port veth02da683 was detached
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): released from master docker0
Oct 04 02:17:00 localhost.localdomain NetworkManager[1238]: (veth02da683): failed to disable userspace IPv6LL address handling
Oct 04 02:17:00 localhost.localdomain kernel: XFS (dm-5): Unmounting Filesystem
Oct 04 02:17:14 localhost.localdomain kernel: docker0: port 1(veth6c657e7) entered forwarding state