gliderlabs / docker-consul
Dockerized Consul
License: MIT License
We are building an automation platform using Registrator and Consul, and one use case sends docker kill --signal=HUP to an nginx container to reload an automatically generated load-balancer config whenever a K/V in Consul changes.
We had Consul installed on the base hosts and used a Consul watch to call a local script that sent the signal, something like:
docker kill --signal="HUP" loadbalancer
Would like to be able to do the same with containerized consul agents.
Since we can mount the docker socket, and the docker binary is already available, would this be as simple as creating a new shell script such as this
#!/bin/bash
docker kill --signal="HUP" loadbalancer
Perhaps this could be generalized so we could pass the name of the container to be HUPed, or even another parameter for the signal name?
I'm going to proceed with this, but I'm wondering if you have better ideas. Perhaps this is something you have already considered and rejected, or do another way?
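For what it's worth, here is a minimal sketch of that generalized handler; the function name and the defaults are my own invention, not something from this repo:

```shell
# Generalized watch handler: send an arbitrary signal to a named container.
# The defaults (loadbalancer / HUP) are hypothetical placeholders.
hup_container() {
  container="${1:-loadbalancer}"  # container to signal
  signal="${2:-HUP}"              # signal to send
  docker kill --signal="$signal" "$container"
}
```

A Consul watch could then invoke it as `hup_container loadbalancer HUP`, or with another container name and signal.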
First of all, if I run:
$ docker run -d -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h consul --name=consul progrium/consul -server -bootstrap -ui-dir /ui
$ curl http://$(docker-machine ip local):8500/v1/catalog/service/consul
[{"Node":"consul","Address":"172.17.0.17","ServiceID":"consul","ServiceName":"consul","ServiceTags":[],"ServiceAddress":"","ServicePort":8300}]
Why is "ServiceAddress" blank?
Then, when I run one of my containers, its /etc/resolv.conf looks correct:
nameserver 172.17.42.1
nameserver 8.8.8.8
search service.consul
But:
ping: unknown host consul
I noticed in the Dockerfile (master) that it's using Docker 1.5. I am using 1.6, so I tried my own image built with 1.6, but I get the same results.
I can ping:
PING 172.17.42.1 (172.17.42.1) 56(84) bytes of data.
64 bytes from 172.17.42.1: icmp_seq=1 ttl=64 time=0.144 ms
and when I use registrator and run another service, that service does resolve properly. So it seems that the consul container itself is the one having problems.
(I am on OS X, using docker-machine.)
I imagine this is something I'm doing wrong, but I can't seem to get my health checks to pass.
The 2 consul agents are running on different servers and those servers can access all ports of each other on both UDP and TCP.
Server Logs:
2014/09/07 00:34:58 [INFO] consul: member 'ip-10-0-10-254' joined, marking health alive
2014/09/07 00:34:59 [INFO] memberlist: Suspect ip-10-0-10-254 has failed, no acks received
2014/09/07 00:35:01 [INFO] memberlist: Suspect ip-10-0-10-254 has failed, no acks received
2014/09/07 00:35:03 [INFO] memberlist: Suspect ip-10-0-10-254 has failed, no acks received
2014/09/07 00:35:04 [INFO] memberlist: Marking ip-10-0-10-254 as failed, suspect timeout reached
2014/09/07 00:35:04 [INFO] serf: EventMemberFailed: ip-10-0-10-254 10.0.10.254
2014/09/07 00:35:04 [INFO] consul: removing server ip-10-0-10-254 (Addr: 10.0.10.254:8300) (DC: dc1)
2014/09/07 00:35:04 [INFO] consul: member 'ip-10-0-10-254' failed, marking health critical
Client Logs:
2014/09/07 00:32:28 [INFO] serf: EventMemberJoin: ip-10-0-20-6 10.0.20.6
2014/09/07 00:32:28 [INFO] consul: adding server ip-10-0-20-6 (Addr: 10.0.20.6:8300) (DC: dc1)
2014/09/07 00:32:29 [INFO] memberlist: Suspect ip-10-0-20-6 has failed, no acks received
2014/09/07 00:32:31 [INFO] memberlist: Suspect ip-10-0-20-6 has failed, no acks received
2014/09/07 00:32:33 [INFO] memberlist: Suspect ip-10-0-20-6 has failed, no acks received
2014/09/07 00:32:34 [INFO] memberlist: Marking ip-10-0-20-6 as failed, suspect timeout reached
2014/09/07 00:32:34 [INFO] serf: EventMemberFailed: ip-10-0-20-6 10.0.20.6
2014/09/07 00:32:34 [INFO] consul: removing server ip-10-0-20-6 (Addr: 10.0.20.6:8300) (DC: dc1)
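In case it helps others who hit the repeated "no acks received" pattern: memberlist probes over UDP as well as TCP, so when the agents are containerized, the serf gossip ports need to be published for both protocols and the host address advertised. A sketch (the address is a placeholder, not taken from the logs above):

```shell
# Publish the serf LAN (8301) and WAN (8302) gossip ports over TCP *and* UDP,
# plus the server RPC port (8300), and advertise the host's routable IP.
# 10.0.10.254 is a placeholder for the host address.
docker run -d -h node1 \
  -p 8300:8300 \
  -p 8301:8301 -p 8301:8301/udp \
  -p 8302:8302 -p 8302:8302/udp \
  progrium/consul -server -advertise 10.0.10.254 -bootstrap-expect 1
```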
The latest tag at https://registry.hub.docker.com/u/progrium/consul/ picked up the new Consul 0.5.0 release amazingly quickly; thanks for that!
It would be nice to be able to reference releases explicitly by tag: docker pull progrium/consul:consul-0.5
I got this error when trying to use -atlas and -atlas-join:
scada-client: failed to dial: x509: failed to load system roots and no roots provided
Thanks.
I want to use the awesome cmd:run feature to bootstrap a client node.
I've been trying to think how to neatly slot that in to the cmd:run command without getting in the way of the docker arguments.
The best I can come up with is to append ::client after the join-ip on the assumption that clients would specify both an advertise and join ip (as opposed to the first server which just has advertise ip but therefore no client flag needed).
So something like this:
cmd:run 10.0.1.1::10.0.1.2::client -d -v /mnt:/data
Then the -server flag can be filtered out if the client flag exists.
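A sketch of how that suffix could be parsed inside start; the advertise::join::client layout comes from the proposal above, while the function name and output format are my own assumptions:

```shell
# Split "ADVERTISE::JOIN[::client]" into its fields and report whether the
# -server flag should be dropped (1 = client mode, 0 = server mode).
parse_cmd_run_arg() {
  arg="$1"
  advertise="${arg%%::*}"      # first field: advertise IP
  rest="${arg#*::}"
  join_ip="${rest%%::*}"       # second field: join IP
  client=0
  case "$arg" in
    *::client) client=1 ;;     # trailing ::client requests a client agent
  esac
  echo "$advertise $join_ip $client"
}
```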
Do you have a preference for if/how this should be done? I didn't want to just open a pull request assuming it's a good idea.
P.S. This whole docker-consul world you are building is brilliant.
If I wanted to create a script to check memory usage, or use pgrep to make sure something was running on the node, how would I do that here?
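One way (a sketch on my part, not an existing feature of this image) is a check script that maps pgrep's result onto Consul's exit-code convention, where 0 is passing, 1 is warning, and anything else is critical:

```shell
# check_process NAME -- Consul-style script check for a running process.
# Exits 0 (passing) when pgrep finds a match, 2 (critical) otherwise.
check_process() {
  if pgrep -f "$1" >/dev/null 2>&1; then
    echo "OK: $1 is running"
    return 0
  else
    echo "CRITICAL: $1 is not running"
    return 2
  fi
}
```

Registered with a "script" and "interval" in a check definition, the return code then drives the health state; a memory check would work the same way, comparing usage against a threshold and returning 0, 1, or 2.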
Consul is running in single-node mode:
docker run -d -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap-expect 1
so after a restart it seems like Consul cannot find itself:
2014/09/15 16:46:20 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
2014/09/15 16:46:21 [WARN] raft: Election timeout reached, restarting election
2014/09/15 16:46:21 [INFO] raft: Node at 172.17.0.13:8300 [Candidate] entering Candidate state
2014/09/15 16:46:23 [WARN] raft: Election timeout reached, restarting election
2014/09/15 16:46:23 [INFO] raft: Node at 172.17.0.13:8300 [Candidate] entering Candidate state
2014/09/15 16:46:24 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: i/o timeout
2014/09/15 16:46:24 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
2014/09/15 16:46:24 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
2014/09/15 16:46:24 [WARN] raft: Election timeout reached, restarting election
2014/09/15 16:46:24 [INFO] raft: Node at 172.17.0.13:8300 [Candidate] entering Candidate state
2014/09/15 16:46:25 [ERR] agent: failed to sync remote state: No cluster leader
2014/09/15 16:46:26 [WARN] raft: Election timeout reached, restarting election
2014/09/15 16:46:26 [INFO] raft: Node at 172.17.0.13:8300 [Candidate] entering Candidate state
2014/09/15 16:46:27 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
2014/09/15 16:46:27 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
I tried adding a health check via the HTTP API (is there another way?).
curl -XPUT -d '{"Name": "test", "Check": { "script": "docker run --rm -v /var/run/docker.sock:/var/run/docker.sock --entrypoint /bin/bash progrium/consul check-http e1d219c59ded 2368 / -L", "interval": "10s" }}' http://127.0.0.1:8500/v1/agent/service/register -v
Here's the result:
time="2015-03-29T04:02:14Z" level="fatal" msg="Post http:///var/run/docker.sock/v1.17/containers/create: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
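For what it's worth, that "no such file or directory" on /var/run/docker.sock suggests the check script is executing inside a Consul container that doesn't have the Docker socket mounted. A sketch of starting the container with the socket (and a docker client binary) available — purely an assumption about this setup:

```shell
# Hypothetical invocation: mount the host's Docker socket and client binary
# into the Consul container so script checks can run `docker run ...`.
docker run -d --name consul \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$(which docker)":/usr/local/bin/docker \
  -p 8400:8400 -p 8500:8500 \
  progrium/consul -server -bootstrap
```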
Hi, I have pulled Consul with "docker pull progrium/consul" and started it with docker-compose ("docker-compose up -d consul"):
consul:
image: docker.io/progrium/consul
command: run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap -ui-dir /ui
The container starts and stays up, but if I run "curl http://localhost:8500/ui" I get "curl: (7) Failed connect to localhost:8500; Connection refused".
UPDATE
If I run "netstat -lntu", there are no daemons listening on ports 8500, 8600, or 53.
Any idea what this means?
Error response from daemon: client and server don't have same version (client: 1.17, server: 1.15)
docker: "run" requires a minimum of 1 argument. See 'docker run --help'.
I get it when I use the "check-cmd {container_name}".
I've got 2 hosts running Docker. Consul is also running in a Docker container.
[root@docker2 ~]# docker exec -it 7e74638cb211 consul members
Node Address Status Type Build Protocol
node1 172.16.131.253:8301 alive server 0.5.0 2
node2 172.16.131.252:8301 alive server 0.5.0 2
It shows both my nodes as alive, whereas my node2 is not even running Docker at the moment. Docker on that node has actually kind of crashed, but Consul is not able to detect that.
[root@docker1 ~]# docker ps
FATA[0000] Cannot connect to the Docker daemon. Is 'docker -d' running on this host?
Please suggest.
The ONBUILD ADD ./config /config/ instruction in the Dockerfile forces child Dockerfiles to have their own config directory. This makes sense in most cases, which follow the common pattern of including configuration files in a dependent image. The problem is that it prevents dependent images that make no changes to config from being built, since the ONBUILD trigger doesn't look like it can be turned off. My example is building an image with a specific version of Docker (right now this image has Docker 1.5.0 and CoreOS's stable release is only up to 1.4.1).
I would argue that users who want to attach their own configs to an image will know they need a command for that and won't easily miss using their own ADD/COPY statement. Since the cost of having the ONBUILD trigger is preventing other use cases from building without a workaround, I ask that you consider removing it.
I'm running consul with default settings: docker run -d -p 8400:8400 -p 8500:8500 -p 172.17.42.1:53:53/udp -h consul progrium/consul:latest -server -bootstrap
and I have "--dns 172.17.42.1" in DOCKER_OPTS.
Then, if I issue dig @172.17.42.1 consul.node.consul from my host shell, I get:
;; ANSWER SECTION:
consul.node.consul. 0 IN A 172.17.0.6
but docker run --rm aanand/docker-dnsutils dig consul.node.consul times out.
Would it be worth adding documentation on how to run on AWS, like you've done with boot2docker for OS X? I may have gotten completely mixed up, but this ended up working for me:
docker --daemon \
--host=fd:// \
--dns 172.17.42.1 \
--dns $(awk '/^nameserver/{print $2}' /etc/resolv.conf) \
--dns-search service.consul \
$DOCKER_OPTS
The first --dns is the docker0 bridge, so the value tends to default to 172.17.42.1. The second --dns gets the AWS-provided nameserver. If you don't set this up and use 8.8.8.8 per the documentation, you need to open up the :53 ports in the AWS VPC security groups and network ACLs.
@progrium what do you think? If this smells right, I'll submit a PR to add this to the README.md
thanks!
EDIT: removed --dns $(ip ro | awk '/docker0/{print $9}') and replaced it with --dns 172.17.42.1, as the docker0 bridge doesn't exist on initial boot.
Hello,
I use Consul with lots of Docker projects that are based on CentOS, and it is rather tedious and inefficient having to install unzip on every container. Please provide the package as a tar archive (and relatives) as well.
Better yet, release them through GitHub, as its servers are a lot faster than your current setup.
Thanks for the great work.
If I bring my container up with Docker 1.4.0 I get the following error:
consul_1 | ==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
consul_1 | ==> Starting Consul agent...
consul_1 | ==> Starting Consul agent RPC...
consul_1 | ==> Joining cluster...
consul_1 | ==> Reading remote state failed: EOF
opt_consul_1 exited with code 1
If I downgrade to 1.3.3 and re-run the container, it works.
Hi,
I am trying to rebuild the image using https://dl.bintray.com/mitchellh/consul/0.3.1_linux_amd64.zip; however, the build process reports this error:
Step 6 : RUN opkg-install curl bash
---> Running in bc940d92ad0e
Downloading http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/Packages.gz.
Inflating http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/Packages.gz.
Updated list of available packages in /var/opkg-lists/snapshots.
Installing curl (7.36.0-1) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/curl_7.36.0-1_x86_64.ipk.
Installing libcurl (7.36.0-1) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/libcurl_7.36.0-1_x86_64.ipk.
Installing libpolarssl (1.3.7-1) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/libpolarssl_1.3.7-1_x86_64.ipk.
Unknown package 'bash'.
Configuring libpolarssl.
Configuring libcurl.
Configuring curl.
Collected errors:
* opkg_install_cmd: Cannot install package bash.
When trying to execute the image, it will not run, reporting:
Error response from daemon: Cannot start container 1745aaa4835958e2459647cd3790dfe3d96b066b74b183e8eadfbf3263fba7f2: no such file or directory
Which I believe is due to 'start' requiring /bin/bash
Do you have any idea why bash is no longer listed in http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/Packages.gz and not available for download? I haven't found the build origin for snapshots/trunk yet.
This problem is shared between docker-consul and registrator, but the simple fix would be applied to this repo.
When I try to set an env var for registrator like this:
SERVICE_5672_CHECK_CMD=/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview
I get errors of the form in my consul logs:
exec: "/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview": stat /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview: no such file or directory
If I emulate the command by hand by running it from the command line, I can replicate the error when putting quotes around the command:
$ docker run --rm --net container:rabbitmq nadt.net/nadt-rabbitmq:latest "/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview"
exec: "/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview": stat /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview: no such file or directory2014/10/06 23:01:40 Error response from daemon: Cannot start container 886bf462d44611b2a8282757f5d164c9ab8b78bf834269c6fd288833b66fea9e: exec: "/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview": stat /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview: no such file or directory
but if I leave them off, the check works:
$ docker run --rm --net container:rabbitmq nadt.net/nadt-rabbitmq:latest /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview
+------------------+--------------------+-----------------------+----------------------+
| rabbitmq_version | cluster_name | queue_totals.messages | object_totals.queues |
+------------------+--------------------+-----------------------+----------------------+
| 3.3.5 | rabbit@rabbit-nadt | 0 | 0 |
+------------------+--------------------+-----------------------+----------------------+
check-cmd always puts the entire command in quotes, but given that docker attempts to stat the command, I'm not sure they are serving any real purpose. Any protection from injection attacks is easily worked around, so they just seem to stop you from even using a command with spaces in it.
Do you think the quotes in check-cmd can be safely dropped? Am I missing something obvious that would cause that to break existing behavior?
I considered wrapping the check command into a script that I bake into an image derived from progrium/docker-consul, but then I have to bake in the creds too. The way check-cmd is invoked, I have no mechanism to provide the creds as "check hints" in the monitored container.
I was able to work around the problem by using this env var:
SERVICE_5672_SCRIPT=docker run --rm --net container:rabbitmq nadt.net/nadt-rabbitmq:latest /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview
But that's just noisier than it should be.
An alternate approach would be to add more hints to registrator; something like:
SERVICE_5672_CHECK_ENV=RABBIT_USER=monitor;RABBIT_PASS=foobar
and somehow turn those into the equivalent -e switches passed to the docker run that calls check-cmd. I'm not sure that's any better or more secure, but it definitely makes registrator's env var parsing more complex and creates an edge case around the join character: what if an env var needs to contain a semicolon, etc.
Cheers
Rather than using an IP link as shown in the README, we can now use the "/etc/hosts"-based Docker links feature, e.g. to link node2 with node1 without having to inspect node1's IP address:
docker run -d --name node2 --link node1:node1 -h node2 progrium/consul -server -join node1
Hi
Is it possible to -join multiple IPs?
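If it helps: the Consul agent accepts the -join flag more than once, so (assuming this image passes its trailing arguments straight through to consul agent) something like this should try each peer:

```shell
# Repeat -join once per peer; the agent attempts each address in turn.
# The IPs are placeholders.
docker run -d -h node3 progrium/consul \
  -join 10.0.1.1 -join 10.0.1.2 -join 10.0.1.3
```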
Hi
Not sure if this is a bug or not.
CoreOS 440, progrium/consul:latest
This is what I see on 'top':
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23475 root 20 0 40.466g 16432 12488 S 0.3 0.4 0:39.27 consul
I know that many times VIRT can be large for various reasons (file mapping, etc), but 40g seems like a bit too much.
What do you think?
Currently, Docker's --rm switch for transient containers is ignored when using the cmd:run approach (i.e. docker run --rm progrium/consul cmd:run).
How do I make the DNS FQDNs of my docker consul nodes/services accessible to all my host machines?
Jonathans-MacBook-Pro:servicetown jonathan$ docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==================== FROM ANOTHER SHELL ==========================
Jonathans-MacBook-Pro:servicetown jonathan$ curl localhost:8500/v1/catalog/nodes
curl: (7) Failed to connect to localhost port 8500: Connection refused
Hello,
It is not clear to me where the Consul configuration should be updated.
I would like to use DNS forwarding feature.
Thanks
I'm using Registrator to run a health check against a URL using curl.
However, the check fails because the outgoing port is somehow blocked from inside the Consul container?!
I've been unable to identify why the URL is not reachable from inside the Consul container but is reachable from everywhere else: the host, other containers on the same host, and other hosts.
We're running AWS EC2 instances within a VPC, with many Docker containers running under [mainly] Docker 1.3.3. We're finding that the health checks within the Consul container are not terminating properly and are leaving behind defunct/zombie processes. In some containers there are only a few zombies; on other machines it can run into the thousands, probably because "interval" is 5 seconds. This causes the process list to fill over a few days, which in turn locks up the Docker host machine. If the process list hasn't completely filled, killing the Consul container terminates all these zombies and frees the resources, but obviously that is not an optimal solution. We're running the latest Consul 0.4.1 in the progrium/consul image from Docker Hub. Consul has been amazing so far, and this is the only glaring issue we've run into. Anyone have any ideas? Here is an example health-check .json file, similar to one we're using:
{
"id": "fooid",
"name": "fooname",
"tags": [
"footag"
],
"port": 12345,
"check": {
"script": "/opt/consul/foo_health.sh 12.34.56.78 12345",
"interval": "5s",
"note": ""
}
}
The bash script we're using looks like:
#!/bin/bash
IP=$1
PORT=$2
OUT=$(curl -s --data-binary '{"jsonrpc": "1.0", "id":"healthcheck", "method": "getwork", "params": [] }' -H 'content-type: text/plain;' http://ziftr:abc123@"$IP":"$PORT"/)
echo "$OUT" | grep '"error":null' > /dev/null
if [ $? -gt 0 ]; then
echo "$OUT" | sed -n -e 's/.*\"error":{\([^}]\+\).*/\1/p' | sed -n -e 's/.*\"message":"\([^"]\+\).*/\1/p'
echo && echo 'Raw message:' "$OUT"
exit 2
else
echo OK
exit 0
fi
In the README.md it mentions "You can also manually reset the cache." If there is a reproducible solution, can you please supply the exact command line that will perform this? Is this something that needs to be run only on the host that went through a Consul restart, or on other hosts in the cluster as well? Does it need to be run in the Docker container's networking namespace (e.g. using nsenter --net), or simply on the host?
I've been running into this issue, and trying the following things on the host doesn't seem to help (either after stopping docker-consul and/or the Docker daemon, or while either one of them is still running). The only thing that works is waiting a few minutes, but that's not an acceptable solution for my environment at the moment.
ip -s -s neigh flush all
or
nsenter --target ${CONSUL_DOCKER_PID} --net ip -s -s neigh flush all
or
arp -i docker0 -d <docker0_ip_addr_of_consul_container>
The documented setup for starting a node in a production environment is:
$ docker run -d -h node1 -v /mnt:/data \
-p 10.0.1.1:8300:8300 \
-p 10.0.1.1:8301:8301 \
-p 10.0.1.1:8301:8301/udp \
-p 10.0.1.1:8302:8302 \
-p 10.0.1.1:8302:8302/udp \
-p 10.0.1.1:8400:8400 \
-p 10.0.1.1:8500:8500 \
-p 10.0.1.1:8600:53/udp \
progrium/consul -server -advertise 10.0.1.1 -bootstrap-expect 3
Should be:
$ docker run -d -h node1 -v /mnt:/data \
-p 10.0.1.1:8300:8300 \
-p 10.0.1.1:8301:8301 \
-p 10.0.1.1:8301:8301/udp \
-p 10.0.1.1:8302:8302 \
-p 10.0.1.1:8302:8302/udp \
-p 10.0.1.1:8400:8400 \
-p 10.0.1.1:8500:8500 \
-p 172.17.42.1:53:53/udp \
progrium/consul -server -advertise 10.0.1.1 -bootstrap-expect 3
Otherwise I got a "no leader" error.
See the issue: hashicorp/consul#372
Hi.
On AWS, the cluster takes its own container IP as cluster address. This leads to a strange behaviour when using registrator, ambassadord and other services.
Is there any additional configuration step for running the container on AWS?
root@ip-10-209-199-14:~# ifconfig
docker0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:172.17.42.1 Bcast:0.0.0.0 Mask:255.255.0.0
eth0 Link encap:Ethernet HWaddr 22:00:0a:d1:c7:0e
inet addr:10.209.199.14 Bcast:10.209.199.63 Mask:255.255.255.192
root@ip-10-209-199-14:~# sudo docker run -h $HOSTNAME -v /mnt:/data -ti -p 10.209.199.14:8300:8300 -p 10.209.199.14:8301:8301 -p 10.209.199.14:8301:8301/udp -p 10.209.199.14:8302:8302 -p 10.209.199.14:8302:8302/udp -p 10.209.199.14:8400:8400 -p 10.209.199.14:8500:8500 -p 172.17.42.1:53:53/udp progrium/consul -server -bootstrap
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
Node name: 'ip-10-209-199-14'
Datacenter: 'dc1'
Server: true (bootstrap: true)
Client Addr: 0.0.0.0 (HTTP: 8500, DNS: 53, RPC: 8400)
Cluster Addr: 172.17.0.16 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
root@ip-10-209-199-14:~# curl http://172.17.0.16:8500/v1/catalog/service/consul
[{"Node":"ip-10-209-199-14","Address":"172.17.0.16","ServiceID":"consul","ServiceName":"consul","ServiceTags":[],"ServicePort":8300}]
Since bootstrapping works differently in Consul 0.3.1, all the references to bootstrapping can be updated now. You will no longer need to restart the 2nd node, for example.
Hello,
We have tried running a Consul cluster at Amazon using m3.large instances (which allow for moderate network performance).
The machines are running RHEL 6.5 with Docker 1.3.1 using the Consul Docker image.
The 3 machines joined a cluster, yet it was completely unstable.
It kept thrashing and the state of the members was constantly shifting.
Please keep in mind that this was false: the machine instances and Consul itself were perfectly stable and running.
We have not found any relevant documentation that would allow us to investigate and identify the cause of this.
Even when specifying leave_on_terminate in the Consul configuration (#34), the Consul agent does not gracefully leave the cluster when the container is stopped. It appears to be the start script not propagating signals to the Consul process (I think).
What is that "Failed to check for updates" on the last line?
lsoave@basenode:~$ sudo docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
Node name: 'node1'
Datacenter: 'dc1'
Server: true (bootstrap: true)
Client Addr: 0.0.0.0 (HTTP: 8500, DNS: 53, RPC: 8400)
Cluster Addr: 172.17.0.3 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2014/11/21 23:30:14 [INFO] serf: EventMemberJoin: node1 172.17.0.3
2014/11/21 23:30:14 [INFO] serf: EventMemberJoin: node1.dc1 172.17.0.3
2014/11/21 23:30:14 [INFO] raft: Node at 172.17.0.3:8300 [Follower] entering Follower state
2014/11/21 23:30:14 [INFO] consul: adding server node1 (Addr: 172.17.0.3:8300) (DC: dc1)
2014/11/21 23:30:14 [INFO] consul: adding server node1.dc1 (Addr: 172.17.0.3:8300) (DC: dc1)
2014/11/21 23:30:14 [ERR] agent: failed to sync remote state: No cluster leader
2014/11/21 23:30:15 [WARN] raft: Heartbeat timeout reached, starting election
2014/11/21 23:30:15 [INFO] raft: Node at 172.17.0.3:8300 [Candidate] entering Candidate state
2014/11/21 23:30:16 [INFO] raft: Election won. Tally: 1
2014/11/21 23:30:16 [INFO] raft: Node at 172.17.0.3:8300 [Leader] entering Leader state
2014/11/21 23:30:16 [INFO] consul: cluster leadership acquired
2014/11/21 23:30:16 [INFO] consul: New leader elected: node1
2014/11/21 23:30:16 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2014/11/21 23:30:16 [INFO] consul: member 'node1' joined, marking health alive
2014/11/21 23:30:16 [INFO] agent: Synced service 'consul'
==> Failed to check for updates: Get https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=4f33e3ab-7ef7-7220-5d0d-4844e9a08500&version=0.4.1: x509: failed to load system roots and no roots provided
via @armon
I realize that this may be an issue with fig/boot2docker more than docker-consul, but I thought I'd start here to collect info in order to open a bug report.
I can run docker-consul just fine from boot2docker on the command line:
$docker run -t -i -p 8400:8400 -p 8500:8500 -p 8600:53/udp -name consul progrium/consul -server -bootstrap
However, if I try to automate running Consul and Registrator in a dev environment using fig and this file:
consul:
ports:
- "8400:8400"
- "8500:8500"
- "8600:53/udp"
image: progrium/consul
hostname: node1
name: consul
command: -server -bootstrap
registrator:
links:
- consul:consul
hostname: dev.6sense.com
image: progrium/registrator
command: consul://consul:8500 --ttl 500
volumes:
- /var/run/docker.sock:/tmp/docker.sock
I get the following errors:
consul_1 | 2014/11/09 02:08:04 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:04 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:04 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/demanda?passing=1&tag=production&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/demanda?passing=1&tag=staging&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/dataapi?passing=1&tag=staging&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/ui?passing=1&tag=production&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/ui?passing=1&tag=staging&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [WARN] raft: Election timeout reached, restarting election
consul_1 | 2014/11/09 02:08:05 [INFO] raft: Node at 172.17.0.93:8300 [Candidate] entering Candidate state
consul_1 | 2014/11/09 02:08:07 [WARN] raft: Election timeout reached, restarting election
consul_1 | 2014/11/09 02:08:07 [INFO] raft: Node at 172.17.0.93:8300 [Candidate] entering Candidate state
consul_1 | 2014/11/09 02:08:07 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:07 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:08 [WARN] raft: Election timeout reached, restarting election
consul_1 | 2014/11/09 02:08:08 [INFO] raft: Node at 172.17.0.93:8300 [Candidate] entering Candidate state
consul_1 | 2014/11/09 02:08:09 [WARN] raft: Election timeout reached, restarting election
consul_1 | 2014/11/09 02:08:09 [INFO] raft: Node at 172.17.0.93:8300 [Candidate] entering Candidate state
Is this related to the ARP issue?
The dash shell (/bin/sh) doesn't allow hyphens in function names. Changing the main function name in start from cmd-run to cmd_run should allow running with this shell.
When I want to use the cli I run:
docker run --rm -e CONSUL_RPC_ADDR=addr:8400 -ti --entrypoint /bin/bash progrium/consul
Then I can run consul members.
Is there an easier way, or could you support this in your start script?
I am not a curl expert by any means, but when testing using the check_http command, I'm seeing a number of exceptions in my Spring Boot application related to the health check.
The exceptions are the result of a broken pipe when trying to write the error response when the health check is failing. I can't replicate the issue when testing from my host machine or when running the health check from inside the Consul container.
I found that if I removed the --fail option from the curl command, the exceptions would go away when the health check ran. Alternatively, I can leave the --fail and add a && sleep 1 to the end of the bash command in check-http.
So it appears that with the --fail option, curl is returning before the entire response has been read, which allows the Docker container to shut down, thus closing the connection and resulting in the broken pipe.
I have a cluster of several nodes running Consul. The cluster has the consul "client" agent running on all nodes. These client agents stop when they detect a server agent booting up on the machine.
Right now, I have the "stop" command (in systemd) set to docker exec consul-client consul leave. This way, the agents leave the cluster gracefully, without showing up as failed.
However, since Consul is PID 1 in the Docker container, Docker automatically stops the container (no problem there) but also returns exit code 255.
I can let systemd know that this is a valid exit code, but of course it's not always valid: if Consul ever crashes, 255 would be an invalid exit code.
Any thoughts on how to best handle this use-case?
Consul 0.3.1 is now available: http://www.consul.io/downloads.html and includes the following:
FEATURES:
BUG FIXES:
IMPROVEMENTS:
This is a somewhat convoluted issue, please bear with me.
It is related to these issues:
hashicorp/consul#602
hashicorp/consul#724
When running in an environment such as AWS, you may need (as I do) to be able to resolve both the Amazon servers (for example) and the services registered in the Consul cluster, from inside other containers.
There is an issue with the Consul Docker image where you must run the container using `--net="host"`; otherwise, communication is unstable.
Your local Consul agent is used as DNS, but it recurses to 8.8.8.8, an issue that was addressed in the links above.
So now you could use `-recursor=[internal-network-DNS]`, and you would be able to resolve both ec2.internal (for example) and service.consul.
Great, right?
But wait! You are using `--net="host"`, so you are getting your container's resolv.conf file from the host, and in that file the consul search domain and the localhost nameserver are not configured (isolation!). AND you can NOT use the `-dns` and `-dns-search` flags (and port 53 is now occupied).
So, for this feature to actually work, you will need to either get the Consul Docker container to work without `--net="host"`, OR allow the `-dns-search` flag so you could modify the /etc/resolv.conf file via the `docker run` command (or hack it in the Dockerfile using env vars and bash).
Otherwise this will require you to modify resolv.conf on each and every host you are running the Consul agent on, to something like:
search ec2.internal service.consul
nameserver 127.0.0.1
nameserver [internal-network-DNS]
which, of course, goes against the entire Docker concept of containment / isolation / run anywhere etc.
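For reference, the `-recursor` setup described above would look something like this (a sketch; 10.0.0.2 stands in for your internal network's DNS resolver):

```shell
# Host networking plus an internal recursor (sketch; addresses are examples).
docker run --net="host" -h consul --name=consul progrium/consul \
  -server -bootstrap -recursor=10.0.0.2
```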
Installing any package fails with the error "can't open '/lib/functions.sh'".
I've checked the sources, though, and don't understand how the file could have disappeared.
docker run --rm -it --entrypoint /bin/sh progrium/consul
/ # opkg-install <anything>
...
Configuring terminfo.
//usr/lib/opkg/info/terminfo.postinst: .: line 3: can't open '/lib/functions.sh'
And a lot of other lines with the same error.
When doing "docker stop" on a container, the consul process will receive a SIGTERM. However, the consul agent does not leave the cluster gracefully upon SIGTERM by default. This results in problems when the node is restarted because it is assigned a new IP address. This then triggers this bug: hashicorp/consul#457.
By adding `leave_on_terminate = true` to the consul configuration in /config/consul.json, a `docker stop` command would be treated by consul as a graceful shutdown of that agent.
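As a sketch, the relevant fragment of /config/consul.json (leaving out any other keys the file may have) would be just:

```json
{
  "leave_on_terminate": true
}
```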
I noticed that when I run the consul container, it actually ignores all my SIGINTs and SIGTERMs. For example:
docker run --name="foobar" --rm progrium/consul
When you press Ctrl+C, it doesn't stop. I checked the Docker source and confirmed that there is signal proxying. You can even try to kill it with
docker kill -s INT foobar
Still, it doesn't stop. The running program is actually /bin/bash, and it has a subprocess `consul`. I am not sure how bash handles signals, but in any case, it simply ignores interrupts.
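A common fix for this pattern, assuming the image's entrypoint is a bash wrapper (a sketch, not the actual script from this repo), is to `exec` consul so it replaces the shell as PID 1 and receives the proxied signals directly:

```shell
#!/bin/bash
# Entrypoint wrapper (sketch): "exec" replaces bash with consul instead of
# forking a child, so SIGINT/SIGTERM from "docker kill"/"docker stop"
# reach the consul process itself.
exec consul agent "$@"
```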
NB: cross posted in the consul Google group and progrium/registrator#97, for consistency.
It seems as though any health check I register, be it through the agent (attached to a service), through the catalog, or even through progrium/registrator, leads to a health check interval that is seemingly ignored.
E.g.
curl --include --request PUT --data-binary "{
\"ID\": \"nginx1\",
\"Name\": \"nginx\",
\"Port\": 80,
\"Check\": {
\"Name\": \"Nginx health check\",
\"Notes\": \"Script based health check\",
\"Status\": \"unknown\",
\"Script\": \"curl $IP/webhook/health\",
\"Interval\": \"1m\"
}
}" $IP:8500/v1/agent/service/register
This service gets hit on /webhook/health every five minutes, instead of every one minute (or whatever value of `Interval` I happen to set).
Hi,
I want to change the consul domain, which isn't possible with this Docker image.
I'm happy to contribute the necessary changes; just let me know if you have ideas how.
I would probably modify the JSON as part of the run script, or something...?
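For what it's worth, upstream Consul supports a `domain` key in its JSON configuration, so the change could boil down to merging something like this (the domain value here is just an example) into the generated config:

```json
{
  "domain": "mycorp."
}
```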
Hi there and thanks a lot for this.
When I run as documented:
docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap
Then consul keeps waiting for other nodes:
curl localhost:8500/v1/catalog/nodes
No cluster leader
This command seems to fix the problem:
docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap-expect 1
Is this the right way to start a single instance?
Consul uses Serf, and Serf uses its ports over both UDP and TCP; in the list of published ports you have to add 8301 and 8302 with /udp as well, otherwise there will be connectivity issues.
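Putting that together with the single-node command above, publishing the Serf ports over both protocols might look like this (a sketch; adjust ports and flags to your setup):

```shell
docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp \
  -p 8301:8301 -p 8301:8301/udp \
  -p 8302:8302 -p 8302:8302/udp \
  -h node1 progrium/consul -server -bootstrap-expect 1
```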