gliderlabs / docker-consul
Dockerized Consul
License: MIT License
We are building an automation platform using Registrator and Consul, and one use case sends docker kill --signal=HUP to an nginx container to reload an automatically generated load-balancer config whenever a K/V in Consul changes.
We had Consul installed on the base hosts and used a Consul watch to call a local script that sent the signal, something like:
docker kill --signal="HUP" loadbalancer
Would like to be able to do the same with containerized consul agents.
Since we can mount the docker socket, and the docker binary is already available, would this be as simple as creating a new shell script such as this
#!/bin/bash
docker kill --signal="HUP" loadbalancer
Perhaps this could be generalized so we could pass the name of the container to be HUPed, or even another parameter for the signal name?
I'm going to proceed with this, but I'm wondering if you have better ideas. Perhaps this is something you have already considered and rejected, or do another way?
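For what it's worth, here is a minimal sketch of that generalized handler; the function name and the defaults are my own invention, not something from this repo:

```shell
# Generalized watch handler: send an arbitrary signal to a named container.
# The defaults (loadbalancer / HUP) are hypothetical placeholders.
hup_container() {
  container="${1:-loadbalancer}"  # container to signal
  signal="${2:-HUP}"              # signal to send
  docker kill --signal="$signal" "$container"
}
```

A Consul watch could then invoke it as `hup_container loadbalancer HUP`, or with another container name and signal.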
First of all, if I run:
$ docker run -d -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h consul --name=consul progrium/consul -server -bootstrap -ui-dir /ui
$ curl http://$(docker-machine ip local):8500/v1/catalog/service/consul
[{"Node":"consul","Address":"172.17.0.17","ServiceID":"consul","ServiceName":"consul","ServiceTags":[],"ServiceAddress":"","ServicePort":8300}]
Why is "ServiceAddress" blank?
Then, when I run one of my containers, its /etc/resolv.conf looks correct:
nameserver 172.17.42.1
nameserver 8.8.8.8
search service.consul
But:
ping: unknown host consul
I noticed in the Dockerfile (master) that it's using Docker 1.5. I am using 1.6, so I tried my own image built with 1.6, but I get the same results.
I can ping:
PING 172.17.42.1 (172.17.42.1) 56(84) bytes of data.
64 bytes from 172.17.42.1: icmp_seq=1 ttl=64 time=0.144 ms
and when I use registrator and run another service, that service does resolve properly. So it seems that the consul container itself is the one having problems.
(I am on OS X, using docker-machine.)
I imagine this is something I'm doing wrong, but I can't seem to get my health checks to pass.
The 2 consul agents are running on different servers and those servers can access all ports of each other on both UDP and TCP.
Server Logs:
2014/09/07 00:34:58 [INFO] consul: member 'ip-10-0-10-254' joined, marking health alive
2014/09/07 00:34:59 [INFO] memberlist: Suspect ip-10-0-10-254 has failed, no acks received
2014/09/07 00:35:01 [INFO] memberlist: Suspect ip-10-0-10-254 has failed, no acks received
2014/09/07 00:35:03 [INFO] memberlist: Suspect ip-10-0-10-254 has failed, no acks received
2014/09/07 00:35:04 [INFO] memberlist: Marking ip-10-0-10-254 as failed, suspect timeout reached
2014/09/07 00:35:04 [INFO] serf: EventMemberFailed: ip-10-0-10-254 10.0.10.254
2014/09/07 00:35:04 [INFO] consul: removing server ip-10-0-10-254 (Addr: 10.0.10.254:8300) (DC: dc1)
2014/09/07 00:35:04 [INFO] consul: member 'ip-10-0-10-254' failed, marking health critical
Client Logs:
2014/09/07 00:32:28 [INFO] serf: EventMemberJoin: ip-10-0-20-6 10.0.20.6
2014/09/07 00:32:28 [INFO] consul: adding server ip-10-0-20-6 (Addr: 10.0.20.6:8300) (DC: dc1)
2014/09/07 00:32:29 [INFO] memberlist: Suspect ip-10-0-20-6 has failed, no acks received
2014/09/07 00:32:31 [INFO] memberlist: Suspect ip-10-0-20-6 has failed, no acks received
2014/09/07 00:32:33 [INFO] memberlist: Suspect ip-10-0-20-6 has failed, no acks received
2014/09/07 00:32:34 [INFO] memberlist: Marking ip-10-0-20-6 as failed, suspect timeout reached
2014/09/07 00:32:34 [INFO] serf: EventMemberFailed: ip-10-0-20-6 10.0.20.6
2014/09/07 00:32:34 [INFO] consul: removing server ip-10-0-20-6 (Addr: 10.0.20.6:8300) (DC: dc1)
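In case it helps others who hit the repeated "no acks received" pattern: memberlist probes over UDP as well as TCP, so when the agents are containerized, the serf gossip ports need to be published for both protocols and the host address advertised. A sketch (the address is a placeholder, not taken from the logs above):

```shell
# Publish the serf LAN (8301) and WAN (8302) gossip ports over TCP *and* UDP,
# plus the server RPC port (8300), and advertise the host's routable IP.
# 10.0.10.254 is a placeholder for the host address.
docker run -d -h node1 \
  -p 8300:8300 \
  -p 8301:8301 -p 8301:8301/udp \
  -p 8302:8302 -p 8302:8302/udp \
  progrium/consul -server -advertise 10.0.10.254 -bootstrap-expect 1
```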
The latest tag at https://registry.hub.docker.com/u/progrium/consul/ picked up the new Consul 0.5.0 release amazingly quickly; thanks for that!
It would be nice to be able to reference releases explicitly by tag: docker pull progrium/consul:consul-0.5
I got this error when trying to use -atlas and -atlas-join:
scada-client: failed to dial: x509: failed to load system roots and no roots provided
Thanks.
I want to use the awesome cmd:run feature to bootstrap a client node.
I've been trying to think how to neatly slot that in to the cmd:run command without getting in the way of the docker arguments.
The best I can come up with is to append ::client after the join-ip on the assumption that clients would specify both an advertise and join ip (as opposed to the first server which just has advertise ip but therefore no client flag needed).
So something like this:
cmd:run 10.0.1.1::10.0.1.2::client -d -v /mnt:/data
Then the -server flag can be filtered out if the client flag exists.
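A sketch of how that suffix could be parsed inside start; the advertise::join::client layout comes from the proposal above, while the function name and output format are my own assumptions:

```shell
# Split "ADVERTISE::JOIN[::client]" into its fields and report whether the
# -server flag should be dropped (1 = client mode, 0 = server mode).
parse_cmd_run_arg() {
  arg="$1"
  advertise="${arg%%::*}"      # first field: advertise IP
  rest="${arg#*::}"
  join_ip="${rest%%::*}"       # second field: join IP
  client=0
  case "$arg" in
    *::client) client=1 ;;     # trailing ::client requests a client agent
  esac
  echo "$advertise $join_ip $client"
}
```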
Do you have a preference for if/how this should be done? I didn't want to just open a pull request assuming it's a good idea.
P.S. This whole docker-consul world you are building is brilliant.
If I wanted to create a script to check memory usage, or use pgrep to make sure something was running on the node, how would I do that here?
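One way (a sketch on my part, not an existing feature of this image) is a check script that maps pgrep's result onto Consul's exit-code convention, where 0 is passing, 1 is warning, and anything else is critical:

```shell
# check_process NAME -- Consul-style script check for a running process.
# Exits 0 (passing) when pgrep finds a match, 2 (critical) otherwise.
check_process() {
  if pgrep -f "$1" >/dev/null 2>&1; then
    echo "OK: $1 is running"
    return 0
  else
    echo "CRITICAL: $1 is not running"
    return 2
  fi
}
```

Registered with a "script" and "interval" in a check definition, the return code then drives the health state; a memory check would work the same way, comparing usage against a threshold and returning 0, 1, or 2.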
Consul is running in single-node mode:
docker run -d -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap-expect 1
so after a restart it seems like Consul cannot find itself:
2014/09/15 16:46:20 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
2014/09/15 16:46:21 [WARN] raft: Election timeout reached, restarting election
2014/09/15 16:46:21 [INFO] raft: Node at 172.17.0.13:8300 [Candidate] entering Candidate state
2014/09/15 16:46:23 [WARN] raft: Election timeout reached, restarting election
2014/09/15 16:46:23 [INFO] raft: Node at 172.17.0.13:8300 [Candidate] entering Candidate state
2014/09/15 16:46:24 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: i/o timeout
2014/09/15 16:46:24 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
2014/09/15 16:46:24 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
2014/09/15 16:46:24 [WARN] raft: Election timeout reached, restarting election
2014/09/15 16:46:24 [INFO] raft: Node at 172.17.0.13:8300 [Candidate] entering Candidate state
2014/09/15 16:46:25 [ERR] agent: failed to sync remote state: No cluster leader
2014/09/15 16:46:26 [WARN] raft: Election timeout reached, restarting election
2014/09/15 16:46:26 [INFO] raft: Node at 172.17.0.13:8300 [Candidate] entering Candidate state
2014/09/15 16:46:27 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
2014/09/15 16:46:27 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.12:8300: dial tcp 172.17.0.12:8300: no route to host
I tried adding a health check via the HTTP API (is there another way?).
curl -XPUT -d '{"Name": "test", "Check": { "script": "docker run --rm -v /var/run/docker.sock:/var/run/docker.sock --entrypoint /bin/bash progrium/consul check-http e1d219c59ded 2368 / -L", "interval": "10s" }}' http://127.0.0.1:8500/v1/agent/service/register -v
Here's the result:
time="2015-03-29T04:02:14Z" level="fatal" msg="Post http:///var/run/docker.sock/v1.17/containers/create: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
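For what it's worth, that "no such file or directory" on /var/run/docker.sock suggests the check script is executing inside a Consul container that doesn't have the Docker socket mounted. A sketch of starting the container with the socket (and a docker client binary) available — purely an assumption about this setup:

```shell
# Hypothetical invocation: mount the host's Docker socket and client binary
# into the Consul container so script checks can run `docker run ...`.
docker run -d --name consul \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$(which docker)":/usr/local/bin/docker \
  -p 8400:8400 -p 8500:8500 \
  progrium/consul -server -bootstrap
```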
Hi, I have pulled Consul with "docker pull progrium/consul" and started it with docker-compose ("docker-compose up -d consul"):
consul:
image: docker.io/progrium/consul
command: run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap -ui-dir /ui
The container starts and stays up, but if I run "curl http://localhost:8500/ui" I get "curl: (7) Failed connect to localhost:8500; Connection refused".
UPDATE
If I run "netstat -lntu", there are no daemons listening on ports 8500, 8600, or 53.
Any idea what this means?
Error response from daemon: client and server don't have same version (client: 1.17, server: 1.15)
docker: "run" requires a minimum of 1 argument. See 'docker run --help'.
I get it when I use the "check-cmd {container_name}".
I've got 2 hosts running Docker. Consul is also running in a Docker container.
[root@docker2 ~]# docker exec -it 7e74638cb211 consul members
Node Address Status Type Build Protocol
node1 172.16.131.253:8301 alive server 0.5.0 2
node2 172.16.131.252:8301 alive server 0.5.0 2
It shows both my nodes as alive, whereas my node2 is not even running Docker at the moment. Docker on that node has actually kind of crashed, but Consul is not able to detect that.
[root@docker1 ~]# docker ps
FATA[0000] Cannot connect to the Docker daemon. Is 'docker -d' running on this host?
Please suggest.
The ONBUILD ADD ./config /config/ instruction in the Dockerfile forces child Dockerfiles to have their own config directory. This makes sense in most cases, which follow the common pattern of including configuration files in a dependent image. The problem is that it prevents dependent images that make no changes to config from being built, since the ONBUILD trigger doesn't look like it can be turned off. My example is building an image with a specific version of Docker (right now this image has Docker 1.5.0 and CoreOS's stable release is only up to 1.4.1).
I would argue that users who want to attach their own configs to an image will know they need a command for that and won't easily miss using their own ADD/COPY statement. Since the cost of having the ONBUILD trigger is preventing other use cases from building without a workaround, I ask that you consider removing it.
I'm running consul with default settings: docker run -d -p 8400:8400 -p 8500:8500 -p 172.17.42.1:53:53/udp -h consul progrium/consul:latest -server -bootstrap
and I have "--dns 172.17.42.1" in DOCKER_OPTS.
Then, if I issue dig @172.17.42.1 consul.node.consul from my host shell, I get:
;; ANSWER SECTION:
consul.node.consul. 0 IN A 172.17.0.6
but docker run --rm aanand/docker-dnsutils dig consul.node.consul times out.
Would it be worth adding documentation on how to run on AWS, like you've done with boot2docker for OS X? I may have gotten completely mixed up, but this ended up working for me:
docker --daemon \
--host=fd:// \
--dns 172.17.42.1 \
--dns $(awk '/^nameserver/{print $2}' /etc/resolv.conf) \
--dns-search service.consul \
$DOCKER_OPTS
The first --dns is the docker0 bridge, so the value tends to default to 172.17.42.1. The second --dns gets the AWS-provided nameserver. If you don't set this up and use 8.8.8.8 per the documentation, you need to open up the :53 ports in the AWS VPC security groups and network ACLs.
@progrium what do you think? If this smells right, I'll submit a PR to add this to the README.md
thanks!
EDIT: removed --dns $(ip ro | awk '/docker0/{print $9}') and replaced it with --dns 172.17.42.1, as the docker0 bridge doesn't exist on initial boot.
Hello,
I use Consul with lots of Docker projects that are based on CentOS, and it is rather tedious and inefficient having to install unzip on every container. Please provide the package as a tar archive (and relatives) as well.
Better yet, release them through GitHub, as its servers are a lot faster than your current setup.
Thanks for the great work.
If I bring my container up with Docker 1.4.0 I get the following error:
consul_1 | ==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
consul_1 | ==> Starting Consul agent...
consul_1 | ==> Starting Consul agent RPC...
consul_1 | ==> Joining cluster...
consul_1 | ==> Reading remote state failed: EOF
opt_consul_1 exited with code 1
If I downgrade to 1.3.3 and re-run the container, it works.
Hi,
I am trying to rebuild the image using https://dl.bintray.com/mitchellh/consul/0.3.1_linux_amd64.zip; however, the build process reports this error:
Step 6 : RUN opkg-install curl bash
---> Running in bc940d92ad0e
Downloading http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/Packages.gz.
Inflating http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/Packages.gz.
Updated list of available packages in /var/opkg-lists/snapshots.
Installing curl (7.36.0-1) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/curl_7.36.0-1_x86_64.ipk.
Installing libcurl (7.36.0-1) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/libcurl_7.36.0-1_x86_64.ipk.
Installing libpolarssl (1.3.7-1) to root...
Downloading http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/libpolarssl_1.3.7-1_x86_64.ipk.
Unknown package 'bash'.
Configuring libpolarssl.
Configuring libcurl.
Configuring curl.
Collected errors:
* opkg_install_cmd: Cannot install package bash.
When trying to execute the image, it will not run, reporting:
Error response from daemon: Cannot start container 1745aaa4835958e2459647cd3790dfe3d96b066b74b183e8eadfbf3263fba7f2: no such file or directory
Which I believe is due to 'start' requiring /bin/bash
Do you have any idea why bash is no longer listed in http://downloads.openwrt.org/snapshots/trunk/x86_64/packages/Packages.gz and not available for download? I haven't found the build origin for snapshots/trunk yet.
This problem is shared between docker-consul and registrator, but the simple fix would be applied to this repo.
When I try to set an env var for registrator like this:
SERVICE_5672_CHECK_CMD=/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview
I get errors of the form in my consul logs:
exec: "/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview": stat /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview: no such file or directory
If I emulate the command by hand by running it from the command line, I can replicate the error when putting quotes around the command:
$ docker run --rm --net container:rabbitmq nadt.net/nadt-rabbitmq:latest "/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview"
exec: "/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview": stat /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview: no such file or directory2014/10/06 23:01:40 Error response from daemon: Cannot start container 886bf462d44611b2a8282757f5d164c9ab8b78bf834269c6fd288833b66fea9e: exec: "/usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview": stat /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview: no such file or directory
but if I leave them off, the check works:
$ docker run --rm --net container:rabbitmq nadt.net/nadt-rabbitmq:latest /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview
+------------------+--------------------+-----------------------+----------------------+
| rabbitmq_version | cluster_name | queue_totals.messages | object_totals.queues |
+------------------+--------------------+-----------------------+----------------------+
| 3.3.5 | rabbit@rabbit-nadt | 0 | 0 |
+------------------+--------------------+-----------------------+----------------------+
check-cmd always puts the entire command in quotes, but given that docker attempts to stat the command, I'm not sure they are serving any real purpose. Any protection from injection attacks is easily worked around, so they just seem to stop you from even using a command with spaces in it.
Do you think the quotes in check-cmd can be safely dropped? Am I missing something obvious that would cause that to break existing behavior?
I considered wrapping the check command into a script that I bake into an image derived from progrium/docker-consul, but then I have to bake in the creds too. The way check-cmd is invoked, I have no mechanism to provide the creds as "check hints" in the monitored container.
I was able to work around the problem by using this env var:
SERVICE_5672_SCRIPT=docker run --rm --net container:rabbitmq nadt.net/nadt-rabbitmq:latest /usr/bin/rabbitmqadmin -u 'monitor' -p 'foobar' show overview
But that's just noisier than it should be.
An alternate approach would be to add more hints to registrator; something like:
SERVICE_5672_CHECK_ENV=RABBIT_USER=monitor;RABBIT_PASS=foobar
and somehow turn those into the equivalent -e switches passed to the docker run that calls check-cmd. I'm not sure that's any better or more secure, but it definitely makes registrator's env var parsing more complex and creates an edge case around the join character: what if an env var needs to contain a semicolon, etc.
Cheers
Rather than using an IP link as shown in the README, we can now use the "/etc/hosts"-based Docker links feature, e.g. to link node2 with node1 without having to inspect node1's IP address:
docker run -d --name node2 --link node1:node1 -h node2 progrium/consul -server -join node1
Hi
Is it possible to -join multiple IPs?
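If it helps: the Consul agent accepts the -join flag more than once, so (assuming this image passes its trailing arguments straight through to consul agent) something like this should try each peer:

```shell
# Repeat -join once per peer; the agent attempts each address in turn.
# The IPs are placeholders.
docker run -d -h node3 progrium/consul \
  -join 10.0.1.1 -join 10.0.1.2 -join 10.0.1.3
```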
Hi
Not sure if this is a bug or not.
CoreOS 440, progrium/consul:latest
This is what I see on 'top':
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23475 root 20 0 40.466g 16432 12488 S 0.3 0.4 0:39.27 consul
I know that many times VIRT can be large for various reasons (file mapping, etc), but 40g seems like a bit too much.
What do you think?
Currently, Docker's --rm switch for transient containers is ignored when using the cmd:run approach (i.e. docker run --rm progrium/consul cmd:run).
How do I make the DNS FQDNs of my docker consul nodes/services accessible to all my host machines?
Jonathans-MacBook-Pro:servicetown jonathan$ docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==================== FROM ANOTHER SHELL ==========================
Jonathans-MacBook-Pro:servicetown jonathan$ curl localhost:8500/v1/catalog/nodes
curl: (7) Failed to connect to localhost port 8500: Connection refused
Hello,
It is not clear to me where the Consul configuration should be updated.
I would like to use DNS forwarding feature.
Thanks
I'm using Registrator to run a health check against a URL using curl.
However, the check fails because the outgoing port is somehow blocked from inside the Consul container?!
I've been unable to identify why the URL is not reachable from inside the Consul container but is reachable from everywhere else: the host, other containers on the same host, and other hosts.
We're running AWS EC2 instances within a VPC, with many Docker containers running under [mainly] Docker 1.3.3. We're finding that the health checks within the Consul container are not terminating properly and are leaving behind defunct/zombie processes. In some containers there are only a few zombies; on other machines it can run into the thousands, probably because "interval" is 5 seconds. This causes the process list to fill over a few days, which in turn locks up the Docker host machine. If the process list hasn't completely filled, killing the Consul container terminates all these zombies and frees the resources, but obviously that is not an optimal solution. We're running the latest Consul 0.4.1 in the progrium/consul image from Docker Hub. Consul has been amazing so far, and this is the only glaring issue we've run into. Anyone have any ideas? Here is an example health-check .json file, similar to one we're using:
{
"id": "fooid",
"name": "fooname",
"tags": [
"footag"
],
"port": 12345,
"check": {
"script": "/opt/consul/foo_health.sh 12.34.56.78 12345",
"interval": "5s",
"note": ""
}
}
The bash script we're using looks like:
#!/bin/bash
IP=$1
PORT=$2
OUT=$(curl -s --data-binary '{"jsonrpc": "1.0", "id":"healthcheck", "method": "getwork", "params": [] }' -H 'content-type: text/plain;' http://ziftr:abc123@"$IP":"$PORT"/)
echo "$OUT" | grep '"error":null' > /dev/null
if [ $? -gt 0 ]; then
echo "$OUT" | sed -n -e 's/.*\"error":{\([^}]\+\).*/\1/p' | sed -n -e 's/.*\"message":"\([^"]\+\).*/\1/p'
echo && echo 'Raw message:' "$OUT"
exit 2
else
echo OK
exit 0
fi
In the README.md it mentions "You can also manually reset the cache." If there is a reproducible solution, can you please supply the exact command line that will perform this? Is this something that needs to be run only on the host that went through a Consul restart, or on other hosts in the cluster as well? Does it need to be run in the Docker container's networking namespace (e.g. using nsenter --net), or simply on the host?
I've been running into this issue, and trying the following things on the host doesn't seem to help (either after stopping docker-consul and/or the Docker daemon, or while either one of them is still running). The only thing that works is waiting a few minutes, but that's not an acceptable solution for my environment at the moment.
ip -s -s neigh flush all
or
nsenter --target ${CONSUL_DOCKER_PID} --net ip -s -s neigh flush all
or
arp -i docker0 -d <docker0_ip_addr_of_consul_container>
The documented setup for starting a node in a production environment is:
$ docker run -d -h node1 -v /mnt:/data \
-p 10.0.1.1:8300:8300 \
-p 10.0.1.1:8301:8301 \
-p 10.0.1.1:8301:8301/udp \
-p 10.0.1.1:8302:8302 \
-p 10.0.1.1:8302:8302/udp \
-p 10.0.1.1:8400:8400 \
-p 10.0.1.1:8500:8500 \
-p 10.0.1.1:8600:53/udp \
progrium/consul -server -advertise 10.0.1.1 -bootstrap-expect 3
Should be:
$ docker run -d -h node1 -v /mnt:/data \
-p 10.0.1.1:8300:8300 \
-p 10.0.1.1:8301:8301 \
-p 10.0.1.1:8301:8301/udp \
-p 10.0.1.1:8302:8302 \
-p 10.0.1.1:8302:8302/udp \
-p 10.0.1.1:8400:8400 \
-p 10.0.1.1:8500:8500 \
-p 172.17.42.1:53:53/udp \
progrium/consul -server -advertise 10.0.1.1 -bootstrap-expect 3
Otherwise I got a "no leader" error.
See the issue: hashicorp/consul#372
Hi.
On AWS, the cluster takes its own container IP as cluster address. This leads to a strange behaviour when using registrator, ambassadord and other services.
Is there any additional configuration step for running the container on AWS?
root@ip-10-209-199-14:~# ifconfig
docker0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:172.17.42.1 Bcast:0.0.0.0 Mask:255.255.0.0
eth0 Link encap:Ethernet HWaddr 22:00:0a:d1:c7:0e
inet addr:10.209.199.14 Bcast:10.209.199.63 Mask:255.255.255.192
root@ip-10-209-199-14:~# sudo docker run -h $HOSTNAME -v /mnt:/data -ti -p 10.209.199.14:8300:8300 -p 10.209.199.14:8301:8301 -p 10.209.199.14:8301:8301/udp -p 10.209.199.14:8302:8302 -p 10.209.199.14:8302:8302/udp -p 10.209.199.14:8400:8400 -p 10.209.199.14:8500:8500 -p 172.17.42.1:53:53/udp progrium/consul -server -bootstrap
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
Node name: 'ip-10-209-199-14'
Datacenter: 'dc1'
Server: true (bootstrap: true)
Client Addr: 0.0.0.0 (HTTP: 8500, DNS: 53, RPC: 8400)
Cluster Addr: 172.17.0.16 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
root@ip-10-209-199-14:~# curl http://172.17.0.16:8500/v1/catalog/service/consul
[{"Node":"ip-10-209-199-14","Address":"172.17.0.16","ServiceID":"consul","ServiceName":"consul","ServiceTags":[],"ServicePort":8300}]
Since bootstrapping works differently in Consul 0.3.1, all the references to bootstrapping can be updated now. You will no longer need to restart the 2nd node, for example.
Hello,
We have tried running a Consul cluster at Amazon using m3.large instances (which allow for moderate network performance).
The machines are running RHEL 6.5 with Docker 1.3.1 using the Consul Docker image.
The 3 machines joined a cluster, yet it was completely unstable.
It kept thrashing and the state of the members was constantly shifting.
Please keep in mind that this was false: the machine instances and Consul itself were perfectly stable and running.
We have not found any relevant documentation that would allow us to investigate and identify the cause of this.
Even when specifying leave_on_terminate in the Consul configuration (#34), the Consul agent does not gracefully leave the cluster when the container is stopped. It appears to be the start script not propagating signals to the Consul process (I think).
What is that "Failed to check for updates" on the last line?
lsoave@basenode:~$ sudo docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
Node name: 'node1'
Datacenter: 'dc1'
Server: true (bootstrap: true)
Client Addr: 0.0.0.0 (HTTP: 8500, DNS: 53, RPC: 8400)
Cluster Addr: 172.17.0.3 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2014/11/21 23:30:14 [INFO] serf: EventMemberJoin: node1 172.17.0.3
2014/11/21 23:30:14 [INFO] serf: EventMemberJoin: node1.dc1 172.17.0.3
2014/11/21 23:30:14 [INFO] raft: Node at 172.17.0.3:8300 [Follower] entering Follower state
2014/11/21 23:30:14 [INFO] consul: adding server node1 (Addr: 172.17.0.3:8300) (DC: dc1)
2014/11/21 23:30:14 [INFO] consul: adding server node1.dc1 (Addr: 172.17.0.3:8300) (DC: dc1)
2014/11/21 23:30:14 [ERR] agent: failed to sync remote state: No cluster leader
2014/11/21 23:30:15 [WARN] raft: Heartbeat timeout reached, starting election
2014/11/21 23:30:15 [INFO] raft: Node at 172.17.0.3:8300 [Candidate] entering Candidate state
2014/11/21 23:30:16 [INFO] raft: Election won. Tally: 1
2014/11/21 23:30:16 [INFO] raft: Node at 172.17.0.3:8300 [Leader] entering Leader state
2014/11/21 23:30:16 [INFO] consul: cluster leadership acquired
2014/11/21 23:30:16 [INFO] consul: New leader elected: node1
2014/11/21 23:30:16 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2014/11/21 23:30:16 [INFO] consul: member 'node1' joined, marking health alive
2014/11/21 23:30:16 [INFO] agent: Synced service 'consul'
==> Failed to check for updates: Get https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=4f33e3ab-7ef7-7220-5d0d-4844e9a08500&version=0.4.1: x509: failed to load system roots and no roots provided
via @armon
I realize that this may be an issue with fig/boot2docker more than docker-consul, but I thought I'd start here to collect info in order to open a bug report.
I can run docker-consul just fine from boot2docker on the command line:
$docker run -t -i -p 8400:8400 -p 8500:8500 -p 8600:53/udp -name consul progrium/consul -server -bootstrap
However, if I try to automate running Consul and Registrator in a dev environment using fig and this file:
consul:
ports:
- "8400:8400"
- "8500:8500"
- "8600:53/udp"
image: progrium/consul
hostname: node1
name: consul
command: -server -bootstrap
registrator:
links:
- consul:consul
hostname: dev.6sense.com
image: progrium/registrator
command: consul://consul:8500 --ttl 500
volumes:
- /var/run/docker.sock:/tmp/docker.sock
I get the following errors:
consul_1 | 2014/11/09 02:08:04 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:04 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:04 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/demanda?passing=1&tag=production&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/demanda?passing=1&tag=staging&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/dataapi?passing=1&tag=staging&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/ui?passing=1&tag=production&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [ERR] http: Request /v1/health/service/ui?passing=1&tag=staging&wait=60000ms, error: No cluster leader
consul_1 | 2014/11/09 02:08:05 [WARN] raft: Election timeout reached, restarting election
consul_1 | 2014/11/09 02:08:05 [INFO] raft: Node at 172.17.0.93:8300 [Candidate] entering Candidate state
consul_1 | 2014/11/09 02:08:07 [WARN] raft: Election timeout reached, restarting election
consul_1 | 2014/11/09 02:08:07 [INFO] raft: Node at 172.17.0.93:8300 [Candidate] entering Candidate state
consul_1 | 2014/11/09 02:08:07 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:07 [ERR] raft: Failed to make RequestVote RPC to 172.17.0.89:8300: dial tcp 172.17.0.89:8300: no route to host
consul_1 | 2014/11/09 02:08:08 [WARN] raft: Election timeout reached, restarting election
consul_1 | 2014/11/09 02:08:08 [INFO] raft: Node at 172.17.0.93:8300 [Candidate] entering Candidate state
consul_1 | 2014/11/09 02:08:09 [WARN] raft: Election timeout reached, restarting election
consul_1 | 2014/11/09 02:08:09 [INFO] raft: Node at 172.17.0.93:8300 [Candidate] entering Candidate state
Is this related to the ARP issue?
The dash shell (/bin/sh) doesn't allow hyphens in function names. Changing the main function name in start from cmd-run to cmd_run should allow running with this shell.
When I want to use the cli I run:
docker run --rm -e CONSUL_RPC_ADDR=addr:8400 -ti --entrypoint /bin/bash progrium/consul
Then I can run consul members.
Is there an easier way, or could you support this in your start script?
I am not a curl expert by any means, but when testing using the check_http command, I'm seeing a number of exceptions in my Spring Boot application related to the health check.
The exceptions are the result of a broken pipe when trying to write the error response when the health check is failing. I can't replicate the issue when testing from my host machine or when running the health check from inside the Consul container.
I found that if I removed the --fail option from the curl command, the exceptions would go away when the health check ran. Alternatively, I can leave the --fail and add a && sleep 1 to the end of the bash command in check-http.
So it appears that with the --fail option, curl is returning before the entire response has been read, which allows the Docker container to shut down, thus closing the connection and resulting in the broken pipe.
I have a cluster of several nodes running Consul. The cluster has the consul "client" agent running on all nodes. These client agents stop when they detect a server agent booting up on the machine.
Right now, I have the "stop" command (in systemd) set to docker exec consul-client consul leave. This way, the agents leave the cluster gracefully, without showing up as failed.
However, since Consul is PID 1 in the Docker container, Docker automatically stops the container (no problem there) but also returns exit code 255.
I can let systemd know that this is a valid exit code, but of course it's not always valid: if Consul ever crashes, 255 would be an invalid exit code.
Any thoughts on how to best handle this use-case?
Consul 0.3.1 is now available: http://www.consul.io/downloads.html and includes the following:
FEATURES:
BUG FIXES:
IMPROVEMENTS:
This is a somewhat convoluted issue, please bear with me.
It is related to these issues:
hashicorp/consul#602
hashicorp/consul#724
When running in an environment such as AWS, you may need (as I do) to be able to resolve both the Amazon servers (for example) and the services registered in the Consul cluster, from inside other containers.
There is an issue with the Consul Docker image where you must run the container using `--net="host"`; otherwise, communication is unstable.
Your local Consul agent is used as DNS, but it recurses to 8.8.8.8, an issue that was addressed in the links above.
So now you could use `-recursor=[internal-network-DNS]`, and you would be able to resolve both ec2.internal (for example) and service.consul.
Great, right?
But wait! You are using `--net="host"`, so you are getting your container's resolv.conf file from the host, and in that file the consul search domain and the localhost nameserver are not configured (isolation!). AND you can NOT use the `-dns` and `-dns-search` flags (and port 53 is now occupied).
So, for this feature to actually work, you will need to either get the Consul Docker container to work without `--net="host"`, OR allow the `-dns-search` flag so you could modify the /etc/resolv.conf file via the `docker run` command (or hack it in the Dockerfile using env vars and bash).
Otherwise this will require you to modify resolv.conf on each and every host you are running the Consul agent on, to something like:
search ec2.internal service.consul
nameserver 127.0.0.1
nameserver [internal-network-DNS]
which, of course, goes against the entire Docker concept of containment / isolation / run anywhere etc.
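For reference, the `-recursor` setup described above would look something like this (a sketch; 10.0.0.2 stands in for your internal network's DNS resolver):

```shell
# Host networking plus an internal recursor (sketch; addresses are examples).
docker run --net="host" -h consul --name=consul progrium/consul \
  -server -bootstrap -recursor=10.0.0.2
```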
Installing any package fails with the error "can't open '/lib/functions.sh'".
I've checked the sources, though, and don't understand how the file could have disappeared.
docker run --rm -it --entrypoint /bin/sh progrium/consul
/ # opkg-install <anything>
...
Configuring terminfo.
//usr/lib/opkg/info/terminfo.postinst: .: line 3: can't open '/lib/functions.sh'
And a lot of other lines with the same error.
When doing "docker stop" on a container, the consul process will receive a SIGTERM. However, the consul agent does not leave the cluster gracefully upon SIGTERM by default. This results in problems when the node is restarted because it is assigned a new IP address. This then triggers this bug: hashicorp/consul#457.
By adding `leave_on_terminate = true` to the consul configuration in /config/consul.json, a `docker stop` command would be treated by consul as a graceful shutdown of that agent.
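As a sketch, the relevant fragment of /config/consul.json (leaving out any other keys the file may have) would be just:

```json
{
  "leave_on_terminate": true
}
```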
I noticed that when I run the consul container, it actually ignores all my SIGINTs and SIGTERMs. For example:
docker run --name="foobar" --rm progrium/consul
When you press Ctrl+C, it doesn't stop. I checked the Docker source and confirmed that there is signal proxying. You can even try to kill it with
docker kill -s INT foobar
Still, it doesn't stop. The running program is actually /bin/bash, and it has a subprocess `consul`. I am not sure how bash handles signals, but in any case, it simply ignores interrupts.
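A common fix for this pattern, assuming the image's entrypoint is a bash wrapper (a sketch, not the actual script from this repo), is to `exec` consul so it replaces the shell as PID 1 and receives the proxied signals directly:

```shell
#!/bin/bash
# Entrypoint wrapper (sketch): "exec" replaces bash with consul instead of
# forking a child, so SIGINT/SIGTERM from "docker kill"/"docker stop"
# reach the consul process itself.
exec consul agent "$@"
```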
NB: cross posted in the consul Google group and progrium/registrator#97, for consistency.
It seems as though any health check I register, be it through the agent (attached to a service), through the catalog, or even through progrium/registrator, leads to a health check interval that is seemingly ignored.
E.g.
curl --include --request PUT --data-binary "{
\"ID\": \"nginx1\",
\"Name\": \"nginx\",
\"Port\": 80,
\"Check\": {
\"Name\": \"Nginx health check\",
\"Notes\": \"Script based health check\",
\"Status\": \"unknown\",
\"Script\": \"curl $IP/webhook/health\",
\"Interval\": \"1m\"
}
}" $IP:8500/v1/agent/service/register
This service gets hit on /webhook/health every five minutes, instead of every one minute (or whatever value of `Interval` I happen to set).
Hi,
I want to change the consul domain, which isn't possible with this Docker image.
I'm happy to contribute the necessary changes; just let me know if you have ideas how.
I would probably modify the JSON as part of the run script, or something...?
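For what it's worth, upstream Consul supports a `domain` key in its JSON configuration, so the change could boil down to merging something like this (the domain value here is just an example) into the generated config:

```json
{
  "domain": "mycorp."
}
```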
Hi there and thanks a lot for this.
When I run as documented:
docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap
Then consul keeps waiting for other nodes:
curl localhost:8500/v1/catalog/nodes
No cluster leader
This command seems to fix the problem:
docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp -h node1 progrium/consul -server -bootstrap-expect 1
Is this the right way to start a single instance?
Consul uses Serf, and Serf uses its ports over both UDP and TCP; in the list of published ports you have to add 8301 and 8302 with /udp as well, otherwise there will be connectivity issues.
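Putting that together with the single-node command above, publishing the Serf ports over both protocols might look like this (a sketch; adjust ports and flags to your setup):

```shell
docker run -p 8400:8400 -p 8500:8500 -p 8600:53/udp \
  -p 8301:8301 -p 8301:8301/udp \
  -p 8302:8302 -p 8302:8302/udp \
  -h node1 progrium/consul -server -bootstrap-expect 1
```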