Comments (24)
It seems that the above problem happens when the health-check's run-time exceeds the interval; it cannot deal properly with more than one health-check running at the same time, and the older one becomes a zombie.
from docker-consul.
Interesting. Might be a Consul issue.
Jeff Lindsay
http://progrium.com
from docker-consul.
@progrium, I think that might occur because Docker ignores the --rm flag when the docker run command finishes with an error:
$ docker run --rm --name test --net container:nonexistent_or_non_running --entrypoint /bin/bash progrium/consul -c "echo ok"
# FATA[0000] Error response from daemon: Cannot start container xxx: cannot join network of a non running container: yyy
$ docker ps -a
# CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
# 2cab56aabded progrium/consul:latest "/bin/bash -c 'echo 1 second ago test
The Docker authors say this was done intentionally, to simplify debugging.
So, when the target container dies, all checks start producing zombie containers. Those containers are not running, but they hang around in the list. I hit this issue and fixed it by manually removing the ephemeral containers:
# check-http
# ...
# Give each ephemeral check container a unique name, run the check,
# then remove the container whether or not the check succeeded.
local check_id="chk-$(cat /proc/sys/kernel/random/uuid)"
docker run --name $check_id --net container:$container_id --entrypoint "/bin/bash" $IMAGE -c "$curl_cmd"
local code=$?
docker rm -f $check_id >/dev/null 2>&1 || true
exit $code
(and the same in check-cmd). Should I submit a PR with this fix?
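In isolation, the exit-code handling in that fix looks like this (run_and_cleanup and the false cleanup step are illustrative stand-ins, not part of docker-consul):

```shell
#!/usr/bin/env bash
# Run a command, attempt cleanup afterwards, and still propagate the
# command's original exit code. '|| true' keeps a failed cleanup (e.g.
# docker rm on an already-removed container) from masking that code.
run_and_cleanup() {
  "$@"
  local code=$?
  false || true   # stand-in for: docker rm -f $check_id >/dev/null 2>&1 || true
  return $code
}

run_and_cleanup /bin/sh -c 'exit 3'
code=$?
echo "propagated exit code: $code"
```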
from docker-consul.
This is usually not an issue when using Registrator, because it deregisters a container as soon as it dies. But zombie ephemeral containers can still be produced occasionally, when Consul manages to run its check after the target container dies, but before Registrator deregisters it.
from docker-consul.
To be clear, the zombie processes I'm experiencing are happening while the Consul Docker container is running. When I stop the Consul container, all the zombies are killed (if only killing zombies were always so easy ;)
from docker-consul.
Hmm, it seems that I misread you, sorry about that. You are definitely facing a different issue, probably a Consul one. Mine was about zombie containers produced by the check-http and check-cmd scripts that come with docker-consul.
from docker-consul.
@mstoops, I couldn't reproduce your issue with Consul 0.4.1 and Docker 1.4.1 (though I believe the Docker version shouldn't matter here). I used this test script, which returns after 20 seconds:
#!/usr/bin/env bash
sleep 20
exit 2
and this service definition:
{
  "ID": "test-service",
  "Name": "test-service",
  "Tags": [],
  "Port": 12345,
  "Check": {
    "Script": "/opt/consul/test-check",
    "Interval": "5s"
  }
}
Then I watched the output of docker exec -it consul top, and at each moment there were either no check processes in the list, or exactly one check process. So Consul seems to behave correctly, waiting 5 seconds after the last check finished (not started).
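The scheduling behaviour observed above can be sketched in shell (timings scaled down; run_check is a made-up stand-in for a real check script):

```shell
#!/usr/bin/env bash
# The interval is counted from when the previous check *finished*, so a
# check that overruns its interval still never overlaps the next run.
interval=1
run_check() { sleep 2; }   # a check that overruns the 1s interval

t0=$(date +%s)
for i in 1 2; do
  run_check           # the check itself (2s)
  sleep "$interval"   # wait the full interval after completion
done
total=$(( $(date +%s) - t0 ))
echo "2 cycles took ${total}s with no overlapping checks"
```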
from docker-consul.
@skozin That is strange. I currently have 4 AWS instances, each running about 9 separate docker containers, with Consul keeping tabs on the health of the main process running in them; there is a separate Consul Docker container on each instance. I'm seeing 300-500 zombie processes inside the Consul Docker container when I do a ps aux (I used docker exec to get into the container). This is after I set the service definition's interval to 10 seconds; the curl call used in the health check script is set to time out after 5 seconds, and the rest of the health check script should return almost immediately. From what you have implied, the interval's timing is based on the end of the script's last run. I'm wondering if the health checks aren't releasing file descriptors when they finish for some reason.
from docker-consul.
I'm running into (probably) the same problem. I've been using consul with registrator and ambassadord locally, and I was wondering why my docker setup was getting unbearably slow. Turns out I have over 16k stopped docker containers from consul. Just retrieving the list with docker ps -a takes about a minute.
This is the State of one of those containers:
"State": {
  "Dead": false,
  "Error": "cannot join network of a non running container: 102a3d3b0fbc",
  "ExitCode": 128,
  "FinishedAt": "0001-01-01T00:00:00Z",
  "OOMKilled": false,
  "Paused": false,
  "Pid": 0,
  "Restarting": false,
  "Running": false,
  "StartedAt": "0001-01-01T00:00:00Z"
},
from docker-consul.
@mstoops,
When you run ps within your consul container, does it report the names of the zombie processes as well (i.e. what executable is stuck)?
Seeing a part of that ps output might make it easier to figure out why the zombie processes are created in the first place!
Cheers,
from docker-consul.
I see this consistently with Consul 0.5.0 (using the progrium/consul image - I haven't yet been able to test with the new image) on CoreOS on AWS.
From the consul container:
# consul version
Consul v0.5.0
Consul Protocol: 2 (Understands back to: 1)
# docker info
Containers: 14
Images: 302
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Kernel Version: 4.0.5
Operating System: CoreOS 717.3.0
CPUs: 8
Total Memory: 14.69 GiB
# docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.4.1
Git commit (client): a8a31ef
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2-dirty
Unfortunately the ps output is quite useless. Here's a portion:
11553 root [docker]
12936 root [docker]
16355 root [docker]
It isn't really a big deal - the zombies build up slowly enough that a Consul restart every few weeks is currently sufficient to keep it from becoming problematic. They're just zombies, after all, and resource usage on the machines is fine.
I'm unsure which of our check scripts is causing it - several of them use the docker client. It would be nice if the container had a "proper" pid 1 (or if Consul functioned as one).
from docker-consul.
Not sure what to do about this issue. It's either a Docker issue or a Consul issue, I'm not sure there's much I can do here with the Docker image...
from docker-consul.
These are zombie processes. The issue is almost certainly something that manages to effectively double-fork, combined with Consul not reaping those child processes. The most likely cause is that the check's shell script is killed while the docker client is still running, which orphans the docker client. In that case, if Consul isn't wait()-ing on the orphaned process, we'll have zombies building up.
This basically makes any check script that shells out to anything that may occasionally overrun its time slice unsafe.
When I get a chance I'll try to narrow down a specific test-case, and then try to build a modified version of the docker image that puts in place a minimal init like http://git.suckless.org/sinit - if that fixes it, then the solution is to introduce a minimal init like that in the image (sinit is 79 lines of C) or modify Consul to support reaping the child processes.
from docker-consul.
@vidarh Sorry for the shameless plug, but in case you're interested, I actually went down the road of building a minimal init system that reaps zombies (which is what you were suggesting you'd do with sinit) and does proper signal handling.
You can find it here: https://github.com/krallin/tini — it should be drop-in and reap zombies for you (the Jenkins Docker image uses it for that purpose)
Cheers,
from docker-consul.
Both of these projects are very cool. I'm wondering if there's anything close in Go, or maybe somebody could port one to Go? I'd love to include it in https://github.com/progrium/entrykit
from docker-consul.
@progrium I haven't really considered a Go port of Tini. It's a simple program, but it mainly does low-level things like setting sigmasks, forwarding signals, and waiting on processes, so C felt like the right language for it.
I'd have to look at whether those things can easily be done in Go! (I originally considered Rust, where that wasn't the case.)
from docker-consul.
@krallin very interesting, thanks.
@progrium The minimal requirements for an init are truly minimal, so as @krallin said it boils down to whether they can easily be done in Go - they're certainly small enough that if the facilities are there, it's not a huge effort.
But sinit ends up at 13512 bytes when built with musl instead of glibc (with glibc it weighs in at 800KB), with no other runtime dependencies (additional build dependencies, of course).
from docker-consul.
@progrium I couldn't resist looking into this, and here's a very, very limited init (even more limited than Suckless Init, and certainly much more limited than what @krallin linked to), sufficient to reap the zombies, translated to Go. Probably buggy/missing stuff, and I don't know the low-level Go stuff well enough to be sure the os/exec stuff ends up being entirely equivalent, but it certainly does reap the child processes:
https://gist.github.com/vidarh/91a110792c86d6c3bb41
Note that as I mention in the readme file I added to the gist, the Go version ends up at 2.9MB vs <14KB for the musl libc statically linked Suckless Init, so I wouldn't recommend actually using this vs. simply dropping in sinit or similar.
However, what's more viable is to simply drop this goroutine or a variation into projects, to give them the ability to reap child processes if/when they run as pid 1:
go func() {
	var wstatus syscall.WaitStatus
	for {
		// Block until any child changes state, then reap it.
		pid, err := syscall.Wait4(-1, &wstatus, 0, nil)
		if err == syscall.ECHILD {
			// No children at the moment; Wait4 returns ECHILD
			// immediately, so back off instead of busy-looping.
			time.Sleep(time.Second)
			continue
		}
		if err != nil {
			log.Println(err)
		} else {
			log.Println("Reaping child", pid)
		}
	}
}()
Proposing this for upstream Consul might be the best alternative, if they'll take it, as it protects unsuspecting victims building their own containers at a very low complexity cost.
from docker-consul.
Yeah, if this is all it is, I might throw it into entrykit, which has some other features that will already use up the size of the runtime. But it'll let me make this container reap processes even if Consul doesn't.
from docker-consul.
@progrium On top of what's in my linked gist you might also want to consider gracefully forwarding SIGTERM and optionally other signals (SIGHUP, SIGUSR1 would both be useful to nicely handle apps that can be made to reload config etc. through a signal) to the spawned process. That adds something like this:
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt, syscall.SIGTERM)
// Add other signals as needed, e.g. syscall.SIGHUP, syscall.SIGUSR1
go func() {
	for sig := range c {
		// Forward the same signal to the spawned child, e.g.
		// cmd.Process.Signal(sig) for a child started via os/exec.
		_ = sig
	}
}()
If you're already going to be spawning things under entrykit, then the earlier goroutine will make it better behaved in any case.
from docker-consul.
Yep, passing signals already.
from docker-consul.
Here is how you can delete all the zombie services for good: go into your consul server, find the location of the json files containing the zombies, and delete them.
For example I am running consul in a container:
docker run --restart=unless-stopped -d -h consul0 --name consul0 -v /mnt:/data \
-p $(hostname -i):8300:8300 \
-p $(hostname -i):8301:8301 \
-p $(hostname -i):8301:8301/udp \
-p $(hostname -i):8302:8302 \
-p $(hostname -i):8302:8302/udp \
-p $(hostname -i):8400:8400 \
-p $(hostname -i):8500:8500 \
-p $(ifconfig docker0 | awk '/\<inet\>/ { print $2}' | cut -d: -f2):53:53/udp \
progrium/consul -server -advertise $(hostname -i) -bootstrap-expect 3
Notice the flag -v /mnt:/data: this is where all the data consul stores is located. For me it was /mnt. Under this directory you will find several other directories:
config raft serf services tmp
Go into services and you will see the json files containing your services. Find any that belong to a zombie and delete them, then restart consul. Repeat for each server in your cluster that has zombies on it.
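The workflow above, rehearsed against a throwaway directory standing in for the mounted data dir (the zombie-svc name is made up):

```shell
#!/usr/bin/env bash
# Simulate the consul data dir layout, then remove a stale service file.
data_dir=$(mktemp -d)
mkdir -p "$data_dir/services"
echo '{"ID":"zombie-svc","Name":"zombie-svc"}' > "$data_dir/services/zombie-svc.json"

rm "$data_dir/services/zombie-svc.json"   # delete the stale definition
remaining=$(ls "$data_dir/services" | wc -l)
echo "service files remaining: $remaining"
# On a real host you would now restart the consul container, e.g.:
#   docker restart consul0
rm -rf "$data_dir"
```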
from docker-consul.
Using the HTTP API to remove services is another, much nicer solution; I just happened to figure out manual removal before I figured out the HTTP API.
To deregister a service with the HTTP API, use the following command:
curl -v -X PUT http://<consul_ip_address>:8500/v1/agent/service/deregister/<ServiceID>
Note that your ServiceID is a combination of three things: the IP address of the host machine the container is running on, the name of the container, and the inner port of the container (i.e. 80 for apache, 3000 for node js, 8000 for django, etc.), all separated by colons :
Here's an example of what that would actually look like:
curl -v -X PUT http://1.2.3.4:8500/v1/agent/service/deregister/192.168.1.1:sharp_apple:80
If you want an easy way to get the ServiceID then just curl the service that contains a zombie:
curl -s http://<consul_ip_address>:8500/v1/catalog/service/<your_services_name>
Here's a real example for a service called someapp that will return all the services under it:
curl -s http://1.2.3.4:8500/v1/catalog/service/someapp
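The ServiceID construction described above can be wrapped in a small helper; a sketch (the helper name and all the example values are illustrative, not part of Consul):

```shell
#!/usr/bin/env bash
# Build the ServiceID from its three parts (host IP, container name,
# inner port) and form the deregister URL for the agent HTTP API.
deregister_url() {
  local consul_ip="$1" host_ip="$2" container_name="$3" inner_port="$4"
  echo "http://${consul_ip}:8500/v1/agent/service/deregister/${host_ip}:${container_name}:${inner_port}"
}

url=$(deregister_url 1.2.3.4 192.168.1.1 sharp_apple 80)
echo "$url"
# Against a real cluster you would then run:  curl -v -X PUT "$url"
```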
from docker-consul.
@cohenaj194 those are useful tips, but those are not the kind of zombies this issue is/was about. This issue was about health checks that intentionally or accidentally ended up "double-forking": you'd build up zombie processes for those health checks if running consul in a container without a zombie-reaping init. It was an entirely separate problem from cleaning out stale services.
from docker-consul.