Comments (24)
It seems that the above problem happens when the health-check's run-time exceeds the interval; it cannot deal properly with more than one health-check running at the same time, and the older one becomes a zombie.
from docker-consul.
Interesting. Might be a Consul issue.
Jeff Lindsay
http://progrium.com
from docker-consul.
@progrium, I think that might occur because Docker ignores the --rm flag when the docker run command finishes with an error:
$ docker run --rm --name test --net container:nonexistent_or_non_running --entrypoint /bin/bash progrium/consul -c "echo ok"
# FATA[0000] Error response from daemon: Cannot start container xxx: cannot join network of a non running container: yyy
$ docker ps -a
# CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
# 2cab56aabded progrium/consul:latest "/bin/bash -c 'echo 1 second ago test
The Docker authors say this was done intentionally, to simplify debugging.
So, when the target container dies, all checks start producing zombie containers. Those containers are not running, but they hang around in the list. I hit this issue and fixed it by manually removing the ephemeral containers:
# check-http
# ...
# Give each ephemeral check container a unique name, run the check,
# then remove the container whether or not the check succeeded.
local check_id="chk-$(cat /proc/sys/kernel/random/uuid)"
docker run --name $check_id --net container:$container_id --entrypoint "/bin/bash" $IMAGE -c "$curl_cmd"
local code=$?
docker rm -f $check_id >/dev/null 2>&1 || true
exit $code
(and the same in check-cmd). Should I submit a PR with this fix?
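In isolation, the exit-code handling in that fix looks like this (run_and_cleanup and the false cleanup step are illustrative stand-ins, not part of docker-consul):

```shell
#!/usr/bin/env bash
# Run a command, attempt cleanup afterwards, and still propagate the
# command's original exit code. '|| true' keeps a failed cleanup (e.g.
# docker rm on an already-removed container) from masking that code.
run_and_cleanup() {
  "$@"
  local code=$?
  false || true   # stand-in for: docker rm -f $check_id >/dev/null 2>&1 || true
  return $code
}

run_and_cleanup /bin/sh -c 'exit 3'
code=$?
echo "propagated exit code: $code"
```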
from docker-consul.
This is usually not an issue when using Registrator, because it deregisters a container as soon as it dies. But zombie ephemeral containers can still be produced occasionally, when Consul manages to run its check after the target container dies, but before Registrator deregisters it.
from docker-consul.
To be clear, the zombie processes I'm experiencing are happening while the Consul Docker container is running. When I stop the Consul container, all the zombies are killed (if only killing zombies were always so easy ;)
from docker-consul.
Hmm, it seems that I misread you, sorry about that. You are definitely facing a different issue, probably a Consul one. Mine was about zombie containers produced by the check-http and check-cmd scripts that come with docker-consul.
from docker-consul.
@mstoops, I couldn't reproduce your issue with Consul 0.4.1 and Docker 1.4.1 (though I believe the Docker version shouldn't matter here). I used this test script, which returns after 20 seconds:
#!/usr/bin/env bash
sleep 20
exit 2
and this service definition:
{
  "ID": "test-service",
  "Name": "test-service",
  "Tags": [],
  "Port": 12345,
  "Check": {
    "Script": "/opt/consul/test-check",
    "Interval": "5s"
  }
}
Then I watched the output of docker exec -it consul top, and at each moment there were either no check processes in the list, or exactly one check process. So Consul seems to behave correctly, waiting 5 seconds after the last check finished (not started).
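The scheduling behaviour observed above can be sketched in shell (timings scaled down; run_check is a made-up stand-in for a real check script):

```shell
#!/usr/bin/env bash
# The interval is counted from when the previous check *finished*, so a
# check that overruns its interval still never overlaps the next run.
interval=1
run_check() { sleep 2; }   # a check that overruns the 1s interval

t0=$(date +%s)
for i in 1 2; do
  run_check           # the check itself (2s)
  sleep "$interval"   # wait the full interval after completion
done
total=$(( $(date +%s) - t0 ))
echo "2 cycles took ${total}s with no overlapping checks"
```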
from docker-consul.
@skozin That is strange. I currently have 4 AWS instances, each running about 9 separate docker containers, with Consul keeping tabs on the health of the main process running in them; there is a separate Consul Docker container on each instance. I'm seeing 300-500 zombie processes inside the Consul Docker container when I do a ps aux (I used docker exec to get into the container). This is after I set the service definition's interval to 10 seconds; the curl call used in the health check script is set to time out after 5 seconds, and the rest of the health check script should return almost immediately. From what you have implied, the interval's timing is based on the end of the script's last run. I'm wondering if the health checks aren't releasing file descriptors when they finish for some reason.
from docker-consul.
I'm running into (probably) the same problem. I've been using consul with registrator and ambassadord locally, and I was wondering why my docker setup was getting unbearably slow. Turns out I have over 16k stopped docker containers from consul. Just retrieving the list with docker ps -a takes about a minute.
This is the State of one of those containers:
"State": {
  "Dead": false,
  "Error": "cannot join network of a non running container: 102a3d3b0fbc",
  "ExitCode": 128,
  "FinishedAt": "0001-01-01T00:00:00Z",
  "OOMKilled": false,
  "Paused": false,
  "Pid": 0,
  "Restarting": false,
  "Running": false,
  "StartedAt": "0001-01-01T00:00:00Z"
},
from docker-consul.
@mstoops,
When you run ps within your consul container, does it report the names of the zombie processes as well (i.e. what executable is stuck)?
Seeing a part of that ps output might make it easier to figure out why the zombie processes are created in the first place!
Cheers,
from docker-consul.
I see this consistently with Consul 0.5.0 (using the progrium/consul image - I haven't yet been able to test with the new image) on CoreOS on AWS.
From the consul container:
# consul version
Consul v0.5.0
Consul Protocol: 2 (Understands back to: 1)
# docker info
Containers: 14
Images: 302
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Kernel Version: 4.0.5
Operating System: CoreOS 717.3.0
CPUs: 8
Total Memory: 14.69 GiB
# docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.4.1
Git commit (client): a8a31ef
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2-dirty
Unfortunately the ps output is quite useless. Here's a portion:
11553 root [docker]
12936 root [docker]
16355 root [docker]
It isn't really a big deal - the zombies build up slowly enough that a Consul restart every few weeks is currently sufficient to keep it from becoming problematic. They're just zombies, after all, and resource usage on the machines is fine.
I'm unsure which of our check scripts is causing it - several of them use the docker client. It would be nice if the container had a "proper" pid 1 (or if Consul functioned as one).
from docker-consul.
Not sure what to do about this issue. It's either a Docker issue or a Consul issue, I'm not sure there's much I can do here with the Docker image...
from docker-consul.
These are zombie processes. The issue is almost certainly something that manages to effectively double-fork, combined with Consul not reaping those child processes. The most likely cause is that the check's shell script is killed while the docker client is still running, which orphans the docker client. In that case, if Consul isn't wait()-ing on the orphaned process, we'll have zombies building up.
This basically makes any check script that shells out to anything that may occasionally overrun its time slice unsafe.
When I get a chance I'll try to narrow down a specific test-case, and then try to build a modified version of the docker image that puts in place a minimal init like http://git.suckless.org/sinit - if that fixes it, then the solution is to introduce a minimal init like that in the image (sinit is 79 lines of C) or modify Consul to support reaping the child processes.
from docker-consul.
@vidarh Sorry for the shameless plug, but in case you're interested, I actually went down the road of building a minimal init system that reaps zombies (which is what you were suggesting you'd do with sinit) and does proper signal handling.
You can find it here: https://github.com/krallin/tini — it should be drop-in and reap zombies for you (the Jenkins Docker image uses it for that purpose)
Cheers,
from docker-consul.
Both of these projects are very cool. I'm wondering if there's anything close in Go, or maybe somebody could port one to Go? I'd love to include it in https://github.com/progrium/entrykit
from docker-consul.
@progrium I haven't really considered a Go port of Tini. It's a simple program, but it mainly does low-level things like setting sigmasks, forwarding signals, and waiting on processes, so C felt like the right language for it.
I'd have to look at whether those things can easily be done in Go! (I originally considered Rust, where that wasn't the case.)
from docker-consul.
@krallin very interesting, thanks.
@progrium The minimal requirements for an init are truly minimal, so as @krallin said it boils down to whether they can easily be done in Go - they're certainly small enough that if the facilities are there, it's not a huge effort.
But sinit ends up at 13512 bytes when built with musl instead of glibc (with glibc it weighs in at 800KB), with no other runtime dependencies (additional build dependencies, of course).
from docker-consul.
@progrium I couldn't resist looking into this, and here's a very, very limited init (even more limited than Suckless Init, and certainly much more limited than what @krallin linked to), sufficient to reap the zombies, translated to Go. Probably buggy/missing stuff, and I don't know the low-level Go stuff well enough to be sure the os/exec stuff ends up being entirely equivalent, but it certainly does reap the child processes:
https://gist.github.com/vidarh/91a110792c86d6c3bb41
Note that as I mention in the readme file I added to the gist, the Go version ends up at 2.9MB vs <14KB for the musl libc statically linked Suckless Init, so I wouldn't recommend actually using this vs. simply dropping in sinit or similar.
However, what's more viable is to simply drop this goroutine or a variation into projects, to give them the ability to reap child processes if/when they run as pid 1:
go func() {
	var wstatus syscall.WaitStatus
	for {
		// Block until any child changes state, then reap it.
		pid, err := syscall.Wait4(-1, &wstatus, 0, nil)
		if err == syscall.ECHILD {
			// No children at the moment; Wait4 returns ECHILD
			// immediately, so back off instead of busy-looping.
			time.Sleep(time.Second)
			continue
		}
		if err != nil {
			log.Println(err)
		} else {
			log.Println("Reaping child", pid)
		}
	}
}()
Proposing this for upstream Consul might be the best alternative, if they'll take it, as it protects unsuspecting victims building their own containers at a very low complexity cost.
from docker-consul.
Yeah, if this is all it is, I might throw it into entrykit, which has some other features that will already use up the size of the runtime. But it'll let me make this container reap processes even if Consul doesn't.
from docker-consul.
@progrium On top of what's in my linked gist you might also want to consider gracefully forwarding SIGTERM and optionally other signals (SIGHUP, SIGUSR1 would both be useful to nicely handle apps that can be made to reload config etc. through a signal) to the spawned process. That adds something like this:
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt, syscall.SIGTERM)
// Add other signals as needed, e.g. syscall.SIGHUP, syscall.SIGUSR1
go func() {
	for sig := range c {
		// Forward the same signal to the spawned child, e.g.
		// cmd.Process.Signal(sig) for a child started via os/exec.
		_ = sig
	}
}()
If you're already going to be spawning things under entrykit, then the earlier goroutine will make it better behaved in any case.
from docker-consul.
Yep, passing signals already.
from docker-consul.
Here is how you can delete all the zombie services for good: go into your consul server, find the location of the json files containing the zombies, and delete them.
For example I am running consul in a container:
docker run --restart=unless-stopped -d -h consul0 --name consul0 -v /mnt:/data \
-p $(hostname -i):8300:8300 \
-p $(hostname -i):8301:8301 \
-p $(hostname -i):8301:8301/udp \
-p $(hostname -i):8302:8302 \
-p $(hostname -i):8302:8302/udp \
-p $(hostname -i):8400:8400 \
-p $(hostname -i):8500:8500 \
-p $(ifconfig docker0 | awk '/\<inet\>/ { print $2}' | cut -d: -f2):53:53/udp \
progrium/consul -server -advertise $(hostname -i) -bootstrap-expect 3
Notice the flag -v /mnt:/data: this is where all the data consul stores is located. For me it was /mnt. Under this directory you will find several other directories:
config raft serf services tmp
Go into services and you will see the json files containing your services. Find any that belong to a zombie and delete them, then restart consul. Repeat for each server in your cluster that has zombies on it.
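The workflow above, rehearsed against a throwaway directory standing in for the mounted data dir (the zombie-svc name is made up):

```shell
#!/usr/bin/env bash
# Simulate the consul data dir layout, then remove a stale service file.
data_dir=$(mktemp -d)
mkdir -p "$data_dir/services"
echo '{"ID":"zombie-svc","Name":"zombie-svc"}' > "$data_dir/services/zombie-svc.json"

rm "$data_dir/services/zombie-svc.json"   # delete the stale definition
remaining=$(ls "$data_dir/services" | wc -l)
echo "service files remaining: $remaining"
# On a real host you would now restart the consul container, e.g.:
#   docker restart consul0
rm -rf "$data_dir"
```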
from docker-consul.
Using the HTTP API to remove services is another, much nicer solution; I just happened to figure out manual removal before I figured out the HTTP API.
To deregister a service with the HTTP API, use the following command:
curl -v -X PUT http://<consul_ip_address>:8500/v1/agent/service/deregister/<ServiceID>
Note that your ServiceID is a combination of three things: the IP address of the host machine the container is running on, the name of the container, and the inner port of the container (i.e. 80 for apache, 3000 for node js, 8000 for django, etc.), all separated by colons :
Here's an example of what that would actually look like:
curl -v -X PUT http://1.2.3.4:8500/v1/agent/service/deregister/192.168.1.1:sharp_apple:80
If you want an easy way to get the ServiceID then just curl the service that contains a zombie:
curl -s http://<consul_ip_address>:8500/v1/catalog/service/<your_services_name>
Here's a real example for a service called someapp that will return all the services under it:
curl -s http://1.2.3.4:8500/v1/catalog/service/someapp
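The ServiceID construction described above can be wrapped in a small helper; a sketch (the helper name and all the example values are illustrative, not part of Consul):

```shell
#!/usr/bin/env bash
# Build the ServiceID from its three parts (host IP, container name,
# inner port) and form the deregister URL for the agent HTTP API.
deregister_url() {
  local consul_ip="$1" host_ip="$2" container_name="$3" inner_port="$4"
  echo "http://${consul_ip}:8500/v1/agent/service/deregister/${host_ip}:${container_name}:${inner_port}"
}

url=$(deregister_url 1.2.3.4 192.168.1.1 sharp_apple 80)
echo "$url"
# Against a real cluster you would then run:  curl -v -X PUT "$url"
```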
from docker-consul.
@cohenaj194 those are useful tips, but those are not the kind of zombies this issue is/was about. This issue was about health checks that intentionally or accidentally ended up "double-forking": you'd build up zombie processes for those health checks if running consul in a container without a zombie-reaping init. It was an entirely separate problem from cleaning out stale services.
from docker-consul.