Describe the bug We have a separate fleet of Puma w

This bug was actually reported in <a href="https://bugs.ruby-lang.org/issues/19837" re

I should note that on Linux Docker, PID 1 seems to work fine: <div class="highligh

Interesting, that was removed in 3.3: <a class="issue-link js-issue-link" data-error-t

Puma cluster not reaping child processes when PID is 1 with Puma 6.4.1,about puma/puma

Comments (18)

stanhu commented on May 18, 2024 2

Ok, our application has its own process supervisor that spawns a Prometheus metrics Web server. If I disable that, for some reason Process.wait2(-1, Process::WNOHANG) works and reaps the processes properly.

Most likely we're trapping SIGCHLD and interfering with the wait.

from puma.

stanhu commented on May 18, 2024 1

This bug was actually reported in https://bugs.ruby-lang.org/issues/19837 and fixed in the ruby_3_2 and ruby_3_1 stable branches, but there has yet to be a release with the fix.

from puma.

stanhu commented on May 18, 2024

On my test instance with Puma 6.4.1, I ran kill -9 44, and puma: cluster worker 0 did not come back:

git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git         5205       0  0 21:57 pts/0    00:00:00 bash
git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef

With Puma 6.4.0, that worked fine:

git@gitlab-webservice-default-78664bb757-97skg:/$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0 71 22:06 ?        00:00:36 puma 6.4.0 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           22       1  0 22:06 ?        00:00:00 /usr/local/bin/gitlab-logger /var/log/gitlab
git           36       0  0 22:07 pts/0    00:00:00 bash
git           65       1 22 22:07 ?        00:00:02 ruby /srv/gitlab/bin/metrics-server
git           68       1 22 22:07 ?        00:00:02 puma: cluster worker 0: 1 [gitlab-puma-worker]
git           70       1 22 22:07 ?        00:00:02 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           72       1 22 22:07 ?        00:00:02 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           74       1 22 22:07 ?        00:00:02 puma: cluster worker 3: 1 [gitlab-puma-worker]
git          148      36  0 22:07 pts/0    00:00:00 ps -ef
git@gitlab-webservice-default-78664bb757-97skg:/$ kill -9 68
git@gitlab-webservice-default-78664bb757-97skg:/$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
git            1       0 66 22:06 ?        00:00:36 puma 6.4.0 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git           22       1  0 22:06 ?        00:00:00 /usr/local/bin/gitlab-logger /var/log/gitlab
git           36       0  0 22:07 pts/0    00:00:00 bash
git           65       1 16 22:07 ?        00:00:02 ruby /srv/gitlab/bin/metrics-server
git           70       1 16 22:07 ?        00:00:02 puma: cluster worker 1: 1 [gitlab-puma-worker]
git           72       1 17 22:07 ?        00:00:02 puma: cluster worker 2: 1 [gitlab-puma-worker]
git           74       1 16 22:07 ?        00:00:02 puma: cluster worker 3: 1 [gitlab-puma-worker]
git          149       1 19 22:07 ?        00:00:00 puma: cluster worker 0: 1 [gitlab-puma-worker]
git          165      36  0 22:07 pts/0    00:00:00 ps -ef

from puma.

stanhu commented on May 18, 2024

I added debugging messages, and it seems that Process.wait2(-1, Process::WNOHANG) doesn't return anything when I ran kill <PID of worker>. The process is in the defunct state, so I'm a bit surprised that didn't work.

I applied this patch to get things working again:

diff --git a/lib/puma/cluster.rb b/lib/puma/cluster.rb
index 0d7c12bd..05d58445 100644
--- a/lib/puma/cluster.rb
+++ b/lib/puma/cluster.rb
@@ -562,7 +562,7 @@ module Puma
         begin
           # When `fork_worker` is enabled, some worker may not be direct children, but grand children.
           # Because of this they won't be reaped by `Process.wait2(-1)`, so we need to check them individually)
-          if reaped_children.delete(w.pid) || (@options[:fork_worker] && Process.wait(w.pid, Process::WNOHANG))
+          if reaped_children.delete(w.pid) || Process.wait(w.pid, Process::WNOHANG)
             true
           else
             w.term if w.term?

from puma.

stanhu commented on May 18, 2024

I should note that on Linux Docker, PID 1 seems to work fine:

#!/bin/env ruby

fork do
  loop { sleep 1 }
end

loop do
  puts Process.wait2(-1, Process::WNOHANG)
  sleep 1
end

My Dockerfile:

FROM ruby:3.1
COPY listen.rb .
RUN chmod +x listen.rb
ENTRYPOINT ["/listen.rb"]

If I run this container and forcibly kill the child:

% docker exec -it 3b53dc0dcbd5 bash
root@3b53dc0dcbd5:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 23:36 pts/0    00:00:00 ruby /listen.rb
root         7     1  0 23:36 pts/0    00:00:00 ruby /listen.rb
root         8     0  0 23:36 pts/1    00:00:00 bash
root        14     8  0 23:36 pts/1    00:00:00 ps -ef
root@3b53dc0dcbd5:/# kill 7
root@3b53dc0dcbd5:/# %

I see:

7
pid 7 SIGTERM (signal 15)
/listen.rb:8:in `wait2': No child processes (Errno::ECHILD)
	from /listen.rb:8:in `block in <main>'
	from /listen.rb:7:in `loop'
	from /listen.rb:7:in `<main>'

from puma.

stanhu commented on May 18, 2024

Strange, this worked fine with Kubernetes. I repeated the test above with a Google Kubernetes Engine pod:

% kubectl run listen-test --image=registry.gitlab.com/stanhu/lfs-test/listen-test:latest
pod/listen-test created
% kubectl exec -it listen-test -- bash
root@listen-test:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 05:44 ?        00:00:00 ruby /listen.rb
root           7       1  0 05:44 ?        00:00:00 ruby /listen.rb
root          21       0  0 05:46 pts/0    00:00:00 bash
root          27      21  0 05:46 pts/0    00:00:00 ps -ef
root@listen-test:/# kill 7
bash: kill: (7) - No such process
root@listen-test:/# command terminated with exit code 137

With kubectl logs -f listen-test running, I see:

7
pid 7 SIGTERM (signal 15)
/listen.rb:8:in `wait2': No child processes (Errno::ECHILD)
	from /listen.rb:8:in `block in <main>'
	from /listen.rb:7:in `loop'
	from /listen.rb:7:in `<main>'

I wonder what's different about Puma.

from puma.

stanhu commented on May 18, 2024

Still can't replicate this issue with a simple pod running Puma:

`Dockerfile.puma`

FROM ruby:3.1

RUN gem install puma:6.4.1
COPY hello.ru .

ENTRYPOINT ["puma", "hello.ru", "-w", "2"]

`hello.ru`

hdrs = {'Content-Type'.freeze => 'text/plain'.freeze}.freeze
body = ['Hello World'.freeze].freeze
run lambda { |env| [200, hdrs, body] }

Process.wait2(-1, Process::WNOHANG) is working fine in a pod as PID 1. I tried with a non-root user as well.

from puma.

casperisfine commented on May 18, 2024

Most likely we're trapping SIGCHLD and interfering with the wait.

I only skimmed the code you linked quickly (got hundreds of mail to catch up to today), but this sound weird.

I'm not sure why trapping SIGCHLD would make the wait fail. But I suppose at this stage it's best to try to come up with a smaller repro so we can better understand what's going on, and see what we could do to make this more resilient.

from puma.

stanhu commented on May 18, 2024

Yeah, I don't see any evidence we're trapping SIGCHLD, and I've tried to add signal handlers to see if that changes anything. I can't reproduce the problem yet.

I did notice that the Ruby implementation for Process.wait2 seems to mention SIGCHLD for some reason: https://github.com/ruby/ruby/blob/50c6cabadca44b7b034eae5dcc8017154a2858bd/process.c#L1343-L1348

from puma.

casperisfine commented on May 18, 2024

Interesting, that was removed in 3.3: ruby/ruby#7527

Seems like it was converting a blocking waitpid into a non blocking one by waiting for SIGCHLD, sounds quite brittle to me, but I don't have the full context.

from puma.

stanhu commented on May 18, 2024

Ok, it looks like in Ruby 3.1 and 3.2 Process.detach(<some PID != 1>) appears to prevent Process.wait2(-1, Process::WNOHANG) from finding child processes when the parent PID is 1.

The problem doesn't happen in Ruby 3.3. I wonder if ruby/ruby#7476 or ruby/ruby#7527 fixed this.

I'll update the comments in #3314 in light of this, so I think that pull request is still a good idea.

Here's a sample reproduction:

`Dockerfile`

FROM ruby:3.2

COPY listen.rb .
COPY test.sh .

RUN chmod +x listen.rb
RUN chmod +x test.sh

ENTRYPOINT ["/listen.rb"]

`listen.rb`

#!/bin/env ruby

fork do
  loop { sleep 1 }
end

Process.spawn({}, "./test.sh", err: $stderr, out: $stdout, pgroup: true).tap do |pid|
  STDERR.puts "detaching PID #{pid}"
  Process.detach(pid)
end

loop do
  STDERR.puts Process.wait2(-1, Process::WNOHANG)
  sleep 1
end

`test.sh`

#!/bin/sh
sleep 600

from puma.

stanhu commented on May 18, 2024

It appears that Process.detach simply spawns a separate thread that does a blocking waitpid. I think this uses the SIGCHLD implementation introduced in ruby/ruby@054a412. This comment in the commit message is telling:

We also work to suppress false-positives from Process.wait(-1, Process::WNOHANG) to quiets warnings from spec/ruby/core/process/wait2_spec.rb with MJIT enabled.

This makes me think that this code is suppressing child PIDs when the parent is PID 1.

from puma.

stanhu commented on May 18, 2024

I've confirmed that this Ruby SIGCHLD business is responsible. I disabled WAITPID_USE_SIGCHLD in my Ruby 3.1.4 interpreter, and Process.wait(-1, Process::WNOHANG) started working again:

diff --git a/vm_core.h b/vm_core.h
index 1cc0659700..0e7d1643fe 100644
--- a/vm_core.h
+++ b/vm_core.h
@@ -126,7 +126,7 @@
 #endif
 
 /* define to 0 to test old code path */
-#define WAITPID_USE_SIGCHLD (RUBY_SIGCHLD || SIGCHLD_LOSSY)
+#define WAITPID_USE_SIGCHLD 0
 
 #if defined(SIGSEGV) && defined(HAVE_SIGALTSTACK) && defined(SA_SIGINFO) && !defined(__NetBSD__)
 #  define USE_SIGALTSTACK

from puma.

stanhu commented on May 18, 2024

It appears that only Process.detach is only needed; PID 1 is not relevant. This Ruby script will get stuck in Ruby 3.1.4 and 3.2.2, but exits immediately in Ruby 3.3.0:

#!/bin/env ruby

forked_pid = fork do
  loop { sleep 1 }
end

Process.spawn({}, "sh -c 'sleep 60'", err: $stderr, out: $stdout).tap do |pid|
  puts "detaching PID #{pid}"
  Process.detach(pid)
end

child_waiter = Thread.new do
  puts "Waiting for child process to die..."

  # This works 
  # puts Process.wait2(forked_pid)

  # This fails in Ruby 3.1 and 3.2
  puts Process.wait2(-1)
end

process_killer = Thread.new do
  puts "Killing #{forked_pid}"
  system("kill #{forked_pid}")
end

child_waiter.join
process_killer.join

from puma.

dentarg commented on May 18, 2024

@stanhu I just realised I might be experiencing deja-vu with this thing. I have a bunch of notes/links at https://github.com/dentarg/gists/tree/master/gists/ruby-bug-15499#ruby--puma-bug about "the ruby 2.6.0 wait bug", some comments:

Looks like #1741 implemented a workaround. Has something changed in Ruby yet again?

from puma.

stanhu commented on May 18, 2024

@dentarg Interesting! Given that ruby/ruby@054a412 was introduced in Ruby 2.6, I wonder if this broke Process.waitpid in a number of situations. With Ruby 3.3, that SIGCHLD implementation is gone, so I wonder if all these wait-related issues can be fixed without workarounds.

I see https://github.com/puma/puma/pull/1741/files#r266122715 mentions Process.waitpid(-1, Process::WNOHANG) was not working, and #3255 introduced this in Puma v6.4.0. This seems to work okay until you use Process.detach, but maybe there are more situations where it doesn't work.

from puma.

stanhu commented on May 18, 2024

I created https://bugs.ruby-lang.org/issues/20181, but I noticed https://bugs.ruby-lang.org/issues/19322 mentions this summary:

Programs doing waitpid -1 are bad and wrong, nobody should ever do that, if any code in your program does this anywhere, then Ruby should no longer make any guarantees about subprocess management working correctly in the entire process.

Programs doing waitpid -1 are widely deployed, it would be good if, when writing gems, there were APIs we could use which offer better isolation and composibility than the classic unix APIs, so that our gems work no matter what their containing processes are doing.

Gems should never be spawning child processes anyway.

from puma.

dentarg commented on May 18, 2024

@dentarg Interesting! Given that ruby/ruby@054a412 was introduced in Ruby 2.6, I wonder if this broke Process.waitpid in a number of situations. With Ruby 3.3, that SIGCHLD implementation is gone, so I wonder if all these wait-related issues can be fixed without workarounds.

Found another thing that probably relates to this? Just wanted to connect the dots.

https://github.com/puma/puma/pull/2767/files#r872970473

Well, I'm stuck. First of all, I don't like this dirty hack. Besides that, it goes into an infinity loop on the SIGCHLD, I don't know exactly why.
https://github.com/puma/puma/pull/2767/files#r873130086

I've just tested this hack independently of puma and it works fine on ruby 2.5 and fails on all next versions

from puma.

Puma cluster not reaping child processes when PID is 1 with Puma 6.4.1 about puma HOT 18 CLOSED

Comments (18)

`Dockerfile.puma`

`hello.ru`

`Dockerfile`

`listen.rb`

`test.sh`

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent