Comments (18)
Ok, our application has its own process supervisor that spawns a Prometheus metrics Web server. If I disable that, for some reason Process.wait2(-1, Process::WNOHANG)
works and reaps the processes properly.
- https://gitlab.com/gitlab-org/gitlab/-/blob/f9fbbadece7f62f10aac9d5fcc1401464bff1d7c/lib/gitlab/process_supervisor.rb
- https://gitlab.com/gitlab-org/gitlab/-/blob/f9fbbadece7f62f10aac9d5fcc1401464bff1d7c/lib/gitlab/process_management.rb
- https://gitlab.com/gitlab-org/gitlab/-/blob/f9fbbadece7f62f10aac9d5fcc1401464bff1d7c/metrics_server/metrics_server.rb#L46-48
Most likely we're trapping SIGCHLD
and interfering with the wait
.
from puma.
This bug was actually reported in https://bugs.ruby-lang.org/issues/19837 and fixed in the ruby_3_2
and ruby_3_1
stable branches, but there has yet to be a release with the fix.
from puma.
On my test instance with Puma 6.4.1, I ran kill -9 44
, and puma: cluster worker 0
did not come back:
git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
UID PID PPID C STIME TTY TIME CMD
git 1 0 0 Jan09 ? 00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git 23 1 0 Jan09 ? 00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
git 41 1 0 Jan09 ? 00:01:55 ruby /srv/gitlab/bin/metrics-server
git 44 1 0 Jan09 ? 00:02:41 [ruby] <defunct>
git 46 1 0 Jan09 ? 00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
git 48 1 0 Jan09 ? 00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
git 49 1 0 Jan09 ? 00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
git 5205 0 0 21:57 pts/0 00:00:00 bash
git 5331 5205 0 22:00 pts/0 00:00:00 ps -ef
With Puma 6.4.0, that worked fine:
git@gitlab-webservice-default-78664bb757-97skg:/$ ps -ef
UID PID PPID C STIME TTY TIME CMD
git 1 0 71 22:06 ? 00:00:36 puma 6.4.0 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git 22 1 0 22:06 ? 00:00:00 /usr/local/bin/gitlab-logger /var/log/gitlab
git 36 0 0 22:07 pts/0 00:00:00 bash
git 65 1 22 22:07 ? 00:00:02 ruby /srv/gitlab/bin/metrics-server
git 68 1 22 22:07 ? 00:00:02 puma: cluster worker 0: 1 [gitlab-puma-worker]
git 70 1 22 22:07 ? 00:00:02 puma: cluster worker 1: 1 [gitlab-puma-worker]
git 72 1 22 22:07 ? 00:00:02 puma: cluster worker 2: 1 [gitlab-puma-worker]
git 74 1 22 22:07 ? 00:00:02 puma: cluster worker 3: 1 [gitlab-puma-worker]
git 148 36 0 22:07 pts/0 00:00:00 ps -ef
git@gitlab-webservice-default-78664bb757-97skg:/$ kill -9 68
git@gitlab-webservice-default-78664bb757-97skg:/$ ps -ef
UID PID PPID C STIME TTY TIME CMD
git 1 0 66 22:06 ? 00:00:36 puma 6.4.0 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
git 22 1 0 22:06 ? 00:00:00 /usr/local/bin/gitlab-logger /var/log/gitlab
git 36 0 0 22:07 pts/0 00:00:00 bash
git 65 1 16 22:07 ? 00:00:02 ruby /srv/gitlab/bin/metrics-server
git 70 1 16 22:07 ? 00:00:02 puma: cluster worker 1: 1 [gitlab-puma-worker]
git 72 1 17 22:07 ? 00:00:02 puma: cluster worker 2: 1 [gitlab-puma-worker]
git 74 1 16 22:07 ? 00:00:02 puma: cluster worker 3: 1 [gitlab-puma-worker]
git 149 1 19 22:07 ? 00:00:00 puma: cluster worker 0: 1 [gitlab-puma-worker]
git 165 36 0 22:07 pts/0 00:00:00 ps -ef
from puma.
I added debugging messages, and it seems that Process.wait2(-1, Process::WNOHANG)
doesn't return anything when I ran kill <PID of worker>
. The process is in the defunct
state, so I'm a bit surprised that didn't work.
I applied this patch to get things working again:
diff --git a/lib/puma/cluster.rb b/lib/puma/cluster.rb
index 0d7c12bd..05d58445 100644
--- a/lib/puma/cluster.rb
+++ b/lib/puma/cluster.rb
@@ -562,7 +562,7 @@ module Puma
begin
# When `fork_worker` is enabled, some worker may not be direct children, but grand children.
# Because of this they won't be reaped by `Process.wait2(-1)`, so we need to check them individually)
- if reaped_children.delete(w.pid) || (@options[:fork_worker] && Process.wait(w.pid, Process::WNOHANG))
+ if reaped_children.delete(w.pid) || Process.wait(w.pid, Process::WNOHANG)
true
else
w.term if w.term?
from puma.
I should note that on Linux Docker, PID 1 seems to work fine:
#!/bin/env ruby
fork do
loop { sleep 1 }
end
loop do
puts Process.wait2(-1, Process::WNOHANG)
sleep 1
end
My Dockerfile
:
FROM ruby:3.1
COPY listen.rb .
RUN chmod +x listen.rb
ENTRYPOINT ["/listen.rb"]
If I run this container and forcibly kill the child:
% docker exec -it 3b53dc0dcbd5 bash
root@3b53dc0dcbd5:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 23:36 pts/0 00:00:00 ruby /listen.rb
root 7 1 0 23:36 pts/0 00:00:00 ruby /listen.rb
root 8 0 0 23:36 pts/1 00:00:00 bash
root 14 8 0 23:36 pts/1 00:00:00 ps -ef
root@3b53dc0dcbd5:/# kill 7
root@3b53dc0dcbd5:/# %
I see:
7
pid 7 SIGTERM (signal 15)
/listen.rb:8:in `wait2': No child processes (Errno::ECHILD)
from /listen.rb:8:in `block in <main>'
from /listen.rb:7:in `loop'
from /listen.rb:7:in `<main>'
from puma.
Strange, this worked fine with Kubernetes. I repeated the test above with a Google Kubernetes Engine pod:
% kubectl run listen-test --image=registry.gitlab.com/stanhu/lfs-test/listen-test:latest
pod/listen-test created
% kubectl exec -it listen-test -- bash
root@listen-test:/# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 05:44 ? 00:00:00 ruby /listen.rb
root 7 1 0 05:44 ? 00:00:00 ruby /listen.rb
root 21 0 0 05:46 pts/0 00:00:00 bash
root 27 21 0 05:46 pts/0 00:00:00 ps -ef
root@listen-test:/# kill 7
bash: kill: (7) - No such process
root@listen-test:/# command terminated with exit code 137
With kubectl logs -f listen-test
running, I see:
7
pid 7 SIGTERM (signal 15)
/listen.rb:8:in `wait2': No child processes (Errno::ECHILD)
from /listen.rb:8:in `block in <main>'
from /listen.rb:7:in `loop'
from /listen.rb:7:in `<main>'
I wonder what's different about Puma.
from puma.
Still can't replicate this issue with a simple pod running Puma:
Dockerfile.puma
FROM ruby:3.1
RUN gem install puma:6.4.1
COPY hello.ru .
ENTRYPOINT ["puma", "hello.ru", "-w", "2"]
hello.ru
hdrs = {'Content-Type'.freeze => 'text/plain'.freeze}.freeze
body = ['Hello World'.freeze].freeze
run lambda { |env| [200, hdrs, body] }
Process.wait2(-1, Process::WNOHANG)
is working fine in a pod as PID 1. I tried with a non-root user as well.
from puma.
Most likely we're trapping
SIGCHLD
and interfering with thewait
.
I only skimmed the code you linked quickly (got hundreds of mail to catch up to today), but this sound weird.
I'm not sure why trapping SIGCHLD
would make the wait
fail. But I suppose at this stage it's best to try to come up with a smaller repro so we can better understand what's going on, and see what we could do to make this more resilient.
from puma.
Yeah, I don't see any evidence we're trapping SIGCHLD
, and I've tried to add signal handlers to see if that changes anything. I can't reproduce the problem yet.
I did notice that the Ruby implementation for Process.wait2
seems to mention SIGCHLD
for some reason: https://github.com/ruby/ruby/blob/50c6cabadca44b7b034eae5dcc8017154a2858bd/process.c#L1343-L1348
from puma.
Interesting, that was removed in 3.3: ruby/ruby#7527
Seems like it was converting a blocking waitpid
into a non blocking one by waiting for SIGCHLD, sounds quite brittle to me, but I don't have the full context.
from puma.
Ok, it looks like in Ruby 3.1 and 3.2 Process.detach(<some PID != 1>
) appears to prevent Process.wait2(-1, Process::WNOHANG)
from finding child processes when the parent PID is 1.
The problem doesn't happen in Ruby 3.3. I wonder if ruby/ruby#7476 or ruby/ruby#7527 fixed this.
I'll update the comments in #3314 in light of this, so I think that pull request is still a good idea.
Here's a sample reproduction:
Dockerfile
FROM ruby:3.2
COPY listen.rb .
COPY test.sh .
RUN chmod +x listen.rb
RUN chmod +x test.sh
ENTRYPOINT ["/listen.rb"]
listen.rb
#!/bin/env ruby
fork do
loop { sleep 1 }
end
Process.spawn({}, "./test.sh", err: $stderr, out: $stdout, pgroup: true).tap do |pid|
STDERR.puts "detaching PID #{pid}"
Process.detach(pid)
end
loop do
STDERR.puts Process.wait2(-1, Process::WNOHANG)
sleep 1
end
test.sh
#!/bin/sh
sleep 600
from puma.
It appears that Process.detach
simply spawns a separate thread that does a blocking waitpid
. I think this uses the SIGCHLD
implementation introduced in ruby/ruby@054a412. This comment in the commit message is telling:
We also work to suppress false-positives from Process.wait(-1, Process::WNOHANG) to quiets warnings from spec/ruby/core/process/wait2_spec.rb with MJIT enabled.
This makes me think that this code is suppressing child PIDs when the parent is PID 1.
from puma.
I've confirmed that this Ruby SIGCHLD
business is responsible. I disabled WAITPID_USE_SIGCHLD
in my Ruby 3.1.4 interpreter, and Process.wait(-1, Process::WNOHANG)
started working again:
diff --git a/vm_core.h b/vm_core.h
index 1cc0659700..0e7d1643fe 100644
--- a/vm_core.h
+++ b/vm_core.h
@@ -126,7 +126,7 @@
#endif
/* define to 0 to test old code path */
-#define WAITPID_USE_SIGCHLD (RUBY_SIGCHLD || SIGCHLD_LOSSY)
+#define WAITPID_USE_SIGCHLD 0
#if defined(SIGSEGV) && defined(HAVE_SIGALTSTACK) && defined(SA_SIGINFO) && !defined(__NetBSD__)
# define USE_SIGALTSTACK
from puma.
It appears that only Process.detach
is only needed; PID 1 is not relevant. This Ruby script will get stuck in Ruby 3.1.4 and 3.2.2, but exits immediately in Ruby 3.3.0:
#!/bin/env ruby
forked_pid = fork do
loop { sleep 1 }
end
Process.spawn({}, "sh -c 'sleep 60'", err: $stderr, out: $stdout).tap do |pid|
puts "detaching PID #{pid}"
Process.detach(pid)
end
child_waiter = Thread.new do
puts "Waiting for child process to die..."
# This works
# puts Process.wait2(forked_pid)
# This fails in Ruby 3.1 and 3.2
puts Process.wait2(-1)
end
process_killer = Thread.new do
puts "Killing #{forked_pid}"
system("kill #{forked_pid}")
end
child_waiter.join
process_killer.join
from puma.
@stanhu I just realised I might be experiencing deja-vu with this thing. I have a bunch of notes/links at https://github.com/dentarg/gists/tree/master/gists/ruby-bug-15499#ruby--puma-bug about "the ruby 2.6.0 wait bug", some comments:
Looks like #1741 implemented a workaround. Has something changed in Ruby yet again?
from puma.
@dentarg Interesting! Given that ruby/ruby@054a412 was introduced in Ruby 2.6, I wonder if this broke Process.waitpid
in a number of situations. With Ruby 3.3, that SIGCHLD
implementation is gone, so I wonder if all these wait
-related issues can be fixed without workarounds.
I see https://github.com/puma/puma/pull/1741/files#r266122715 mentions Process.waitpid(-1, Process::WNOHANG)
was not working, and #3255 introduced this in Puma v6.4.0. This seems to work okay until you use Process.detach
, but maybe there are more situations where it doesn't work.
from puma.
I created https://bugs.ruby-lang.org/issues/20181, but I noticed https://bugs.ruby-lang.org/issues/19322 mentions this summary:
- Programs doing waitpid -1 are bad and wrong, nobody should ever do that, if any code in your program does this anywhere, then Ruby should no longer make any guarantees about subprocess management working correctly in the entire process.
- Programs doing waitpid -1 are widely deployed, it would be good if, when writing gems, there were APIs we could use which offer better isolation and composibility than the classic unix APIs, so that our gems work no matter what their containing processes are doing.
- Gems should never be spawning child processes anyway.
from puma.
@dentarg Interesting! Given that ruby/ruby@054a412 was introduced in Ruby 2.6, I wonder if this broke
Process.waitpid
in a number of situations. With Ruby 3.3, thatSIGCHLD
implementation is gone, so I wonder if all thesewait
-related issues can be fixed without workarounds.
Found another thing that probably relates to this? Just wanted to connect the dots.
- https://github.com/puma/puma/pull/2767/files#r872970473
Well, I'm stuck. First of all, I don't like this dirty hack. Besides that, it goes into an infinity loop on the SIGCHLD, I don't know exactly why.
- https://github.com/puma/puma/pull/2767/files#r873130086
I've just tested this hack independently of puma and it works fine on ruby 2.5 and fails on all next versions
from puma.
Related Issues (20)
- Windows: puma_http11.so breaking when ruby folder is relocated HOT 16
- Reactor suddenly throws NoMethodError but can't find reason HOT 6
- `--idle-timeout` should not shut down entire cluster if any single worker is idle HOT 1
- Puma responds with a 501 Server Error when a client sends invalid HTTP methods HOT 6
- Listen for HTTP and HTTPS on the same port, redirecting clients to HTTPS for easier local development HOT 4
- require puma/dsl not on the top HOT 2
- Phased restart & symlinks doesn't change destination folder? HOT 3
- Allow alternative schemes in Binder
- Support Ruby 3.3's `Process.warmup` before fork HOT 14
- Error responses written twice HOT 1
- Specify multiple licences in the gemspec? HOT 17
- High latency on puma HOT 2
- Puma and Puma Dev Linux issue
- Phased-Restart Causes Action Cable Connection Loss on Rails App Restart HOT 3
- Add a hook for "before shutdown" HOT 6
- Systemd watchdog kills puma during hot restart. HOT 4
- NoMethodError: undefined method `release' for Rack:Module error at "rails s" initialization time HOT 5
- Staging environment, request.ENV['HTTP_HOST'] displays 2 urls when it should be a single url HOT 1
- HttpParserError raised when body (chunk) exceeds 4096 bytes does not end with "\r\n" HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from puma.