Comments (15)
@TheNumbat I just tried your proposed fix (ocaml/ocaml#12886) and unfortunately it doesn't fix this issue, so maybe they are not the same issue after all?
from eio.
Can you get a `strace` (or mac equivalent) log for it? Does it still happen with `n_domains = 1`?

Could also try with `OCAMLRUNPARAM=b`, though I guess the error is trying to read the process output. The main interesting questions are: is it reading from the correct FD, and did it close the FD too soon?
> Does it still happen with `n_domains = 1`?

No, it does not fail with `n_domains = 1`.

Interestingly, setting `OCAMLRUNPARAM=b` makes the failures occur about 3x less frequently, similar to running the stress test on the `pool-systhreads` branch. It seems the extra overhead creates enough "spacing" to make the race condition less frequent?

Similarly, `sudo dtruss _build/default/stress/stress_proc.exe` (`dtruss` being the `strace` equivalent) slows things down or creates enough synchronization that the bug simply does not occur when observed under `dtruss`.

I've tried with more domains, fewer domains, more procs per domain, all kinds of combinations, etc. I can't get it to crash under `dtruss`. Do you think there's any value in looking at the `dtruss` output of successful runs?
> Could also try with `OCAMLRUNPARAM=b`, though I guess the error is trying to read the process output.
On current `main` (`a26c3ebccc5dcf1480aa72c39dd645b43ab32db2`), the backtrace is:
```
Fatal error: exception Failure("Unexpected end-of-file at offset 0")
Raised at Stdlib.failwith in file "stdlib.ml", line 29, characters 17-33
Called from Eio__Process.parse_out.(fun) in file "lib_eio/process.ml", line 154, characters 17-61
Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 134, characters 8-12
Re-raised at Eio__core__Switch.maybe_raise_exs in file "lib_eio/core/switch.ml", line 118, characters 21-56
Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 150, characters 4-21
Called from Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 115, characters 8-12
Re-raised at Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 117, characters 32-40
Called from Dune__exe__Stress_proc.run_in_domain.(fun) in file "stress/stress_proc.ml", line 12, characters 21-27
Called from Eio__core__Fiber.fork.(fun) in file "lib_eio/core/fiber.ml", line 20, characters 10-14
Re-raised at Eio__core__Switch.maybe_raise_exs in file "lib_eio/core/switch.ml", line 118, characters 21-56
Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 139, characters 4-21
Called from Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 115, characters 8-12
Re-raised at Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 117, characters 32-40
Called from Eio__core__Fiber.any.(fun).wrap in file "lib_eio/core/fiber.ml", line 97, characters 16-20
Re-raised at Eio__core__Fiber.any in file "lib_eio/core/fiber.ml", line 138, characters 26-61
Called from Eio__Domain_manager.run.(fun) in file "lib_eio/domain_manager.ml", line 29, characters 4-231
Re-raised at Eio__Domain_manager.run.(fun) in file "lib_eio/domain_manager.ml", line 43, characters 11-19
Called from Eio_posix__Domain_mgr.wrap_backtrace in file "lib_eio_posix/domain_mgr.ml", line 80, characters 8-12
Re-raised at Eio_posix__Domain_mgr.unwrap_backtrace in file "lib_eio_posix/domain_mgr.ml", line 88, characters 22-57
Called from Eio__core__Fiber.fork.(fun) in file "lib_eio/core/fiber.ml", line 20, characters 10-14
Re-raised at Eio__core__Switch.maybe_raise_exs in file "lib_eio/core/switch.ml", line 118, characters 21-56
Called from Eio__core__Switch.run_internal in file "lib_eio/core/switch.ml", line 139, characters 4-21
Called from Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 115, characters 8-12
Re-raised at Eio__core__Cancel.with_cc in file "lib_eio/core/cancel.ml", line 117, characters 32-40
Called from Dune__exe__Stress_proc.main in file "stress/stress_proc.ml", line 21, characters 4-232
Called from Eio_posix__Sched.with_op in file "lib_eio_posix/sched.ml", line 310, characters 8-12
Re-raised at Eio_posix__Sched.with_op in file "lib_eio_posix/sched.ml", line 316, characters 4-12
Called from Eio_posix__Sched.run.(fun) in file "lib_eio_posix/sched.ml", line 368, characters 25-43
Re-raised at Eio_posix__Sched.run.fork.(fun) in file "lib_eio_posix/sched.ml", line 331, characters 12-76
Called from Stdlib__Fun.protect in file "fun.ml", line 33, characters 8-15
Re-raised at Stdlib__Fun.protect in file "fun.ml", line 38, characters 6-52
Called from Eio_posix__Sched.run in file "lib_eio_posix/sched.ml", line 364, characters 4-217
Called from Eio_posix__Sched.with_sched in file "lib_eio_posix/sched.ml", line 250, characters 8-12
Re-raised at Eio_posix__Sched.with_sched in file "lib_eio_posix/sched.ml", line 255, characters 4-39
Called from Dune__exe__Stress_proc in file "stress/stress_proc.ml", line 34, characters 2-71
```
It would be good to find a way of making it fail more often. e.g. the example runs a shell; what happens if you call `/bin/echo` directly? Or, if making the process faster doesn't help, what about the reverse, and putting a `sleep` in it?

I tried running it on Linux with eio_posix and TSan and that didn't report anything.

Another way to make it fail sooner might be to replace the use of the high-level `Eio.Process` API with `Eio_posix.Low_level.Process`. That would also narrow down where the error could be.
i. If there is a correlation between the number of domains and the failure rate, then that correlation is a fairly weak one.

ii. The fastest and most reliable (on average) way I've found to make it fail is `["sh"; "-c"; "sleep 0.25 && echo " ^ string_of_int n]`.

iii. Calling `/bin/echo` directly is fast enough to avoid the race condition, but it fails almost as much as ii. if I have 3 cores nearly maxed out with 3 `yes > /dev/null &` running in the background.

iv. I noticed that if I start recompiling `stress_proc.exe` while it's running, the "jitter" introduced by the sudden addition of a CPU-bound process (and probably the OS moving threads around between cores) is enough to make it instantly fail 50% of the time.

v. Running it under TSan often (but not always) makes it stall after 10 seconds: it's still consuming CPU but not visibly progressing through test rounds. Sometimes it also fails the same way as it does without TSan.
Here's my version using `Eio_posix.Low_level.Process`; it fails with `WSIGNALED SIGILL`. Sometimes after 20+ rounds, often in under 5 rounds.

ASAN fails almost immediately with:

```
==35289==ERROR: AddressSanitizer failed to deallocate 0x20000 (131072) bytes at address 0x000104f31800
AddressSanitizer: CHECK failed: sanitizer_posix.cpp:61 "(("unable to unmap" && 0)) != (0)" (0x0, 0x0) (tid=2740376)
<empty stack>
Illegal instruction: 4
```
So the child process is dying with SIGILL and your `failwith` is getting called to display the error? Do you get a coredump with that?

It looks like it must be either the forked launcher (e.g. `action_execve`) crashing, or `sh` itself (which seems unlikely).

What happens if you remove the pipe stuff (`inherit_fds`, `read_all pipe1_r`)?
(I edited the previous message with the ASAN information; make sure to refresh the page.)

Yeah, the child process dies with SIGILL and `failwith` is just used to display the error. I do not get a coredump.

It fails the same way when I use `bash` instead of `sh`. Same thing when I remove everything pipe-related.
Maybe coredumps need to be enabled: https://stackoverflow.com/a/2080938/50926
(not sure if OCaml is expected to work with ASAN - does it work with other programs?)
I've had ASAN successfully detect a race condition in C++ code invoked from OCaml before.
I totally forgot about `ulimit -c unlimited`! Once enabled, I got a 3.9 GB core dump. Using `lldb` (since `gdb` doesn't understand macOS core dumps):

```
lldb _build/default/stress/stress_proc.exe -c /cores/core.91104
```

I got the following backtrace!
```
* thread #1
  * frame #0: 0x00007ff811be7f23 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 37
    frame #1: 0x00007ff811be22e4 libsystem_platform.dylib`_os_unfair_lock_lock_slow + 257
    frame #2: 0x00007ff811a9e013 libsystem_c.dylib`localeconv_l + 55
    frame #3: 0x00007ff811a99b93 libsystem_c.dylib`__vfprintf + 113
    frame #4: 0x00007ff811acc868 libsystem_c.dylib`_vsnprintf + 249
    frame #5: 0x00007ff811aa72d0 libsystem_c.dylib`vsnprintf_l + 41
    frame #6: 0x00007ff811aa7389 libsystem_c.dylib`snprintf + 156
    frame #7: 0x00007ff81c6948d2 libsystem_coreservices.dylib`_dirhelper_init + 204
    frame #8: 0x00007ff811be1726 libsystem_platform.dylib`_os_once_callout + 18
    frame #9: 0x00007ff81c9c7c7f libSystem.B.dylib`libSystem_atfork_child + 48
    frame #10: 0x00007ff811ac282d libsystem_c.dylib`fork + 84
    frame #11: 0x000000010fb2f6d1 stress_proc.exe`caml_eio_posix_spawn(v_errors=271, v_actions=<unavailable>) at eio_posix_stubs.c:390:15 [opt]
    frame #12: 0x000000010fb6c4d7 stress_proc.exe`caml_c_call + 27
    frame #13: 0x000000010fa6cbf7 stress_proc.exe`camlEio_posix__Low_level.fun_2486 + 23
    frame #14: 0x000000010fa74d3a stress_proc.exe`camlEio_unix__Rcfd.use_609 + 106
    frame #15: 0x000000010fa6ca4d stress_proc.exe`camlEio_posix__Low_level.fun_2474 + 189
    frame #16: 0x000000010faa3a31 stress_proc.exe`camlEio__core__Switch.run_internal_1059 + 49
    frame #17: 0x000000010faa1822 stress_proc.exe`camlEio__core__Cancel.with_cc_1009 + 210
    frame #18: 0x000000010fa66895 stress_proc.exe`camlDune__exe__Stress_proc._echo2_532 + 421
    frame #19: 0x000000010fa66a23 stress_proc.exe`camlDune__exe__Stress_proc.fun_1442 + 35
    frame #20: 0x000000010faa8843 stress_proc.exe`camlEio__core__Fiber.fun_1348 + 67
    frame #21: 0x000000010fb128cb stress_proc.exe`camlStdlib__Fun.protect_320 + 107
    frame #22: 0x000000010fb6c804 stress_proc.exe`caml_runstack + 68
    frame #23: 0x000000010fb128cb stress_proc.exe`camlStdlib__Fun.protect_320 + 107
    frame #24: 0x000000010fa693dc stress_proc.exe`camlEio_posix__Sched.run_1324 + 284
    frame #25: 0x000000010fa68aef stress_proc.exe`camlEio_posix__Sched.with_sched_1189 + 543
    frame #26: 0x000000010fa66f0a stress_proc.exe`camlDune__exe__Stress_proc.entry + 250
    frame #27: 0x000000010fa638cb stress_proc.exe`caml_program + 2059
    frame #28: 0x000000010fb6c590 stress_proc.exe`caml_start_program + 112
    frame #29: 0x000000010fb6bdcb stress_proc.exe`caml_startup_common(argv=0x00007ff7b049d1e8, pooling=<unavailable>) at startup_nat.c:132:9 [opt]
    frame #30: 0x000000010fb6be3b stress_proc.exe`caml_main [inlined] caml_startup_exn(argv=<unavailable>) at startup_nat.c:139:10 [opt]
    frame #31: 0x000000010fb6be34 stress_proc.exe`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:144:15 [opt]
    frame #32: 0x000000010fb6be34 stress_proc.exe`caml_main(argv=<unavailable>) at startup_nat.c:151:3 [opt]
    frame #33: 0x000000010fb592ac stress_proc.exe`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #34: 0x000000011bdc252e dyld`start + 462
```
So yeah, this line. I'm now trying to find out what can cause `_os_unfair_lock_corruption_abort`.
It's difficult to find information on this `_os_unfair_lock_corruption_abort` error coming from `fork`, but I found this: https://developer.apple.com/forums/thread/737464

Also this: https://hacks.mozilla.org/2022/10/improving-firefox-responsiveness-on-macos/

> beware that it doesn't play well with fork(). That's because the lock stores internally the mach thread IDs to ensure consistent acquisition and release.
> These IDs change after a call to fork() as the thread creates new ones when copying your process' threads. This can lead to potential crashes in the child process. If your application uses fork(), or your library needs to be fork()-safe you'll need to register at-fork handlers using pthread_atfork() to acquire all the locks in the parent before the fork, then release them after the fork (also in the parent), and reset them in the child.
However:

`eio/lib_eio/unix/fork_action.c`, lines 1 to 5 in `5e014fc`
EDIT: This OCaml 4.14 issue contains information regarding the pitfalls of mixing `fork` and pthreads, and advocates for `posix_spawn` instead of `fork`: ocaml/ocaml#12636
More interesting discussions: ocaml/ocaml-lsp#1194 and ocaml/dune#8090
@talex5 this PR (less than 30 minutes old!) may be related: ocaml/ocaml#12886

Quoting myself:

> but it fails almost as much as ii. if I have 3 cores nearly maxed out with 3 `yes > /dev/null &` running in the background

The PR mentions thread switches causing issues, and that extra load of `yes` could definitely explain that.
Yeah, this sounds related. I figured out that code was missing while debugging a forked process that aborted after failing to lock a locked pthreads mutex. The macOS equivalent appears to be `_os_unfair_lock_corruption_abort`.
Does that only apply to `Unix.fork`? We're not making any use of OCaml after forking, so I think it shouldn't matter. But the stacktrace shows macOS using `localeconv` within the call to `fork`, which seems bad.

https://stackoverflow.com/questions/58076064/forkexec-process-with-threads-broken-on-macos-children-process-stuck-in-atf says "So far it seems that fork in multi-threaded environment is just broken on macOS."

http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html suggests that `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` might help.
`OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` did not make a difference; I tried it both at build time and run time.

Every solution to this problem seems to mention the need to use `posix_spawn` on macOS instead of `fork`. I haven't looked at what that would entail yet.
I cannot explain it with certainty, but the problem is gone. I've tried to replicate it on the Eio commit where I first encountered the issue. It's not happening anymore, whether I use the OCaml trunk branch or any of the numerous open OCaml PRs addressing various systhreads issues.
I've installed a large OS update recently (same major version though) and I'm going to tentatively assume it contained a fix or mitigation in some way.
I'll reopen this issue if it reoccurs.