Comments (4)
Just noticed another thread was active at the time of the crash:
#0 0x00007ffff62338e6 in malloc_consolidate () from /lib64/libc.so.6
#1 0x00007ffff6235080 in _int_free () from /lib64/libc.so.6
#2 0x00007fffc6bdb2b2 in fzhashx_destroy (self_p=0x7fffaed030c0) at zhashx.c:162
#3 0x00007fffc6bcf3c6 in rlist_destroy (rl=0x7fffaed030b0) at rlist.c:115
#4 rlist_destroy (rl=0x7fffaed030b0) at rlist.c:109
#5 0x00007fffc6bcbac9 in prepare_sched_status_payload (status=<optimized out>, allocated=<optimized out>) at status.c:337
#6 sched_status_continuation (f=<optimized out>, arg=<optimized out>) at status.c:383
#7 0x00007ffff7b94a53 in ev_invoke_pending (loop=0x7fffb8002450) at ev.c:3770
#8 0x00007ffff7b98af8 in ev_run (flags=0, loop=0x7fffb8002450) at ev.c:4190
#9 ev_run (loop=0x7fffb8002450, flags=0) at ev.c:4021
#10 0x00007ffff7b677bf in flux_reactor_run (r=0x7fffb8002430, flags=flags@entry=0) at reactor.c:124
#11 0x00007fffc6bc521e in mod_main (h=0x7fffb8000be0, argc=<optimized out>, argv=<optimized out>) at resource.c:469
#12 0x0000000000411eef in module_thread (arg=0xe693b0) at module.c:209
#13 0x00007ffff792c1ca in start_thread () from /lib64/libpthread.so.0
#14 0x00007ffff61d5e73 in clone () from /lib64/libc.so.6
from flux-core.
There's a good probability this was fixed by #5937. Let's close this one and open a new issue if a crash occurs after 0.62.0 is released.
Another crash this morning:
Apr 28 05:33:09 elcap1 flux[1404111]: job-exec.info[0]: job-exception: id=fs7qNLbkNYF: resource allocation expired
Apr 28 06:04:11 elcap1 flux[1404111]: job-exec.info[0]: job-exception: id=fs85XSXPoqy: resource allocation expired
Apr 28 07:34:06 elcap1 flux[1404111]: malloc_consolidate(): unaligned fastbin chunk detected
Apr 28 07:34:07 elcap1 systemd[1]: flux.service: Main process exited, code=killed, status=6/ABRT
Apr 28 07:34:07 elcap1 systemd[1]: flux.service: Failed with result 'signal'.
For some reason, no coredump is listed in `coredumpctl`.
One difference between a typical large-scale test environment and a true system instance is that a system instance has to remotely execute `flux-imp kill SIGNAL PID` to terminate remote shells owned by other users, rather than using `flux_subprocess_kill(3)` directly.
Given this, I modified a version of flux to simulate this behavior by executing `kill -SIGNAL PID` instead of using `flux_subprocess_kill(3)`, and was able to reproduce a crash even at size=320.
Use of jemalloc with `junk:true` produced a stack trace that called out a use-after-free of a `struct bulk_exec` in `exit_batch_cb`. This led to the discovery that an active `exit_batch_timer` is not destroyed in `bulk_exec_destroy()`, which is clearly wrong. We probably get away with this most of the time because the timer is never active after all processes in a bulk-exec have exited. However, in the `bulk_exec_imp_kill()` implementation, destruction of the `bulk_exec` object is tied to the future returned from the function, so it is possible there is a race that causes destruction of the future while a timer is still active.
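To illustrate the ownership problem described above, here is a minimal, self-contained sketch. It does not use the flux-core or libev APIs: the toy reactor and every `sk_`-prefixed name are hypothetical stand-ins invented for this example, and only `exit_batch_cb` and `exit_batch_timer` echo names from the comment. The point is that the destructor must tear down any timer whose callback argument is the object being freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy timer watcher; NOT the flux or libev API. */
struct sk_timer {
    bool active;
    void (*cb)(void *arg);
    void *arg;
};

#define SK_MAX_TIMERS 8
static struct sk_timer *sk_timers[SK_MAX_TIMERS]; /* toy reactor registry */

struct sk_timer *sk_timer_new(void (*cb)(void *), void *arg)
{
    struct sk_timer *t = calloc(1, sizeof(*t));
    if (!t)
        return NULL;
    t->cb = cb;
    t->arg = arg;
    for (int i = 0; i < SK_MAX_TIMERS; i++) {
        if (!sk_timers[i]) {
            sk_timers[i] = t;
            return t;
        }
    }
    free(t); /* registry full */
    return NULL;
}

/* Unregister the timer from the reactor, then free it. */
void sk_timer_destroy(struct sk_timer *t)
{
    if (!t)
        return;
    for (int i = 0; i < SK_MAX_TIMERS; i++) {
        if (sk_timers[i] == t)
            sk_timers[i] = NULL;
    }
    free(t);
}

/* Fire each active timer once; return how many callbacks ran. */
int sk_reactor_run_once(void)
{
    int fired = 0;
    for (int i = 0; i < SK_MAX_TIMERS; i++) {
        struct sk_timer *t = sk_timers[i];
        if (t && t->active) {
            t->active = false;
            t->cb(t->arg);
            fired++;
        }
    }
    return fired;
}

/* Hypothetical, simplified stand-in for struct bulk_exec. */
struct sk_bulk_exec {
    int exit_batches;
    struct sk_timer *exit_batch_timer;
};

/* Dereferences the exec object: a use-after-free if it was already freed. */
void exit_batch_cb(void *arg)
{
    struct sk_bulk_exec *exec = arg;
    exec->exit_batches++;
}

struct sk_bulk_exec *sk_bulk_exec_create(void)
{
    struct sk_bulk_exec *exec = calloc(1, sizeof(*exec));
    if (!exec)
        return NULL;
    exec->exit_batch_timer = sk_timer_new(exit_batch_cb, exec);
    if (!exec->exit_batch_timer) {
        free(exec);
        return NULL;
    }
    return exec;
}

void sk_bulk_exec_arm_exit_timer(struct sk_bulk_exec *exec)
{
    assert(exec != NULL && exec->exit_batch_timer != NULL);
    exec->exit_batch_timer->active = true;
}

void sk_bulk_exec_destroy(struct sk_bulk_exec *exec)
{
    if (!exec)
        return;
    /* The step the comment identifies as missing: without it, an armed
     * timer would later call exit_batch_cb() with a dangling pointer. */
    sk_timer_destroy(exec->exit_batch_timer);
    free(exec);
}
```

In this sketch, arming the timer with `sk_bulk_exec_arm_exit_timer()` and then calling `sk_bulk_exec_destroy()` before the reactor runs leaves nothing for `sk_reactor_run_once()` to fire. If the `sk_timer_destroy()` call were omitted, the same sequence would invoke `exit_batch_cb()` on freed memory, which is the kind of access an allocator junk-fill option can surface.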