We've had a couple rank 0 broker crashes on elcapi today. On the second we got a coref

rank 0 broker crash on elcapi about flux-core HOT 4 CLOSED

grondo commented on July 17, 2024

rank 0 broker crash on elcapi

from flux-core.

Comments (4)

grondo commented on July 17, 2024

Just noticed another thread was active at the time of the crash:

#0  0x00007ffff62338e6 in malloc_consolidate () from /lib64/libc.so.6
#1  0x00007ffff6235080 in _int_free () from /lib64/libc.so.6
#2  0x00007fffc6bdb2b2 in fzhashx_destroy (self_p=0x7fffaed030c0)
    at zhashx.c:162
#3  0x00007fffc6bcf3c6 in rlist_destroy (rl=0x7fffaed030b0) at rlist.c:115
#4  rlist_destroy (rl=0x7fffaed030b0) at rlist.c:109
#5  0x00007fffc6bcbac9 in prepare_sched_status_payload (
    status=<optimized out>, allocated=<optimized out>) at status.c:337
#6  sched_status_continuation (f=<optimized out>, arg=<optimized out>)
    at status.c:383
#7  0x00007ffff7b94a53 in ev_invoke_pending (loop=0x7fffb8002450) at ev.c:3770
#8  0x00007ffff7b98af8 in ev_run (flags=0, loop=0x7fffb8002450) at ev.c:4190
#9  ev_run (loop=0x7fffb8002450, flags=0) at ev.c:4021
#10 0x00007ffff7b677bf in flux_reactor_run (r=0x7fffb8002430, 
    flags=flags@entry=0) at reactor.c:124
#11 0x00007fffc6bc521e in mod_main (h=0x7fffb8000be0, argc=<optimized out>, 
    argv=<optimized out>) at resource.c:469
#12 0x0000000000411eef in module_thread (arg=0xe693b0) at module.c:209
#13 0x00007ffff792c1ca in start_thread () from /lib64/libpthread.so.0
#14 0x00007ffff61d5e73 in clone () from /lib64/libc.so.6

grondo commented on July 17, 2024

There's a good probability this was fixed by #5937. Let's close this one and open a new issue if a crash occurs after 0.62.0 is released.

grondo commented on July 17, 2024

Another crash this morning:

Apr 28 05:33:09 elcap1 flux[1404111]: job-exec.info[0]: job-exception: id=fs7qNLbkNYF: resource allocation expired
Apr 28 06:04:11 elcap1 flux[1404111]: job-exec.info[0]: job-exception: id=fs85XSXPoqy: resource allocation expired
Apr 28 07:34:06 elcap1 flux[1404111]: malloc_consolidate(): unaligned fastbin chunk detected
Apr 28 07:34:07 elcap1 systemd[1]: flux.service: Main process exited, code=killed, status=6/ABRT
Apr 28 07:34:07 elcap1 systemd[1]: flux.service: Failed with result 'signal'.

For some reason, no coredump is listed in coredumpctl.

grondo commented on July 17, 2024

One difference between a typical large-scale test environment and a true system instance is that a system instance has to remotely execute flux-imp kill SIGNAL PID to terminate remote shells owned by other users, rather than using flux_subprocess_kill(3) directly.

Given this, I modified a copy of flux to simulate that path by executing kill -SIGNAL PID instead of calling flux_subprocess_kill(3), and was able to reproduce a crash even at size=320.
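For readers unfamiliar with the distinction, here is a minimal sketch of signaling a process indirectly, by running a kill command in a helper process, instead of calling kill(2) in-process. It is not the modification made to flux: the helper names (kill_via_command, demo_indirect_kill) are hypothetical, and it assumes /bin/sh with a kill builtin is available. The point is only that the indirect path adds an extra process whose completion is asynchronous with respect to the target's exit.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: deliver a signal by running "kill -SIGNUM PID" in a
 * separate process rather than calling kill(2) directly.
 * Returns the exit code of the helper command, or -1 on error. */
static int kill_via_command (pid_t target, int signum)
{
    char cmd[64];
    int status;
    pid_t helper;

    snprintf (cmd, sizeof (cmd), "kill -%d %ld", signum, (long)target);
    if ((helper = fork ()) < 0)
        return -1;
    if (helper == 0) {
        /* Assumption: /bin/sh exists and provides a kill builtin */
        execl ("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit (127); /* exec failed */
    }
    if (waitpid (helper, &status, 0) < 0 || !WIFEXITED (status))
        return -1;
    return WEXITSTATUS (status);
}

/* Start a long-lived child, signal it indirectly, and return the signal
 * that terminated it (or -1 on error). */
static int demo_indirect_kill (int signum)
{
    int status;
    pid_t child = fork ();

    if (child < 0)
        return -1;
    if (child == 0) {
        pause (); /* sleep until a signal arrives */
        _exit (0);
    }
    if (kill_via_command (child, signum) != 0) {
        /* Helper failed: clean up the child directly */
        kill (child, SIGKILL);
        waitpid (child, NULL, 0);
        return -1;
    }
    if (waitpid (child, &status, 0) < 0 || !WIFSIGNALED (status))
        return -1;
    return WTERMSIG (status);
}
```

Calling demo_indirect_kill (SIGTERM) should report that the child was terminated by SIGTERM, delivered by the helper process rather than by the caller.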

Running under jemalloc with junk:true produced a stack trace that called out a use-after-free of a struct bulk_exec in exit_batch_cb. This led to the discovery that an active exit_batch_timer is not destroyed in bulk_exec_destroy(), which is clearly wrong. We probably get away with this most of the time because the timer is never active after all processes in a bulk-exec have exited. However, in the bulk_exec_imp_kill() implementation, destruction of the bulk_exec object is tied to the future returned from the function, so a race is possible in which the future, and with it the bulk_exec, is destroyed while the timer is still active.
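The lifetime rule described above can be sketched in isolation. The following is a toy model, not flux-core code: the reactor, the timer type, and every name in it (toy_timer_*, exec_obj, toy_exit_batch_cb) are made up for illustration. It shows an object that owns a timer whose callback holds a pointer back to the object, and a destructor that tears the timer down before freeing the object so a still-armed timer can never fire on freed memory.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy single-shot timer standing in for a reactor timer watcher. */
struct toy_timer {
    bool active;
    void (*cb) (void *arg);
    void *arg;
};

/* Toy reactor: a fixed table of registered timers. */
#define MAX_TIMERS 8
static struct toy_timer *timers[MAX_TIMERS];

static struct toy_timer *toy_timer_create (void (*cb) (void *), void *arg)
{
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (!timers[i]) {
            struct toy_timer *t = calloc (1, sizeof (*t));
            if (!t)
                return NULL;
            t->cb = cb;
            t->arg = arg;
            timers[i] = t;
            return t;
        }
    }
    return NULL;
}

/* Unregister the timer from the reactor and free it. */
static void toy_timer_destroy (struct toy_timer *t)
{
    if (!t)
        return;
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (timers[i] == t)
            timers[i] = NULL;
    }
    free (t);
}

/* Fire every armed timer once; returns the number of callbacks invoked. */
static int toy_reactor_run_once (void)
{
    int fired = 0;
    for (int i = 0; i < MAX_TIMERS; i++) {
        struct toy_timer *t = timers[i];
        if (t && t->active) {
            t->active = false;
            t->cb (t->arg);
            fired++;
        }
    }
    return fired;
}

/* Hypothetical object that owns a timer pointing back at itself. */
struct exec_obj {
    int batches;
    struct toy_timer *exit_batch_timer;
};

static void toy_exit_batch_cb (void *arg)
{
    struct exec_obj *obj = arg;
    obj->batches++; /* would be a use-after-free if obj were already freed */
}

static struct exec_obj *exec_obj_create (void)
{
    struct exec_obj *obj = calloc (1, sizeof (*obj));
    if (!obj)
        return NULL;
    obj->exit_batch_timer = toy_timer_create (toy_exit_batch_cb, obj);
    if (!obj->exit_batch_timer) {
        free (obj);
        return NULL;
    }
    return obj;
}

static void exec_obj_destroy (struct exec_obj *obj)
{
    if (obj) {
        /* Destroy the timer first: without this line an armed timer would
         * outlive obj and its callback would touch freed memory. */
        toy_timer_destroy (obj->exit_batch_timer);
        free (obj);
    }
}

/* Destroy an object while its timer is still armed, then run the reactor.
 * Returns the number of callbacks that fired, or -1 on allocation failure. */
static int demo_destroy_with_active_timer (void)
{
    struct exec_obj *obj = exec_obj_create ();
    if (!obj)
        return -1;
    obj->exit_batch_timer->active = true; /* arm the timer */
    exec_obj_destroy (obj);
    return toy_reactor_run_once ();
}
```

With the timer torn down in the destructor, demo_destroy_with_active_timer () reports that no callback fired. Dropping the toy_timer_destroy () call from exec_obj_destroy () reproduces the class of bug described above, which an allocator with junk filling or a tool such as AddressSanitizer would flag.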
