Comments (4)

grondo commented on May 28, 2024

Ok, this is unfortunate, but not entirely surprising. I will try to look for
a good debugging strategy for this.

from flux-core.

grondo commented on May 28, 2024

I was finally able to get enough nodes on hype to reproduce. The hang is definitely reproducible, and it appears that it is not the job that hangs but flux-wreckrun itself. From a backtrace, wreckrun appears to be stuck in a callback for a kzio object (stdio stream).

I have never seen this kind of lockup before, so I am wondering if recent changes in
the Lua code are causing this. I also wonder if @garlick has any insight into how we might
be blocked here. This could be related to #83, but I don't want to rule out a bug in the Lua
bindings, since they were the most recently touched code.

(gdb) where
#0  0x00002aaaab8c81b3 in __poll (fds=<value optimized out>, 
    nfds=<value optimized out>, timeout=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00002aaaac891386 in ?? () from /usr/lib64/libzmq.so.3
#2  0x00002aaaac87c6fe in ?? () from /usr/lib64/libzmq.so.3
#3  0x00002aaaac891d04 in ?? () from /usr/lib64/libzmq.so.3
#4  0x00002aaaac892140 in ?? () from /usr/lib64/libzmq.so.3
#5  0x00002aaaac8a60fa in ?? () from /usr/lib64/libzmq.so.3
#6  0x00002aaaacad23c5 in zframe_send () from /usr/lib64/libczmq.so.1
#7  0x00002aaaacad8e9d in zmsg_send () from /usr/lib64/libczmq.so.1
#8  0x00002aaaabfd8b7a in dq_put (dq=<value optimized out>, 
    zmsg=0x7fffffffcd38, typemask=<value optimized out>)
    at ../../../../src/modules/api/libapi.c:147
#9  0x00002aaaabfddc4e in flux_response_matched_recvmsg (h=0x61bb90, 
    match=<value optimized out>, nb=false)
    at ../../../../src/common/libflux/request.c:96
#10 0x00002aaaabfddded in flux_rpc (h=0x61bb90, request=<value optimized out>, 
    fmt=<value optimized out>) at ../../../../src/common/libflux/request.c:125
#11 0x00002aaaabfd7610 in kvs_get (h=0x61bb90, key=<value optimized out>, 
    valp=0x7fffffffceb8) at ../../../../src/modules/kvs/libkvs.c:286
#12 0x00002aaaabdb1526 in getnext (kz=0x650af0)
    at ../../../../src/common/libzio/kz.c:233
#13 0x00002aaaabdb16b5 in kz_get (kz=0x650af0, datap=0x7fffffffcf08)
    at ../../../../src/common/libzio/kz.c:293
#14 0x00002aaaabdaad9e in iowatcher_kz_ready_cb (kz=0x650af0, 
    arg=<value optimized out>) at ../../../../src/bindings/lua/flux-lua.c:1060
#15 0x00002aaaabdb104b in kvswatch_cb (key=<value optimized out>, 
    dir=<value optimized out>, arg=<value optimized out>, 
    errnum=<value optimized out>) at ../../../../src/common/libzio/kz.c:358
#16 0x00002aaaabfd5ecb in dispatch_watch (h=<value optimized out>, 
    wp=0x650c90, key=0x6b38f0 "lwj.3.25.stdout", val=<value optimized out>)
    at ../../../../src/modules/kvs/libkvs.c:532
#17 0x00002aaaabfd610e in watch_rep_cb (h=0x61bb90, 
    typemask=<value optimized out>, zmsg=0x7fffffffd028, arg=0x646010)
    at ../../../../src/modules/kvs/libkvs.c:557
#18 0x00002aaaabfdae97 in flux_handle_event_msg (h=0x61bb90, typemask=2, 
    zmsg=0x7fffffffd028) at ../../../../src/common/libflux/handle.c:330
#19 0x00002aaaabfd8e3d in dq_resp_cb (zl=<value optimized out>, 
    item=<value optimized out>, arg=<value optimized out>)
    at ../../../../src/modules/api/libapi.c:88
#20 0x00002aaaacad6e54 in zloop_start () from /usr/lib64/libczmq.so.1
#21 0x00002aaaabfd89e8 in cmb_reactor_start (impl=<value optimized out>)
    at ../../../../src/modules/api/libapi.c:235
#22 0x00002aaaabdad511 in l_flux_reactor_start (L=0x603010)
    at ../../../../src/bindings/lua/flux-lua.c:1308
#23 0x00002aaaaacd95a1 in ?? () from /usr/lib64/liblua-5.1.so
#24 0x00002aaaaace4229 in ?? () from /usr/lib64/liblua-5.1.so
#25 0x00002aaaaacd9a6d in ?? () from /usr/lib64/liblua-5.1.so
#26 0x00002aaaaacd9107 in ?? () from /usr/lib64/liblua-5.1.so
#27 0x00002aaaaacd9182 in ?? () from /usr/lib64/liblua-5.1.so
#28 0x00002aaaaacd4b61 in lua_pcall () from /usr/lib64/liblua-5.1.so
#29 0x0000000000401526 in ?? ()
#30 0x0000000000401fe9 in ?? ()
#31 0x00002aaaaacd95a1 in ?? () from /usr/lib64/liblua-5.1.so
#32 0x00002aaaaacd9a24 in ?? () from /usr/lib64/liblua-5.1.so
#33 0x00002aaaaacd9107 in ?? () from /usr/lib64/liblua-5.1.so
#34 0x00002aaaaacd9182 in ?? () from /usr/lib64/liblua-5.1.so
#35 0x00002aaaaacd4b07 in lua_cpcall () from /usr/lib64/liblua-5.1.so
#36 0x000000000040145f in ?? ()
#37 0x00002aaaab807d5d in __libc_start_main (main=0x401420, argc=5, 
    ubp_av=0x7fffffffd768, init=<value optimized out>, 
    fini=<value optimized out>, rtld_fini=<value optimized out>, 
    stack_end=0x7fffffffd758) at libc-start.c:226
#38 0x0000000000401289 in ?? ()
#39 0x00007fffffffd758 in ?? ()
#40 0x000000000000001c in ?? ()
#41 0x0000000000000005 in ?? ()
#42 0x00007fffffffdc68 in ?? ()
#43 0x00007fffffffdc75 in ?? ()
#44 0x00007fffffffdcaf in ?? ()
#45 0x00007fffffffdcb4 in ?? ()
#46 0x00007fffffffdcb9 in ?? ()
#47 0x0000000000000000 in ?? ()
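
One possible reading of frames #8 to #17 is that a synchronous request (kvs_get) issued from inside a reactor callback parks every non-matching message on a bounded queue, and the send blocks once that queue is full, because the only consumer is the reactor that is still inside the callback. This is an assumption drawn from the backtrace, not a confirmed diagnosis. The Python sketch below only models that pattern; the names `matched_recv`, `run`, and `hwm`, and the use of a timeout in place of an indefinite block, are hypothetical and are not flux-core or ZeroMQ APIs.

```python
import queue
from collections import deque


def matched_recv(incoming, deferred, want, put_timeout=0.05):
    """Read messages until one equals `want`; park the rest on `deferred`.

    `deferred` stands in for a bounded queue that only the reactor drains.
    A real blocking send would wait forever; the timeout here just lets
    the sketch terminate and report the condition.
    """
    while incoming:
        msg = incoming.popleft()
        if msg == want:
            return msg
        # Unmatched message: re-queue it for the reactor to handle later.
        deferred.put(msg, timeout=put_timeout)
    raise LookupError("response never arrived")


def run(n_watch_events, hwm):
    """Simulate one synchronous request made from inside a callback."""
    deferred = queue.Queue(maxsize=hwm)
    # Watch notifications for other streams arrive ahead of the reply.
    incoming = deque([f"watch-{i}" for i in range(n_watch_events)] + ["reply"])
    try:
        matched_recv(incoming, deferred, "reply")
        return "ok"
    except queue.Full:
        # Nobody drains `deferred` while the callback is still running.
        return "would block"


if __name__ == "__main__":
    print(run(n_watch_events=4, hwm=8))
    print(run(n_watch_events=16, hwm=8))
```

In this model a small number of pending notifications is harmless, while a larger burst (for example, many task stdio streams updating at once) fills the queue before the reply is seen. That would be consistent with the hang showing up only at larger node counts, though the sketch does not prove it.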

grondo commented on May 28, 2024

BTW, you can see that the job completed normally via the kvs after the fact, e.g.:

(flux-1104669.0-0) grondo@hype301:~/git/flux-core.git/b/src/cmd$ ./flux wreckrun -N32 -t16 hostname
wreckrun: 0.001s: Sending LWJ request for 1 tasks (cmdline "hostname")
wreckrun: 0.005s: Registered jobid 2
wreckrun: Allocating 512 tasks across 32 available nodes..
wreckrun: tasks per node: node[0-31]: 16
wreckrun: 0.013s: Sending run event
wreckrun: 0.065s: State = starting
# Hang here....
^\Quit (core dumped)
(flux-1104669.0-0) grondo@hype301:~/git/flux-core.git/b/src/cmd$ ./flux kvs get lwj.2.state
complete

You can also check stdio from arbitrary tasks:

(flux-1104669.0-0) grondo@hype301:~/git/flux-core.git/b/src/cmd$ ./flux zio --attach lwj.2.0
flux-zio: process attached to lwj.2.0
flux-zio: disabling stdin: File exists
hype301
(flux-1104669.0-0) grondo@hype301:~/git/flux-core.git/b/src/cmd$ ./flux zio --attach lwj.2.511
flux-zio: process attached to lwj.2.511
hype353

After creating the lwj job entries and issuing a run event, flux-wreckrun isn't actually required to stay running for successful job completion.
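
Since the job state lives in the kvs, completion can be checked without wreckrun at all. The sketch below polls the `lwj.<jobid>.state` key using the same `flux kvs get` command shown above. The helper names, the `flux` executable path, the polling interval, and the poll limit are assumptions for illustration. It also assumes `complete` is the only state of interest; no other terminal states are handled.

```python
import subprocess
import time


def kvs_get(key, flux="flux"):
    """Run `flux kvs get KEY` and return its output with whitespace stripped.

    The `flux` path is an assumption; in the session above it was `./flux`.
    """
    result = subprocess.run(
        [flux, "kvs", "get", key],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_complete(jobid, get=kvs_get, interval=1.0, max_polls=60):
    """Poll lwj.<jobid>.state until it reads "complete" or polls run out."""
    key = f"lwj.{jobid}.state"
    for _ in range(max_polls):
        if get(key) == "complete":
            return True
        time.sleep(interval)
    return False
```

Passing a different `get` callable makes the helper easy to exercise without a running flux session.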

trws commented on May 28, 2024

This needs to be re-evaluated, and is quite out of date with respect to the current state of affairs. The specific bug here seems to have been fixed.
