Comments (4)
Ok, this is unfortunate, but not entirely surprising. I will try to look for
a good debugging strategy for this.
from flux-core.
I was finally able to get enough nodes on hype to reproduce. This is definitely reproducible, and it appears that it is not the job that is hanging but flux-wreckrun itself. From a backtrace, it appears that wreckrun is stuck in a callback for a kzio object (stdio stream).
I have never seen this kind of lockup before, so I am wondering if recent changes in
the lua code are causing this. I also wonder if @garlick has any insight into how we might
be blocked here. This could be related to #83, but I don't want to rule out a bug in lua
bindings since they were most recently touched.
(gdb) where
#0 0x00002aaaab8c81b3 in __poll (fds=<value optimized out>,
nfds=<value optimized out>, timeout=<value optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00002aaaac891386 in ?? () from /usr/lib64/libzmq.so.3
#2 0x00002aaaac87c6fe in ?? () from /usr/lib64/libzmq.so.3
#3 0x00002aaaac891d04 in ?? () from /usr/lib64/libzmq.so.3
#4 0x00002aaaac892140 in ?? () from /usr/lib64/libzmq.so.3
#5 0x00002aaaac8a60fa in ?? () from /usr/lib64/libzmq.so.3
#6 0x00002aaaacad23c5 in zframe_send () from /usr/lib64/libczmq.so.1
#7 0x00002aaaacad8e9d in zmsg_send () from /usr/lib64/libczmq.so.1
#8 0x00002aaaabfd8b7a in dq_put (dq=<value optimized out>,
zmsg=0x7fffffffcd38, typemask=<value optimized out>)
at ../../../../src/modules/api/libapi.c:147
#9 0x00002aaaabfddc4e in flux_response_matched_recvmsg (h=0x61bb90,
match=<value optimized out>, nb=false)
at ../../../../src/common/libflux/request.c:96
#10 0x00002aaaabfddded in flux_rpc (h=0x61bb90, request=<value optimized out>,
fmt=<value optimized out>) at ../../../../src/common/libflux/request.c:125
#11 0x00002aaaabfd7610 in kvs_get (h=0x61bb90, key=<value optimized out>,
valp=0x7fffffffceb8) at ../../../../src/modules/kvs/libkvs.c:286
#12 0x00002aaaabdb1526 in getnext (kz=0x650af0)
at ../../../../src/common/libzio/kz.c:233
#13 0x00002aaaabdb16b5 in kz_get (kz=0x650af0, datap=0x7fffffffcf08)
at ../../../../src/common/libzio/kz.c:293
#14 0x00002aaaabdaad9e in iowatcher_kz_ready_cb (kz=0x650af0,
arg=<value optimized out>) at ../../../../src/bindings/lua/flux-lua.c:1060
#15 0x00002aaaabdb104b in kvswatch_cb (key=<value optimized out>,
dir=<value optimized out>, arg=<value optimized out>,
errnum=<value optimized out>) at ../../../../src/common/libzio/kz.c:358
#16 0x00002aaaabfd5ecb in dispatch_watch (h=<value optimized out>,
wp=0x650c90, key=0x6b38f0 "lwj.3.25.stdout", val=<value optimized out>)
at ../../../../src/modules/kvs/libkvs.c:532
#17 0x00002aaaabfd610e in watch_rep_cb (h=0x61bb90,
typemask=<value optimized out>, zmsg=0x7fffffffd028, arg=0x646010)
at ../../../../src/modules/kvs/libkvs.c:557
#18 0x00002aaaabfdae97 in flux_handle_event_msg (h=0x61bb90, typemask=2, zmsg=0x7fffffffd028) at ../../../../src/common/libflux/handle.c:330
#19 0x00002aaaabfd8e3d in dq_resp_cb (zl=<value optimized out>,
item=<value optimized out>, arg=<value optimized out>)
at ../../../../src/modules/api/libapi.c:88
#20 0x00002aaaacad6e54 in zloop_start () from /usr/lib64/libczmq.so.1
#21 0x00002aaaabfd89e8 in cmb_reactor_start (impl=<value optimized out>)
at ../../../../src/modules/api/libapi.c:235
#22 0x00002aaaabdad511 in l_flux_reactor_start (L=0x603010)
at ../../../../src/bindings/lua/flux-lua.c:1308
#23 0x00002aaaaacd95a1 in ?? () from /usr/lib64/liblua-5.1.so
#24 0x00002aaaaace4229 in ?? () from /usr/lib64/liblua-5.1.so
#25 0x00002aaaaacd9a6d in ?? () from /usr/lib64/liblua-5.1.so
#26 0x00002aaaaacd9107 in ?? () from /usr/lib64/liblua-5.1.so
#27 0x00002aaaaacd9182 in ?? () from /usr/lib64/liblua-5.1.so
#28 0x00002aaaaacd4b61 in lua_pcall () from /usr/lib64/liblua-5.1.so
#29 0x0000000000401526 in ?? ()
#30 0x0000000000401fe9 in ?? ()
#31 0x00002aaaaacd95a1 in ?? () from /usr/lib64/liblua-5.1.so
#32 0x00002aaaaacd9a24 in ?? () from /usr/lib64/liblua-5.1.so
#33 0x00002aaaaacd9107 in ?? () from /usr/lib64/liblua-5.1.so
#34 0x00002aaaaacd9182 in ?? () from /usr/lib64/liblua-5.1.so
#35 0x00002aaaaacd4b07 in lua_cpcall () from /usr/lib64/liblua-5.1.so
#36 0x000000000040145f in ?? ()
#37 0x00002aaaab807d5d in __libc_start_main (main=0x401420, argc=5,
ubp_av=0x7fffffffd768, init=<value optimized out>,
fini=<value optimized out>, rtld_fini=<value optimized out>,
stack_end=0x7fffffffd758) at libc-start.c:226
#38 0x0000000000401289 in ?? ()
#39 0x00007fffffffd758 in ?? ()
#40 0x000000000000001c in ?? ()
#41 0x0000000000000005 in ?? ()
#42 0x00007fffffffdc68 in ?? ()
#43 0x00007fffffffdc75 in ?? ()
#44 0x00007fffffffdcaf in ?? ()
#45 0x00007fffffffdcb4 in ?? ()
#46 0x00007fffffffdcb9 in ?? ()
#47 0x0000000000000000 in ?? ()
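One possible reading of frames #8 through #14, offered as an assumption rather than a confirmed diagnosis: the synchronous kvs_get issued from inside the watch callback waits for its matching response, and every non-matching message that arrives first is put back on a deferred queue. If that queue is bounded and is only drained by the same reactor thread, the put can block forever. The toy model below sketches that pattern in Python. The names matched_recv and simulate, the message fields, the topic labels, and the hwm parameter are all hypothetical and are not flux-core APIs; a put timeout stands in for the hang so the demo terminates.

```python
import queue


def matched_recv(incoming, deferred, matchtag, put_timeout=0.05):
    """Return the first message whose matchtag equals `matchtag`.

    Non-matching messages are parked on `deferred`, a bounded queue that
    stands in for a socket with a high-water mark (an assumption).
    """
    for msg in incoming:
        if msg["matchtag"] == matchtag:
            return msg
        # In a single-threaded reactor nobody drains `deferred` while we are
        # still inside the callback, so a full queue would block forever.
        # The timeout turns that hang into queue.Full for this demo.
        deferred.put(msg, timeout=put_timeout)
    return None


def simulate(hwm, unmatched):
    """Run one synchronous 'RPC' queued behind `unmatched` unrelated messages."""
    deferred = queue.Queue(maxsize=hwm)
    # Topic strings are illustrative labels only.
    incoming = [{"matchtag": 0, "topic": "kvs.watch"} for _ in range(unmatched)]
    incoming.append({"matchtag": 7, "topic": "kvs.get"})
    try:
        reply = matched_recv(incoming, deferred, matchtag=7)
    except queue.Full:
        return "would block"
    return reply["topic"]


if __name__ == "__main__":
    print(simulate(hwm=4, unmatched=3))  # the reply is reached
    print(simulate(hwm=2, unmatched=3))  # the deferred queue fills first
```

If this model is right, the hang would depend on how many unrelated messages (for example, watch responses from many tasks) arrive ahead of the reply, which would be consistent with needing a large node count to reproduce.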
BTW, you can see that the job completed normally via the kvs after the fact, e.g.:
(flux-1104669.0-0) grondo@hype301:~/git/flux-core.git/b/src/cmd$ ./flux wreckrun -N32 -t16 hostname
wreckrun: 0.001s: Sending LWJ request for 1 tasks (cmdline "hostname")
wreckrun: 0.005s: Registered jobid 2
wreckrun: Allocating 512 tasks across 32 available nodes..
wreckrun: tasks per node: node[0-31]: 16
wreckrun: 0.013s: Sending run event
wreckrun: 0.065s: State = starting
# Hang here....
^\Quit (core dumped)
(flux-1104669.0-0) grondo@hype301:~/git/flux-core.git/b/src/cmd$ ./flux kvs get lwj.2.state
complete
You can also check stdio from arbitrary tasks:
(flux-1104669.0-0) grondo@hype301:~/git/flux-core.git/b/src/cmd$ ./flux zio --attach lwj.2.0
flux-zio: process attached to lwj.2.0
flux-zio: disabling stdin: File exists
hype301
(flux-1104669.0-0) grondo@hype301:~/git/flux-core.git/b/src/cmd$ ./flux zio --attach lwj.2.511
flux-zio: process attached to lwj.2.511
hype353
After creating the lwj job entries and issuing a run event, flux-wreckrun isn't actually required to stay running for successful job completion.
This needs to be re-evaluated, as it is quite out of date with respect to the current state of affairs. The specific bug here seems to have been fixed.