Comments (7)
Well I guess a test failed and that uncovered a bug in the module, so that's a good outcome compared to before. But ideally we would log something, avoid the hang, and exit with a nonzero exit code. Maybe we could manage:
job-manager: module leaked a socket reference - was a future leaked?
broker: skipping 0MQ shutdown due to presumed module socket leak
from flux-core.
The zmq_ctx_term documentation states:
After interrupting all blocking calls, zmq_ctx_term() shall block until the following conditions are satisfied:
- All sockets open within context have been closed with zmq_close().
- For each socket within context, all messages sent by the application with zmq_send() have either been physically transferred to a network peer, or the socket's linger period set with the ZMQ_LINGER socket option has expired.
So perhaps there is an outstanding socket still open or a message stuck?
If I "comment out" the last test in t1000-dws-dependencies.t, then the hang does not occur. In fact, the following test alone reproduces the issue by itself:
#!/bin/sh
test_description='Test dws-jobtap plugin with fake coral2_dws.py'
. $(dirname $0)/sharness.sh
test_under_flux 2 job
flux setattr log-stderr-level 1
PLUGINPATH=${FLUX_BUILD_DIR}/src/job-manager/plugins/.libs
DWS_SCRIPT=${SHARNESS_TEST_SRCDIR}/dws-dependencies/coral2_dws.py
DEPENDENCY_NAME="dws-create"
PROLOG_NAME="dws-setup"
EPILOG_NAME="dws-epilog"
test_expect_success 'job-manager: load dws-jobtap plugin' '
flux jobtap load ${PLUGINPATH}/dws-jobtap.so
'
test_expect_success 'job-manager: dws jobtap plugin works when job hits exception during prolog' '
create_jobid=$(flux mini submit -t 8 --output=dws4.out --error=dws4.out \
flux python ${DWS_SCRIPT} --setup-hang) &&
flux job wait-event -vt 15 -p guest.exec.eventlog ${create_jobid} shell.start &&
jobid=$(flux mini submit --setattr=system.dw="foo" hostname) &&
flux job wait-event -vt 5 -m description=${PROLOG_NAME} \
${jobid} prolog-start &&
flux job cancel $jobid &&
flux job wait-event -vt 1 ${jobid} exception &&
flux job wait-event -vt 5 -m description=${PROLOG_NAME} -m status=1 \
${jobid} prolog-finish &&
flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
${jobid} epilog-start &&
flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
${jobid} epilog-finish &&
flux job wait-event -vt 5 ${jobid} clean &&
flux job wait-event -vt 5 ${create_jobid} clean
'
test_done
Note that this test seems to be simulating a "hang" in the dws server, if I'm reading it right. That is, the jobtap plugin sends an RPC to the dws service, but in this test the dws service doesn't respond. I wonder if that has something to do with the later hang in zmq_ctx_term().
The cause here seems to be a leaked future in the dws-jobtap plugin. When the service fails to respond, the callback wherein the future is destroyed is not invoked, so a future is leaked.
If the future is set to be destroyed along with the job, then this hang goes away:
diff --git a/src/job-manager/plugins/dws-jobtap.c b/src/job-manager/plugins/dws-jobtap.c
index fe186a8..2837a52 100644
--- a/src/job-manager/plugins/dws-jobtap.c
+++ b/src/job-manager/plugins/dws-jobtap.c
@@ -290,13 +290,13 @@ static void setup_rpc_cb (flux_future_t *f, void *arg)
prolog_active);
}
}
-
done:
- flux_future_destroy (f);
+ return;
}
static void fetch_R_callback (flux_future_t *f, void *arg)
{
+ flux_plugin_t *p = arg;
json_t *R;
flux_t *h = flux_future_get_flux (f);
struct create_arg_t *args = flux_future_aux_get (f, "flux::fetch_R");
@@ -355,7 +355,13 @@ static void fetch_R_callback (flux_future_t *f, void *arg)
0,
"Failed to send dws.setup RPC",
prolog_active);
+ goto done;
}
+ flux_jobtap_job_aux_set (p,
+ args->id,
+ NULL,
+ setup_rpc_fut,
+ (flux_free_f) flux_future_destroy);
done:
flux_future_destroy (f);
@@ -412,7 +418,7 @@ static int run_cb (flux_plugin_t *p,
prolog_active,
NULL)
< 0
- || flux_future_then (fetch_R_future, -1., fetch_R_callback, NULL) < 0) {
+ || flux_future_then (fetch_R_future, -1., fetch_R_callback, p) < 0) {
flux_future_destroy (fetch_R_future);
flux_log_error (h,
"dws-jobtap: "
I'll open an issue in flux-coral2 on the leaks. However, this hang behavior does seem to be new...
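For anyone following along, the ownership idea behind the patch can be sketched in plain C without flux-core. This is only a toy model under stated assumptions: `struct job`, `job_aux_set`, `job_destroy`, and `fake_future_destroy` are hypothetical stand-ins invented for illustration, not flux-core APIs. It shows an object being attached to a job together with a destructor, so the object is released when the job goes away even if the RPC response never arrives.

```c
#include <stdlib.h>

/* Hypothetical destructor type, playing the role of flux_free_f. */
typedef void (*free_fn) (void *);

/* Hypothetical aux item: some data plus the function that frees it. */
struct aux_item {
    void *data;
    free_fn destroy;
    struct aux_item *next;
};

/* Hypothetical job that owns a list of aux items. */
struct job {
    struct aux_item *aux;
};

/* Attach data to the job; the job now owns it. */
static int job_aux_set (struct job *job, void *data, free_fn destroy)
{
    struct aux_item *item = malloc (sizeof (*item));
    if (!item)
        return -1;
    item->data = data;
    item->destroy = destroy;
    item->next = job->aux;
    job->aux = item;
    return 0;
}

/* Destroying the job runs every registered destructor. */
static void job_destroy (struct job *job)
{
    while (job->aux) {
        struct aux_item *item = job->aux;
        job->aux = item->next;
        if (item->destroy)
            item->destroy (item->data);
        free (item);
    }
}

static int destroyed_count;

/* Stand-in for flux_future_destroy(): frees the object and counts the call. */
static void fake_future_destroy (void *arg)
{
    free (arg);
    destroyed_count++;
}

/* Simulate an RPC whose response never arrives. Returns the number of
 * destructor calls made by job teardown, or -1 on allocation failure. */
static int pending_rpc_cleanup_demo (void)
{
    struct job job = { .aux = NULL };
    void *pending = malloc (1);   /* stands in for the pending future */

    destroyed_count = 0;
    if (!pending || job_aux_set (&job, pending, fake_future_destroy) < 0) {
        free (pending);
        return -1;
    }
    /* The response callback is never invoked here, yet nothing leaks. */
    job_destroy (&job);
    return destroyed_count;
}
```

Calling `pending_rpc_cleanup_demo ()` from a small `main` should report one destructor call, which is the property the patch relies on: cleanup no longer depends on the response callback running.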
A leaked future causes a hang? I never would have been able to figure that out... thanks for digging into this @grondo !
Great job running this down. Hmm, I believe the future takes a reference on the flux_t handle, which for a broker module contains one end of a zeromq socket for the shmem:// connection. So when the module unloads, the zeromq socket might not be destroyed because of the future's reference. The broker just checks whether all the modules were cleanly shut down, on the presumption that if they went through their teardown, the flux_t refcounts would be zero.
I'll open an issue in flux-core.
Edit: oops! This is flux core and this is the issue. Got it!
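A rough way to picture the refcounting described above, as a self-contained C toy. The names `struct handle`, `struct future`, and `socket_survives_unload` are hypothetical and only model the presumed behavior; this is not how flux-core is actually implemented. The handle closes its socket only when its last reference is dropped, so a future that is never destroyed keeps the socket open past module teardown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a flux_t handle that owns one end of a socket. */
struct handle {
    int refcount;
    bool *socket_open;   /* flag owned by the caller ("the context") */
};

/* Toy stand-in for a future that holds a reference on its handle. */
struct future {
    struct handle *h;
};

static struct handle *handle_create (bool *socket_open)
{
    struct handle *h = malloc (sizeof (*h));
    if (!h)
        return NULL;
    h->refcount = 1;
    h->socket_open = socket_open;
    *socket_open = true;
    return h;
}

/* The socket is closed only when the last reference goes away. */
static void handle_decref (struct handle *h)
{
    if (h && --h->refcount == 0) {
        *h->socket_open = false;
        free (h);
    }
}

static struct future *future_create (struct handle *h)
{
    struct future *f = malloc (sizeof (*f));
    if (!f)
        return NULL;
    f->h = h;
    h->refcount++;       /* the future pins the handle */
    return f;
}

static void future_destroy (struct future *f)
{
    if (f) {
        handle_decref (f->h);
        free (f);
    }
}

/* Returns true if the socket is still open after module teardown. */
static bool socket_survives_unload (bool leak_future)
{
    bool socket_open = false;
    struct handle *h = handle_create (&socket_open);
    assert (h);
    struct future *f = future_create (h);
    assert (f);

    if (!leak_future)
        future_destroy (f);
    handle_decref (h);   /* module teardown drops its own reference */

    bool still_open = socket_open;
    if (leak_future)
        future_destroy (f);   /* tidy up so the sketch itself does not leak */
    return still_open;
}
```

In this model, `socket_survives_unload (true)` leaves the socket open after teardown, which is the state in which zmq_ctx_term() would be expected to block, while `socket_survives_unload (false)` does not.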
Just happened to notice the commit that added the reference on the handle: 01b383b
Thought I'd x-ref it here in case someone (like me) is tempted to muck around with this refcounting.