Giter VIP home page Giter VIP logo

Comments (7)

garlick avatar garlick commented on June 24, 2024 1

Well I guess a test failed and that uncovered a bug in the module, so that's a good outcome compared to before. But ideally we would log something, avoid the hang, and exit with a nonzero exit code. Maybe we could manage:

job-manager: module leaked a socket reference - was a future leaked?
broker: skipping 0MQ shutdown due to presumed module socket leak

from flux-core.

grondo avatar grondo commented on June 24, 2024

The zmq_ctx_term documentation states:

After interrupting all blocking calls, zmq_ctx_term() shall block until the following conditions are satisfied:

  • All sockets open within context have been closed with zmq_close().
  • For each socket within context, all messages sent by the application with zmq_send() have either been physically transferred to a network peer, or the socket's linger period set with the ZMQ_LINGER socket option has expired.

So perhaps there is an outstanding socket still open or a message stuck?

from flux-core.

grondo avatar grondo commented on June 24, 2024

If I "comment out" the last test in t1000-dws-dependencies.t, then the hang does not occur. In fact, the following test alone reproduces the issue by itself:

#!/bin/sh

test_description='Test dws-jobtap plugin with fake coral2_dws.py'

. $(dirname $0)/sharness.sh

test_under_flux 2 job

flux setattr log-stderr-level 1

PLUGINPATH=${FLUX_BUILD_DIR}/src/job-manager/plugins/.libs
DWS_SCRIPT=${SHARNESS_TEST_SRCDIR}/dws-dependencies/coral2_dws.py
DEPENDENCY_NAME="dws-create"
PROLOG_NAME="dws-setup"
EPILOG_NAME="dws-epilog"

test_expect_success 'job-manager: load dws-jobtap plugin' '
	flux jobtap load ${PLUGINPATH}/dws-jobtap.so
'

test_expect_success 'job-manager: dws jobtap plugin works when job hits exception during prolog' '
	create_jobid=$(flux mini submit -t 8 --output=dws4.out --error=dws4.out \
		flux python ${DWS_SCRIPT} --setup-hang) &&
	flux job wait-event -vt 15 -p guest.exec.eventlog ${create_jobid} shell.start &&
	jobid=$(flux mini submit --setattr=system.dw="foo" hostname) &&
	flux job wait-event -vt 5 -m description=${PROLOG_NAME} \
		${jobid} prolog-start &&
	flux job cancel $jobid
	flux job wait-event -vt 1 ${jobid} exception &&
	flux job wait-event -vt 5 -m description=${PROLOG_NAME} -m status=1 \
		${jobid} prolog-finish &&
	flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
		${jobid} epilog-start &&
	flux job wait-event -vt 5 -m description=${EPILOG_NAME} \
		${jobid} epilog-finish &&
	flux job wait-event -vt 5 ${jobid} clean &&
	flux job wait-event -vt 5 ${create_jobid} clean
'
test_done

Note that this test seems to be simulating a "hang" in the dws server if I'm reading it right. That is, the jobtap plugin sends an RPC to the dws service, but in this test the dws service doesn't respond. I wonder if that has something to do with the later hang in zmq_ctx_term().

from flux-core.

grondo avatar grondo commented on June 24, 2024

The cause here seems to be a leaked future in the dws-jobtap plugin. When the service fails to respond, the callback wherein the future is destroyed is not invoked, so a future is leaked.

If the future is set to be destroyed along with the job, then this hang goes away:

diff --git a/src/job-manager/plugins/dws-jobtap.c b/src/job-manager/plugins/dws-jobtap.c
index fe186a8..2837a52 100644
--- a/src/job-manager/plugins/dws-jobtap.c
+++ b/src/job-manager/plugins/dws-jobtap.c
@@ -290,13 +290,13 @@ static void setup_rpc_cb (flux_future_t *f, void *arg)
                                prolog_active);
         }
     }
-
 done:
-    flux_future_destroy (f);
+    return;
 }
 
 static void fetch_R_callback (flux_future_t *f, void *arg)
 {
+    flux_plugin_t *p = arg;
     json_t *R;
     flux_t *h = flux_future_get_flux (f);
     struct create_arg_t *args = flux_future_aux_get (f, "flux::fetch_R");
@@ -355,7 +355,13 @@ static void fetch_R_callback (flux_future_t *f, void *arg)
                            0,
                            "Failed to send dws.setup RPC",
                            prolog_active);
+        goto done;
     }
+    flux_jobtap_job_aux_set (p,
+                             args->id,
+                             NULL,
+                             setup_rpc_fut,
+                             (flux_free_f) flux_future_destroy);
 
 done:
     flux_future_destroy (f);
@@ -412,7 +418,7 @@ static int run_cb (flux_plugin_t *p,
                                     prolog_active,
                                     NULL)
                    < 0
-            || flux_future_then (fetch_R_future, -1., fetch_R_callback, NULL) < 0) {
+            || flux_future_then (fetch_R_future, -1., fetch_R_callback, p) < 0) {
             flux_future_destroy (fetch_R_future);
             flux_log_error (h,
                             "dws-jobtap: "

I'll open an issue in flux-coral2 on the leaks. However, this hang behavior does seem to be new...

from flux-core.

jameshcorbett avatar jameshcorbett commented on June 24, 2024

A leaked future causes a hang? I never would have been able to figure that out... thanks for digging into this @grondo !

from flux-core.

garlick avatar garlick commented on June 24, 2024

Great job running this down. Hmm, I belive the future takes a reference on the flux_t handle which for a broker module contains one end of a zeromq socket for the shmem:// connection. So when the module unloads, the zeromq socket might not be getting destroyed due to the future's reference. The broker just checks if all the modules were cleanly shut down on the presumption that if they went through their teardown, the flux_t refcounts would be zero.

I'll open an issue in flux-core.

Edit: oops! This is flux core and this is the issue. Got it!

from flux-core.

garlick avatar garlick commented on June 24, 2024

Just happened to notice the commit that added the reference on the handle: 01b383b

Thought I'd x-ref it here in case someone (like me) is tempted to muck around with this refcounting.

from flux-core.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.