Comments (11)

grondo commented on September 9, 2024

Oh, flux-perilog-run knows if a command timed out because the process's canceled attribute is set. Would you be able to test whether this fixes the issue?

diff --git a/src/cmd/flux-perilog-run.py b/src/cmd/flux-perilog-run.py
index 4753257d1..821e7cd08 100755
--- a/src/cmd/flux-perilog-run.py
+++ b/src/cmd/flux-perilog-run.py
@@ -175,8 +175,12 @@ async def run_per_rank(name, jobid, args):
         if proc.canceled:
             timeout_ids.set(rank)
             rc = 128 + signal.SIGTERM
-        elif rc != 0:
+        elif rc > 0:
+            # Process failed with nonzero exit code
             fail_ids.set(rank)
+        else:
+            # Process died by signal. Likely prolog was canceled (no error)
+            pass
         if rc > returncode:
             returncode = rc
 

from flux-core.

garlick commented on September 9, 2024

s/&&/and/ and that works!

 garlick@picl0:~$ flux run hostname
^Cflux-job: one more ctrl-C within 2s to cancel or ctrl-Z to detach    00:00:03
^C3.976s: job.exception type=cancel severity=0 interrupted by ctrl-C
4.136s: job.exception type=prolog severity=0 prolog killed by signal 15 (timeout or job canceled)

 garlick@picl0:~$ flux resource drain
TIME         STATE    REASON                         NODELIST


grondo commented on September 9, 2024

Yeah, I'm guessing this was introduced by the prolog timeout feature 😞

Have to think about how to better handle this.


garlick commented on September 9, 2024

The prolog seems to be immediately exiting with exit code 1 when I apply this diff (no ctrl-C needed), but the node does not get drained.

$ flux job info $(flux job last) eventlog
{"timestamp":1707882327.8191476,"name":"submit","context":{"userid":5588,"urgency":16,"flags":0,"version":1}}
{"timestamp":1707882327.8324931,"name":"validate"}
{"timestamp":1707882327.8444905,"name":"depend"}
{"timestamp":1707882327.8445888,"name":"priority","context":{"priority":16}}
{"timestamp":1707882327.8507416,"name":"alloc"}
{"timestamp":1707882327.8511033,"name":"prolog-start","context":{"description":"job-manager.prolog"}}
{"timestamp":1707882327.8511505,"name":"prolog-start","context":{"description":"cray-pals-port-distributor"}}
{"timestamp":1707882327.8646581,"name":"prolog-finish","context":{"description":"cray-pals-port-distributor","status":0}}
{"timestamp":1707882327.9927156,"name":"exception","context":{"type":"prolog","severity":0,"note":"prolog exited with exit code=1","userid":500}}
{"timestamp":1707882327.992943,"name":"prolog-finish","context":{"description":"job-manager.prolog","status":256}}
{"timestamp":1707882327.9939311,"name":"free"}
{"timestamp":1707882327.9940138,"name":"clean"}

prolog script looks like

#!/bin/bash

#trap -- '' SIGTERM
sleep 5
exit 0

and config

[job-manager.prolog]
command = [
   "flux", "perilog-run", "prolog",
   "--sdexec",
   "--timeout=10s",
   "--with-imp",
   "-e", "prolog"
]


grondo commented on September 9, 2024

Oops, maybe a Python error. I should have tried it before posting the diff.


grondo commented on September 9, 2024

Anything in flux dmesg? I ran with similar changes and didn't get a failure.


garlick commented on September 9, 2024

My apologies! I didn't think to look there and clearly I messed it up:

2024-02-14T03:49:40.108209Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr: Traceback (most recent call last):
2024-02-14T03:49:40.108268Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr:   File "/usr/lib/aarch64-linux-gnu/flux/cmd/py-runner.py", line 94, in <module>
2024-02-14T03:49:40.108720Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr:     runpy.run_path(sys.argv[0], run_name="__main__")
2024-02-14T03:49:40.108752Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr:   File "/usr/lib/python3.9/runpy.py", line 267, in run_path
2024-02-14T03:49:40.108773Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr:     code, fname = _get_code_from_file(run_name, path_name)
2024-02-14T03:49:40.108793Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr:   File "/usr/lib/python3.9/runpy.py", line 242, in _get_code_from_file
2024-02-14T03:49:40.110324Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr:     code = compile(f.read(), fname, 'exec')
2024-02-14T03:49:40.110417Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr:   File "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-perilog-run.py", line 181
2024-02-14T03:49:40.110438Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr:     else:
2024-02-14T03:49:40.110458Z job-manager.err[0]: ƒkeg2JomVTD: prolog: stderr: TabError: inconsistent use of tabs and spaces in indentation

will fix and retest.
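A hand-applied patch can be checked for this kind of error before it is installed. The following is a minimal sketch, assuming only the Python standard library; `check_syntax` and the sample sources are hypothetical names for illustration and are not part of flux-core. It compiles the source without running it and reports the class of any syntax error.

```python
def check_syntax(source, filename="<patched-script>"):
    """Return None if the source compiles, else the name of the error class.

    Hypothetical helper: compile() parses the code without executing it,
    so a TabError or other SyntaxError surfaces before deployment.
    """
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:  # TabError and IndentationError are subclasses
        return type(exc).__name__
    return None


# One body line indented with a tab, the next with eight spaces.
MIXED = "if True:\n\tx = 1\n        y = 2\n"
# The same code indented consistently with spaces.
CLEAN = "if True:\n    x = 1\n    y = 2\n"

if __name__ == "__main__":
    print("mixed:", check_syntax(MIXED))
    print("clean:", check_syntax(CLEAN))
```

Running `python3 -m py_compile` on the patched file performs a similar check from the command line.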


garlick commented on September 9, 2024

OK, now it doesn't fail immediately, but I'm still able to abort the prolog and cause the node to drain:

$ flux run hostname
^Cflux-job: one more ctrl-C within 2s to cancel or ctrl-Z to detach    00:00:02
^C2.706s: job.exception type=cancel severity=0 interrupted by ctrl-C
2.819s: job.exception type=prolog severity=0 prolog killed by signal 15 (timeout or job canceled)

 garlick@picl0:/etc/flux/system/conf.d$ flux resource drain
TIME         STATE    REASON                         NODELIST
Feb14 05:18  drained  prolog failed for jobid ƒkj9R+ picl0

Would the flux-perilog-run.py script itself be getting the signal?


grondo commented on September 9, 2024

The first thing flux-perilog-run.py does is block SIGTERM and SIGINT:

def main():
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM, signal.SIGINT})

So that isn't supposed to happen. Also, I think the perilog jobtap plugin doesn't drain ranks itself but leaves that up to the prolog script. That the node was drained indicates that flux-perilog-run wasn't killed before it was able to drain nodes.
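To illustrate what blocking does, here is a minimal standalone sketch (not code from flux-perilog-run.py; the function name `blocked_sigterm_stays_pending` is hypothetical). It assumes a single-threaded process on a POSIX system. A blocked SIGTERM is held as pending instead of terminating the process, which is why the script itself should survive a cancel.

```python
import os
import signal


def blocked_sigterm_stays_pending():
    """Block SIGTERM/SIGINT, signal ourselves, and show we survive."""
    # Same call as in main() above; it returns the previous signal mask.
    old_mask = signal.pthread_sigmask(
        signal.SIG_BLOCK, {signal.SIGTERM, signal.SIGINT}
    )
    try:
        # With SIGTERM blocked, this does not terminate the process.
        os.kill(os.getpid(), signal.SIGTERM)
        was_pending = signal.SIGTERM in signal.sigpending()
        # Consume the pending signal so it is not delivered on unblock.
        received = signal.sigwait({signal.SIGTERM})
    finally:
        # Restore the original mask (done only to keep this demo tidy).
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return was_pending, received


if __name__ == "__main__":
    print(blocked_sigterm_stays_pending())
```

Blocking only protects the process that sets the mask; the prolog commands run elsewhere and can still be signaled.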


grondo commented on September 9, 2024

I can reproduce this locally, so I'll run down the root cause.


grondo commented on September 9, 2024

Ah, a shell script signaled by SIGTERM may return an exit code of 143 (128 + 15) instead of the standard wait status of 15, since the foreground process will likely be signaled and not the shell itself. The right fix is something like this:

diff --git a/src/cmd/flux-perilog-run.py b/src/cmd/flux-perilog-run.py
index 4753257d1..30fe75140 100755
--- a/src/cmd/flux-perilog-run.py
+++ b/src/cmd/flux-perilog-run.py
@@ -175,8 +175,14 @@ async def run_per_rank(name, jobid, args):
         if proc.canceled:
             timeout_ids.set(rank)
             rc = 128 + signal.SIGTERM
-        elif rc != 0:
+        elif rc > 0 && rc <= 128:
+            # Process failed with nonzero exit code, not shell reporting
+            # killed by signal (128+n)
             fail_ids.set(rank)
+        else:
+            # Process died by signal. Likely prolog was canceled (do not add
+            # to the set of ranks to be drained)
+            pass
         if rc > returncode:
             returncode = rc
 

