Comments (2)
The epilog is currently run via the `flux-perilog-run.py` helper command.
One solution would be to skip attempting to run `flux exec -r{rank}` on nodes that are offline at the time the script is started, since `flux exec` will definitely fail there, and instead preemptively drain the node with a specific error such as `node was offline for job ID epilog` or similar.
I assume that in our environment the flux broker is not started on a node until there is manual intervention, so there shouldn't be a race in which a crashed node comes back in time to run the epilog successfully.
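The proposal above can be sketched as a small planning step. This is only an illustration, not the actual `flux-perilog-run.py` logic: the function name `plan_epilog`, its parameters, and the way the set of online ranks is obtained are all assumptions, and the sketch only builds the command lines rather than running them.

```python
from typing import Dict, Iterable, List, Set


def plan_epilog(
    job_ranks: Iterable[int],
    online_ranks: Set[int],
    jobid: str,
    epilog_cmd: List[str],
) -> Dict[str, List[List[str]]]:
    """Split a job's ranks into 'run the epilog' and 'drain now' actions.

    Hypothetical helper: online_ranks is assumed to be known when the
    script starts; how it is queried is outside this sketch.
    """
    plan: Dict[str, List[List[str]]] = {"exec": [], "drain": []}
    for rank in sorted(job_ranks):
        if rank in online_ranks:
            # Node is up: run the epilog there as before.
            plan["exec"].append(["flux", "exec", f"-r{rank}", *epilog_cmd])
        else:
            # Node is offline: flux exec would fail, so drain it
            # preemptively with a specific reason instead.
            reason = f"node was offline for job {jobid} epilog"
            plan["drain"].append(["flux", "resource", "drain", str(rank), reason])
    return plan
```

Each entry in the returned plan is an argument list that could be handed to `subprocess.run`. Keeping the decision separate from execution makes the offline case easy to check without a running Flux instance.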
from flux-core.
Correct. We still require a manual `nodeup` to start the flux broker (and other stuff) after a node reboot. It takes over ten minutes for a node to reboot back to a point where `nodeup` can be run. I'm assuming that is long enough for the epilog-running process to complete and drain these down nodes. We are also very unlikely to `nodeup` nodes that don't have a drain message.
from flux-core.