Comments (5)
The OOM killer will terminate tasks with SIGKILL. If a task exited with a wait status indicating this signal, that could trigger a check of the `memory.events` file. Alternatively, if the shell is separately watching the `memory.events` file via `inotify`, a flag could be set which would trigger an error message that tasks terminated with SIGKILL may have been victims of the OOM killer.
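As a rough illustration of the first half of that idea, the wait-status test might look like the sketch below. It is written in Python for brevity rather than in the shell's own C, and `killed_by_sigkill` is a hypothetical helper name, not something in flux-core.

```python
import os
import signal


def killed_by_sigkill(wait_status):
    """Return True if a raw wait status (as returned by waitpid) shows the
    task was terminated by SIGKILL.

    Hypothetical helper for illustration only. SIGKILL alone does not prove
    an OOM kill; it only suggests that memory.events is worth checking.
    """
    return os.WIFSIGNALED(wait_status) and os.WTERMSIG(wait_status) == signal.SIGKILL
```

A caller would pass the status from `os.waitpid()` and consult `memory.events` only when this returns true.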
from flux-core.
I think the shell (assuming it's not the target of the OOM) can simply do the following:
- get the shell's cgroup root from `/proc/<pid>/cgroup` (it will contain `0::path`, where `path` is relative to `/sys/fs/cgroup`)
- read the `oom_kill` count from `memory.events`
- if nonzero, log something like `memory cgroup out of memory: process killed`

This could be checked in the shell `task.exit` callback, although if multiple ranks were killed, I'm not sure if we can reliably conclude that the exiting task was the target of the OOM, unless there were some way to retain the task's cgroup `memory.events` post exit...
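The three steps above could be sketched roughly as follows. This is Python for illustration, not the shell's actual C code, and it assumes a cgroup v2 hierarchy mounted at `/sys/fs/cgroup`. The function names (`cgroup_v2_path`, `oom_kill_count`, `check_oom`) are made up for this example.

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumed cgroup v2 mount point


def cgroup_v2_path(cgroup_text):
    """Return the path from the '0::path' line of /proc/<pid>/cgroup, or None."""
    for line in cgroup_text.splitlines():
        if line.startswith("0::"):
            return line[len("0::"):]
    return None


def oom_kill_count(events_text):
    """Return the oom_kill counter from memory.events contents (0 if absent)."""
    for line in events_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    return 0


def check_oom(pid="self"):
    """Return the oom_kill count for the cgroup of pid, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            path = cgroup_v2_path(f.read())
        if path is None:
            return None
        events = os.path.join(CGROUP_ROOT, path.lstrip("/"), "memory.events")
        with open(events) as f:
            return oom_kill_count(f.read())
    except OSError:
        # e.g. no memory controller, or memory.events missing in this cgroup
        return None
```

If `check_oom()` returns a nonzero count, the caller could log a message along the lines of `memory cgroup out of memory: process killed`. As noted above, the counter covers the whole cgroup, so it does not identify which task was killed.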
from flux-core.
I wonder if it would work to use something like `inotify` or otherwise to watch the `memory.events` file and report an OOM kill as it happens?

> unless there were some way to retain the task's cgroup memory.events post exit...

Is there a separate `memory.events` per pid?
from flux-core.
> Is there a separate `memory.events` per pid?

Yes. I was worried about racy access since the child process is immediately reaped by the libev child watcher, but maybe there is a non-racy way to consume all the events via inotify, or even just by holding the file open and reading it before closing it. Good thought! I'll try some experiments.
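One way to experiment with the "hold it open" idea is sketched below in Python. It polls an already-open file instead of using inotify, and `OomKillWatcher` is a hypothetical name. Whether a held-open `memory.events` stays readable after the task is reaped is exactly the open question here, so the sketch only shows the re-read-and-compare logic.

```python
class OomKillWatcher:
    """Keep a memory.events-style file open and report oom_kill increases.

    Hypothetical helper: polls by re-reading the open file rather than
    using inotify.
    """

    def __init__(self, events_path):
        self._f = open(events_path)
        self._last = self._read()

    def _read(self):
        # Rewind and re-read the whole file through the same descriptor.
        self._f.seek(0)
        for line in self._f.read().splitlines():
            fields = line.split()
            if len(fields) == 2 and fields[0] == "oom_kill":
                return int(fields[1])
        return 0

    def poll(self):
        """Return how many new OOM kills were recorded since the last poll."""
        current = self._read()
        delta = current - self._last
        self._last = current
        return delta

    def close(self):
        self._f.close()
```

Calling `poll()` from the task exit path and treating a positive result as "an OOM kill happened since the last check" would be one use. It still would not say which task was the victim.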
from flux-core.
> Is there a separate `memory.events` per pid?

Er. I misspoke. The task's `/proc/<pid>/cgroup` points to the same cgroup dir as the shell. (Which makes sense, I guess.)
from flux-core.