Comments (9)
A simple test worked for me, but this is the simplest case. Any hints on what you might have been doing differently?
$ squeue -u grondo
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1521171 pdebug interact grondo R 0:44 1 quartz3
$ flux uri slurm:1521171
ssh://quartz3/var/tmp/grondo/flux-ptC8HX/local-0
$ flux proxy slurm:1521171
f(s=1,d=0) $ flux resource list
STATE NNODES NCORES NGPUS NODELIST
free 1 36 0 quartz3
allocated 0 0 0
down 0 0 0
Also this reminds me that this would be a good testcase to add to our extra tests for the gitlab CI. (cc @wihobbs)
from flux-core.
I think it would work if you did srun flux start because then Flux would actually be running under Slurm.
Should test this one though...
from flux-core.
Good idea @grondo. @garlick I tried the same thing as Mark but varied the number of nodes, and put some nested instances in there, and tried using flux uri slurm:jobid both in and out of the session. This is reminding me of when the LSF resolver broke because I got an allocation on lassen9-11 and the command in LSF sorted lassen10 as the rank 0 node due to the 1 being a lower number than 9... could be a really weird one off case like that. In any event, good idea to add that to our testing on real clusters.
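The lassen9-11 ordering problem described above can be reproduced in a few lines. This is only a sketch of the general pitfall (plain string sorting versus a numeric-aware sort), not the LSF resolver's actual code; the helper name `natural_key` is made up for illustration.

```python
import re

def natural_key(hostname):
    """Split a hostname into text and integer chunks so that numeric
    suffixes compare as numbers rather than character by character."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", hostname)]

hosts = ["lassen9", "lassen10", "lassen11"]

# Plain string sort compares "1" with "9", so lassen10 comes first.
lexicographic = sorted(hosts)

# Numeric-aware sort keeps lassen9 first.
natural = sorted(hosts, key=natural_key)

print(lexicographic)  # ['lassen10', 'lassen11', 'lassen9']
print(natural)        # ['lassen9', 'lassen10', 'lassen11']
```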
from flux-core.
It worked for me just now. I was probably doing something dumb before! Sorry for the noise.
from flux-core.
Ah this is what I was doing. But perhaps this isn't intended to work:
[garlick@quartz386:~]$ sbatch -p pbatch -N2 --job-name flux --wrap "flux start sleep 120"
Submitted batch job 1533848
[garlick@quartz386:~]$ squeue|grep 1533848
1533848 pbatch flux garlick R 0:17 2 quartz[161-162]
[garlick@quartz386:~]$ flux uri slurm:1533848
flux-uri: ERROR: Unable to resolve Flux URI for Slurm job 1533848
from flux-core.
Reopening since it would be nice if this worked for slurm batch jobs. I think the only problem is that the batch script is the first child of the slurmstepd and we need to look one level deeper if the first LOCALID=0 process does not work out. Perhaps we could just try the pids in sorted order?
On the first node of a job submitted like above:
[garlick@quartz161:~]$ scontrol listpids
PID JOBID STEPID LOCALID GLOBALID
3188397 1533855 batch 0 0
3188401 1533855 batch - -
3188489 1533855 batch - -
-1 1533855 extern 0 0
3188390 1533855 extern - -
and those pids are:
UID PID PPID C STIME TTY TIME CMD
garlick 3188397 3188392 0 05:41 ? 00:00:00 /bin/sh /var/spool/slurmd/job1533855/slurm_script
garlick 3188401 3188397 0 05:41 ? 00:00:00 /usr/libexec/flux/cmd/flux-broker sleep 360
garlick 3188489 3188401 0 05:41 ? 00:00:00 sleep 360
root 3188390 3188385 0 05:41 ? 00:00:00 sleep 100000000
├─slurmstepd─┬─slurm_script───flux-broker-0─┬─sleep
│ │ └─17*[{flux-broker-0}]
│ └─2*[{slurmstepd}]
Confirmed that flux uri pid:3188401 works.
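A rough sketch of the "try the pids in sorted order" idea, using the scontrol listpids output shown above as sample input. This is not the resolver's implementation: `candidate_pids` and `first_resolvable` are hypothetical helpers, and the `resolve` callback stands in for whatever per-PID lookup would be used (for example, the equivalent of flux uri pid:PID).

```python
SAMPLE = """\
PID JOBID STEPID LOCALID GLOBALID
3188397 1533855 batch 0 0
3188401 1533855 batch - -
3188489 1533855 batch - -
-1 1533855 extern 0 0
3188390 1533855 extern - -
"""

def candidate_pids(listpids_output, jobid):
    """Return the job's PIDs in ascending order, skipping the -1 placeholder."""
    pids = []
    for line in listpids_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2 or fields[1] != str(jobid):
            continue
        pid = int(fields[0])
        if pid > 0:
            pids.append(pid)
    return sorted(pids)

def first_resolvable(pids, resolve):
    """Try each PID in order; return (pid, uri) for the first one that
    resolves, or None. `resolve` is a hypothetical callback that returns
    a URI string or None."""
    for pid in pids:
        uri = resolve(pid)
        if uri is not None:
            return pid, uri
    return None

candidates = candidate_pids(SAMPLE, 1533855)
print(candidates)  # [3188390, 3188397, 3188401, 3188489]
```

With a stand-in resolver that only succeeds for the broker PID, `first_resolvable(candidates, resolve)` would skip the extern sleep and the batch script and stop at 3188401.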
from flux-core.
The slurm resolver doesn't walk the process tree of slurmstepd, but uses scontrol listpids to list the pids for the job (and I think for all job steps for a batch/alloc job). In this case flux start is not run under srun, so the PID of the broker won't be available.
I think it would work if you did srun flux start because then Flux would actually be running under Slurm.
Not to say we couldn't fix this particular case, but searching for the first flux-broker that happens to be running under a Slurm batch job might give surprising results. For example, I could get a random test instance returned if running make -j 16 check in flux-core under a batch job...
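To make the ambiguity concern concrete, here is a small sketch that walks a process table down from the batch script and collects every flux-broker it finds. It is illustrative only and is not how the slurm resolver works today. The first four rows come from the ps output earlier in the thread; the extra row with PID 3190000 is a made-up stand-in for a test broker started by something like make check.

```python
from collections import defaultdict

# (pid, ppid, command) rows from the ps output earlier in the thread
PS_ROWS = [
    (3188397, 3188392, "/bin/sh /var/spool/slurmd/job1533855/slurm_script"),
    (3188401, 3188397, "/usr/libexec/flux/cmd/flux-broker sleep 360"),
    (3188489, 3188401, "sleep 360"),
    (3188390, 3188385, "sleep 100000000"),
]

def broker_descendants(rows, root_pid):
    """Return sorted PIDs of all flux-broker processes at or below root_pid."""
    children = defaultdict(list)
    command_of = {}
    for pid, ppid, command in rows:
        children[ppid].append(pid)
        command_of[pid] = command
    found = []
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        if "flux-broker" in command_of.get(pid, ""):
            found.append(pid)
        stack.extend(children.get(pid, []))
    return sorted(found)

# One broker under the batch script: an unambiguous answer.
single = broker_descendants(PS_ROWS, 3188397)

# Hypothetical second broker (e.g. a test instance) under the same script:
# now there is no obvious "right" one to return.
with_test_broker = PS_ROWS + [
    (3190000, 3188397, "/usr/libexec/flux/cmd/flux-broker"),  # made-up row
]
ambiguous = broker_descendants(with_test_broker, 3188397)

print(single)     # [3188401]
print(ambiguous)  # [3188401, 3190000]
```

A resolver built this way would probably want to report an error when more than one broker is found rather than pick one arbitrarily.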
from flux-core.
Just another thought, we'd have a similar issue with flux if you run flux batch -N1 --wrap flux start. Since the flux start is a singleton not run under flux run, you couldn't get the URI with flux uri jobid1/jobid2...
from flux-core.
Oh duh! My example was not doing what I thought it was - I was just starting a size=1 flux instance on the first node of the batch allocation, wasn't I? Yeah, this works:
[garlick@quartz386:~]$ sbatch -p pdebug -N2 --job-name flux --wrap "srun flux start sleep 360"
Submitted batch job 1533904
[garlick@quartz386:~]$ flux uri slurm:1533904
ssh://quartz3/var/tmp/garlick/flux-H2g2KI/local-0
[garlick@quartz386:~]$ flux proxy slurm:1533904
[garlick@quartz386:~]$ flux resource list
STATE NNODES NCORES NGPUS NODELIST
free 2 72 0 quartz[3-4]
allocated 0 0 0
down 0 0 0
Sorry for the noise!
from flux-core.