
Comments (9)

grondo commented on June 24, 2024

A simple test worked for me, but this is the simplest case. Any hints on what you might have been doing differently?

$ squeue -u grondo
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1521171    pdebug interact   grondo  R       0:44      1 quartz3
$ flux uri slurm:1521171
ssh://quartz3/var/tmp/grondo/flux-ptC8HX/local-0
$ flux proxy slurm:1521171
f(s=1,d=0) $ flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
      free      1       36        0 quartz3
 allocated      0        0        0 
      down      0        0        0 

Also, this reminds me that this would be a good test case to add to our extra tests for the GitLab CI. (cc @wihobbs)

from flux-core.

grondo commented on June 24, 2024

I think it would work if you did srun flux start because then Flux would actually be running under Slurm.

Should test this one though...

wihobbs commented on June 24, 2024

Good idea @grondo. @garlick I tried the same thing as Mark but varied the number of nodes, put some nested instances in there, and tried using flux uri slurm:jobid both inside and outside the session. This reminds me of when the LSF resolver broke: I got an allocation on lassen9-11, and the command in LSF sorted lassen10 as the rank 0 node because the 1 sorts lower than the 9. It could be a really weird one-off case like that. In any event, good idea to add that to our testing on real clusters.
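To make the lassen9-11 ordering problem concrete, here is a minimal Python sketch. It is not the LSF resolver's code; `natural_key` is a hypothetical helper written only for this illustration, and it assumes hostnames that share the same prefix and differ in a numeric suffix.

```python
import re


def natural_key(hostname):
    """Split a hostname into text and integer chunks so that numeric
    parts compare as numbers rather than as strings (hypothetical helper)."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", hostname)]


hosts = ["lassen9", "lassen10", "lassen11"]

# Plain string sort compares character by character, so "1" < "9"
# puts lassen10 ahead of lassen9.
lexicographic = sorted(hosts)

# Sorting on the split key compares 9, 10, 11 as integers.
natural = sorted(hosts, key=natural_key)

print(lexicographic)
print(natural)
```

With a plain string sort the first entry is lassen10, which matches the wrong rank 0 choice described above; the numeric-aware key keeps lassen9 first.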

garlick commented on June 24, 2024

It worked for me just now. I was probably doing something dumb before! Sorry for the noise.

garlick commented on June 24, 2024

Ah, this is what I was doing. But perhaps this isn't intended to work:

[garlick@quartz386:~]$ sbatch -p pbatch -N2 --job-name flux --wrap "flux start sleep 120"
Submitted batch job 1533848
[garlick@quartz386:~]$ squeue|grep 1533848
           1533848    pbatch     flux  garlick  R       0:17      2 quartz[161-162]
[garlick@quartz386:~]$ flux uri slurm:1533848
flux-uri: ERROR: Unable to resolve Flux URI for Slurm job 1533848

garlick commented on June 24, 2024

Reopening since it would be nice if this worked for Slurm batch jobs. I think the only problem is that the batch script is the first child of slurmstepd, and we need to look one level deeper if the first LOCALID=0 process does not work out. Perhaps we could just try the pids in sorted order?

On the first node of a job submitted like above:

[garlick@quartz161:~]$ scontrol listpids
PID      JOBID    STEPID   LOCALID GLOBALID
3188397  1533855  batch    0       0
3188401  1533855  batch    -       -
3188489  1533855  batch    -       -
-1       1533855  extern   0       0
3188390  1533855  extern   -       -

and those pids are:

UID          PID    PPID  C STIME TTY          TIME CMD
garlick  3188397 3188392  0 05:41 ?        00:00:00 /bin/sh /var/spool/slurmd/job1533855/slurm_script
garlick  3188401 3188397  0 05:41 ?        00:00:00 /usr/libexec/flux/cmd/flux-broker sleep 360
garlick  3188489 3188401  0 05:41 ?        00:00:00 sleep 360
root     3188390 3188385  0 05:41 ?        00:00:00 sleep 100000000

        ├─slurmstepd─┬─slurm_script───flux-broker-0─┬─sleep
        │            │                              └─17*[{flux-broker-0}]
        │            └─2*[{slurmstepd}]

Confirmed that flux uri pid:3188401 works.
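A rough Python sketch of the "try the pids in sorted order" idea, using the scontrol listpids output above as sample input. This is not how the flux-core Slurm resolver is implemented; `candidate_pids`, `first_resolvable`, and the `resolve` callable are hypothetical names for illustration, and `resolve` stands in for whatever per-PID lookup (something like flux uri pid:PID) would actually be used.

```python
def candidate_pids(listpids_output):
    """Return the positive PIDs found in `scontrol listpids` text,
    in ascending order. The header row and the -1 placeholder are skipped."""
    pids = set()
    for line in listpids_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            pid = int(fields[0])
        except ValueError:
            continue  # header row: first column is the literal "PID"
        if pid > 0:
            pids.add(pid)
    return sorted(pids)


def first_resolvable(pids, resolve):
    """Call the caller-supplied `resolve(pid)` on each PID in order and
    return the first result that is not None, or None if all fail."""
    for pid in pids:
        uri = resolve(pid)
        if uri is not None:
            return uri
    return None


# Sample input copied from the listing above.
SAMPLE = """\
PID      JOBID    STEPID   LOCALID GLOBALID
3188397  1533855  batch    0       0
3188401  1533855  batch    -       -
3188489  1533855  batch    -       -
-1       1533855  extern   0       0
3188390  1533855  extern   -       -
"""

print(candidate_pids(SAMPLE))
```

Note that plain sorting puts the extern step's pid (3188390) first, so a lookup like this would have to tolerate failures on pids that are not brokers before it reaches 3188401.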

grondo commented on June 24, 2024

The Slurm resolver doesn't walk the process tree of slurmstepd, but uses scontrol listpids to list the pids for the job (and I think for all job steps for a batch/alloc job). In this case flux start is not run under srun, so the PID of the broker won't be available.

I think it would work if you did srun flux start because then Flux would actually be running under Slurm.

Not to say we couldn't fix this particular case, but searching for the first flux-broker that happens to be running under a Slurm batch job might give surprising results. For example, I could get a random test instance returned if running make -j 16 check in flux-core under a batch job...

grondo commented on June 24, 2024

Just another thought: we'd have a similar issue with Flux if you run flux batch -N1 --wrap flux start. Since the flux start is a singleton not run under flux run, you couldn't get the URI with flux uri jobid1/jobid2...

garlick commented on June 24, 2024

Oh duh! My example was not doing what I thought it was. I was just starting a size=1 Flux instance on the first node of the batch allocation, wasn't I? Yeah, this works:

[garlick@quartz386:~]$ sbatch -p pdebug -N2 --job-name flux --wrap "srun flux start sleep 360"
Submitted batch job 1533904
[garlick@quartz386:~]$ flux uri slurm:1533904
ssh://quartz3/var/tmp/garlick/flux-H2g2KI/local-0
[garlick@quartz386:~]$ flux proxy slurm:1533904
[garlick@quartz386:~]$ flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
      free      2       72        0 quartz[3-4]
 allocated      0        0        0 
      down      0        0        0 

Sorry for the noise!
