Comments (9)

garlick commented on September 27, 2024

I was able to reproduce this on my test system with a node that was down:

  • fluxion with easy queue policy
  • multiple queues
  • node was down before flux started

To reproduce:

  • submit a job with --requires=host:X where X is the down node
  • cancel it
  • observe leaked alloc in flux queue status
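
The steps above can be sketched as a small shell function. This is only an illustration: `repro_down_node_leak` is a hypothetical name, `hostname` is an assumed stand-in for any job command, and on a system with multiple queues and no default queue a `-q QUEUE` option would presumably be needed on the submit line.

```shell
# Hypothetical helper (not part of flux): replay the reproduction steps.
# $1 = name of the node that is down.
repro_down_node_leak() {
    host=${1:-}
    [ -n "$host" ] || { echo "usage: repro_down_node_leak HOST" >&2; return 2; }

    # Submit a job that can only run on the down node, so it cannot start.
    flux submit --requires=host:"$host" hostname || return 1

    # Cancel the job that was just submitted.
    flux cancel "$(flux job last)" || return 1

    # Inspect the result: a nonzero "alloc requests pending to scheduler"
    # count with no jobs left would indicate the leaked alloc request.
    flux queue status -v
}
```

Calling `repro_down_node_leak X`, with X the down node, runs the three steps in order and ends by printing the queue status for inspection.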

Still probing to determine which of the above characteristics are actually required to reproduce. So far I've confirmed it does not reproduce in a sub instance with a drained node, no queues, and fcfs queue policy.

from flux-core.

grondo commented on September 27, 2024

Forgot to mention that the apparent real number of pending jobs is 25:

# flux jobs -Af pending | wc -l
25
# flux module stats job-manager 
{
 "journal": {
  "listeners": 1
 },
 "active_jobs": 57,
 "inactive_jobs": 16598,
 "max_jobid": 402543535884075008
}

grondo commented on September 27, 2024

Actually I'm unsure that this is causing the scheduling issue because the job manager should be sending unlimited alloc requests to the scheduler. A bit stumped at this point.

grondo commented on September 27, 2024

FYI - reloading fluxion modules resolved this situation:

0 alloc requests queued
1 alloc requests pending to scheduler
41 running jobs

garlick commented on September 27, 2024

Well, it seems neither --requires=host:X nor starting flux with a node down is required. It is enough to drain a node, submit a job that can't run without one more node, and then cancel it.

 garlick@picl0:~$ sudo flux resource drain picl1
 garlick@picl0:~$ flux submit -N2 -q debug hostname
ƒ2deLurvY3R
 garlick@picl0:~$ flux jobs
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
 ƒ2deLurvY3R debug    garlick  hostname    S      2      2      30s 
 garlick@picl0:~$ flux queue status -v
debug: Job submission is enabled
debug: Scheduling is started
all: Job submission is enabled
all: Scheduling is started
admin: Job submission is enabled
admin: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
 garlick@picl0:~$ flux cancel $(flux job last)
 garlick@picl0:~$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
 garlick@picl0:~$ flux queue status -v
debug: Job submission is enabled
debug: Scheduling is started
all: Job submission is enabled
all: Scheduling is started
admin: Job submission is enabled
admin: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
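
The check being done by eye in the transcript above can be sketched as a shell filter over the `flux queue status -v` output. This is a rough heuristic, not a flux feature: `pending_allocs` and `check_alloc_leak` are hypothetical names, and the rule assumed here is that more alloc requests pending to the scheduler than jobs actually pending suggests a leak.

```shell
# Hypothetical helpers (not part of flux).

# Print the "alloc requests pending to scheduler" count found in
# `flux queue status -v` output read from stdin; fail if the line is absent.
pending_allocs() {
    awk '/alloc requests pending to scheduler/ { print $1; found = 1; exit }
         END { if (!found) exit 1 }'
}

# Compare that count with the number of jobs really pending ($1).
# Reads `flux queue status -v` output on stdin.
check_alloc_leak() {
    jobs_pending=${1:-0}
    allocs=$(pending_allocs) || { echo "unparseable"; return 2; }
    if [ "$allocs" -gt "$jobs_pending" ]; then
        echo "leak: $allocs alloc request(s) pending, $jobs_pending job(s) pending"
        return 1
    fi
    echo "ok"
}
```

For the state shown after the cancel, `flux queue status -v | check_alloc_leak 0` would flag the single leftover alloc request, since no jobs remain pending.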

garlick commented on September 27, 2024

OK, it's trivially reproducible in a standalone flux instance with no queues if the default queue policy of fcfs is changed to easy:

$ cat fluxion.toml
[sched-fluxion-qmanager]
queue-policy = "easy"
$ flux start -s2 -o,--conf=fluxion.toml
$ flux resource drain 0
$ flux submit -N2 hostname
ƒ2VGsWxto
$ flux cancel $(flux job last)
May 20 09:48:27.112761 sched-fluxion-qmanager.err[0]: jobmanager_cancel_cb: remove job (3284324581376): No such file or directory
$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
$ flux queue status -v
Job submission is enabled
Scheduling is started
0 alloc requests queued
1 alloc requests pending to scheduler
0 running jobs
$ exit
<hang>

grondo commented on September 27, 2024

cc @trws since it was surprising that jobs were not scheduled by Fluxion when we hit this issue.

trws commented on September 27, 2024

Thanks @grondo, it's on my list for today to look into this. I think it should be solved by the change I pushed the other day to deal with flux-framework/flux-sched#1208, but it's so easy for this to go wrong in unexpected ways that I want to be completely sure.

garlick commented on September 27, 2024

Let's close this issue. I opened flux-framework/flux-sched#1210 for the sched follow-up.
