
Comments (11)

grondo avatar grondo commented on September 9, 2024 1

> I know slurm can kill a job if one of the processes has a nonzero exit code. I haven't seen slurm kill a job because one of the processes returned a zero exit code, but I am happy to be corrected.

See the documentation of -W --wait= in the srun(1) man page:

> -W, --wait=
> Specify how long to wait after the first task terminates before terminating all remaining tasks. A value of 0 indicates an unlimited wait (a warning will be issued after 60 seconds). The default value is set by the WaitTime parameter in the slurm configuration file (see slurm.conf(5)). This option can be useful to ensure that a job is terminated in a timely fashion in the event that one or more tasks terminate prematurely. Note: The -K, --kill-on-bad-exit option takes precedence over -W, --wait to terminate the job immediately if a task exits with a non-zero exit code. This option applies to job allocations.

At LLNL we have WaitTime = 30 set in our Slurm config, possibly since the time this option was added, because this ends up saving many wasted compute cycles in our environment. I'm not sure we've ever had an issue with the result being surprising, but this is probably because of our workload. Note that neither Slurm's --wait nor Flux's -o exit-timeout makes a judgement about the actual exit code of the early-exiting process -- whether the task exits zero or nonzero, it is considered an abnormal condition if a task exits long before the other tasks in a parallel job.
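For concreteness, the two knobs compare roughly as follows on the command line (a sketch; exact syntax is documented in srun(1) and flux-shell(1), and the program and task counts shown are placeholders):

```shell
# Slurm: after the first task exits, wait 30 seconds before killing the rest.
srun --wait=30 -N2 -n4 ./my_app

# Flux: the equivalent job shell option takes a duration (e.g. "30s")...
flux run -o exit-timeout=30s -N2 -n4 ./my_app

# ...or "none" to disable the early-exit timeout entirely.
flux run -o exit-timeout=none -N2 -n4 ./my_app
```

Both require a working Slurm or Flux installation, respectively; these lines only illustrate where the option is set.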

I think you make a good argument that exit-timeout = none should be the default. The default also appears to be WaitTime = 0 in Slurm, though a warning is issued as noted in the documentation above. Sites could then set a different default in the job shell initrc as described above. We should get some other input here (e.g. from @garlick and @ryanday36), but I'd be willing to change the default and update our site configuration. Maybe we could strategize an easier way for interested sites to set a default.

from flux-core.

grondo avatar grondo commented on September 9, 2024

Agreed the terminology is a bit confusing. The term "first" here indicates the order in which tasks exit. Since tasks start in parallel, there is no "first" task in a parallel job (tasks are instead explicitly ranked 0 through size-1). However, perhaps "first exiting task" would be clearer?

> The current behaviour kills entire jobs when one of the tasks completed its work both a) earlier than the other tasks; and b) successfully, with a zero exit code.

For better or worse, this is the intended behavior. Most parallel jobs are MPI programs in which all tasks operate as a unit and if a task exits early this could cause the job to hang until a timeout.

To get the behavior you want until the default changes, you could add

-- Set site defaults, but don't override values supplied by the user:
if shell.options['exit-timeout'] == nil then
    shell.options['exit-timeout'] = "none"
end
if shell.options['exit-on-error'] == nil then
    shell.options['exit-on-error'] = 1
end

to the default shell initrc.lua.
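Per-job, the same effect can be had without touching the initrc by passing the shell options directly (a sketch; option names as described in this thread and in flux-shell(1), with a placeholder program):

```shell
# Disable the early-exit timeout and instead terminate only on a
# nonzero task exit status, for this job only:
flux run -o exit-timeout=none -o exit-on-error -n4 ./my_app
```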

I can see an argument to disable the exit-timeout by default and have sites where it is applicable set one in their initrc.lua (or have some other method of global configuration, perhaps in the instance config). The current behavior is modeled off the default configuration for other RMs here at LLNL.

grondo avatar grondo commented on September 9, 2024

Oh, I should mention that exit-on-error will terminate a job immediately if a task exits with a nonzero status, which isn't exactly what you were requesting.

dmcdougall avatar dmcdougall commented on September 9, 2024

> Most parallel jobs are MPI programs in which all tasks operate as a unit and if a task exits early this could cause the job to hang until a timeout.

Maybe I'm missing something. If an MPI rank exits early, I expect that is usually an indication of a problem. Have you observed cases where an MPI rank (a Linux process) exits with status zero in an erroneous situation? That said, it's certainly not a problem if one simply uses MPI as a convenient process launcher for a job that doesn't need any communication and just needs to expose parallelism.

It's also not clear to me how this is expected to work at MPI_Finalize() time of an MPI application. MPI_Finalize() is collective, but it is possible (perhaps implementation-dependent?) that the call returns early on some ranks, so that some processes exit while others are still finishing when things aren't perfectly load-balanced. I admit I haven't run into these cases, and that's probably quite a contrived example. I'm simply trying to motivate why killing every process in a job, because one of the processes ended early and successfully, is likely to be considered surprising behaviour by end-users.

I know slurm can kill a job if one of the processes has a nonzero exit code. I haven't seen slurm kill a job because one of the processes returned a zero exit code, but I am happy to be corrected.

dmcdougall avatar dmcdougall commented on September 9, 2024

I didn't know about --wait. I suspect I'd only used systems that had it set to 0 and so I never really saw the effects of it. Thanks for pointing that out.

Thanks for hearing out my concern. I'm happy to hear other opinions too.

I lean towards having the defaults mirror the behaviour in slurm, but most of my experience has been with slurm and so I definitely have a bias.

grondo avatar grondo commented on September 9, 2024

Thank you for commenting!

garlick avatar garlick commented on September 9, 2024

That does seem like a good case for disabling the exit timeout behavior by default. FWIW, I've hit this too when running things that aren't MPI in parallel (e.g. using flux run like pdsh(1)).

I'm not sure it's a great outcome if one site turns this on and another doesn't, as that could lead to workflow script portability problems. What if, by default, we throw a non-fatal job exception that suggests the option?

grondo avatar grondo commented on September 9, 2024

We did fashion the behavior after our site default, so I'm not sure if disallowing a change of default behavior would be acceptable. All job shell options can currently be set in the initrc -- are you suggesting that should not be possible?

I like the idea of having a warning (nonfatal exception) by default.

garlick avatar garlick commented on September 9, 2024

Well maybe a topic for discussion anyway :-)

grondo avatar grondo commented on September 9, 2024

My opinion is that it would be going too far to disallow site changes to job shell behavior. Even if we disable writing to shell.options from the initrc, shell plugins can also modify optional shell behavior, so we'd have to remove the ability for sites to add plugins as well to fulfill some kind of promise of perfect workflow script portability.

Sites can also make changes to configuration of other systems, the environment and default PATH, Unix shell profiles and initrcs, etc., which could potentially break users' jobs and "workflow scripts". So promising that your workflow environment will be identical when running under Flux is not something we can or should do.

The default shell initrc can be overridden at runtime, which is something users with sensitive workflows could consider (along with specifying an explicit environment), e.g. similar to using the bash --norc option to remove the influence of system bashrc files.
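A sketch of that approach (the initrc path here is hypothetical; see flux-shell(1) for the initrc shell option):

```shell
# Load a user-supplied initrc instead of the system default,
# analogous to running bash with --norc:
flux run -o initrc=$HOME/my-initrc.lua -n4 ./my_app
```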

garlick avatar garlick commented on September 9, 2024

Those are sensible arguments IMHO. Well, anyway, it's a bit off-topic for this issue, so we can take it up elsewhere if need be. (But I'm content for now.)
