
Comments (7)

grondo avatar grondo commented on September 9, 2024

Short answer: use flux submit instead of flux mini submit.

flux mini was deprecated in v0.48.0 (2023-03-07) and dropped in v0.59.0 (2024-02-06) in favor of flux submit, flux run, flux alloc, etc.
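
For scripts that still call flux mini, a mechanical rewrite is usually enough. The following is a minimal sketch, assuming the script only uses flux mini subcommands that have a direct top-level equivalent (submit, run, alloc, batch, bulksubmit). The function name migrate_flux_mini is hypothetical and not part of flux.

```shell
# Hypothetical helper: rewrite "flux mini <subcmd>" to "flux <subcmd>" on stdin.
# Only the subcommands listed here are touched; everything else passes through.
migrate_flux_mini() {
    sed -E 's/flux[[:space:]]+mini[[:space:]]+(submit|run|alloc|batch|bulksubmit)/flux \1/g'
}

# Preview the rewrite of one line without modifying any file.
echo 'flux mini submit -n1 ./a.out' | migrate_flux_mini
```

Piping a whole script through the function and reviewing the result before replacing the original keeps the change easy to check.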

from flux-core.

jpb3698 avatar jpb3698 commented on September 9, 2024

Great, that worked, thanks! However, I'm having issues where Slurm is killing my job after the first 720 tasks are run. I'm getting these messages: broker.err[3]: quartz405 (rank 7) transitioning to LOST due to EHOSTUNREACH error on send, and then at the end I get this:

srun: error: Node failure on quartz405
slurmstepd: error: *** STEP 2483483.0 ON quartz116 CANCELLED AT 2024-03-27T12:32:44 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
slurmstepd: error: *** JOB 2483483 ON quartz116 CANCELLED AT 2024-03-27T12:32:44 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
Mar 27 12:32:44.534897 broker.err[0]: rc2.0: ./runScript.sh 0 Hangup (rc=129) 6988.9s


jpb3698 avatar jpb3698 commented on September 9, 2024

Any ideas on why this is happening? I tried to search for the SLURMCTLD log file, but I couldn't find it.


grondo avatar grondo commented on September 9, 2024

It looks like quartz405 crashed or became unresponsive according to Slurm. If there wasn't an actual node failure, another reason this could happen is that the node ran out of memory.
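
One rough way to check the out-of-memory hypothesis is to compare the combined footprint of the tasks packed onto a node with that node's memory. The following is a minimal sketch, assuming a Linux node (for /proc/meminfo) and that you have your own estimate of per-task memory use. The function name tasks_fit and the example numbers are hypothetical.

```shell
# Hypothetical check: do N concurrent tasks of a given size (MB) fit in total_mb?
# Prints "fits" or "exceeds"; it ignores memory used by the OS and file cache.
tasks_fit() {
    tasks=$1; mb_per_task=$2; total_mb=$3
    if [ $((tasks * mb_per_task)) -le "$total_mb" ]; then
        echo fits
    else
        echo exceeds
    fi
}

# MemTotal is reported in kB; convert to MB before comparing.
if [ -r /proc/meminfo ]; then
    node_mb=$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)
    tasks_fit 36 4000 "$node_mb"
fi
```

This only gives a coarse upper bound; a node can still run out of memory if individual tasks use more than the estimate.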

The slurmctld log file is usually on the management node and is viewable by sysadmins only; I'm not sure why Slurm is telling the user to look there. If nothing else is evident, you could try contacting the local hotline to ask what happened to that node.


jpb3698 avatar jpb3698 commented on September 9, 2024

Got it, thank you!


grondo avatar grondo commented on September 9, 2024

BTW, I'll just mention here since I noticed this in your script:

Usage such as:

for proc in {0..719}
do
    JOBIDS="$JOBIDS $(flux submit -n1 --output=result.${proc}.out command input.${proc}.in)"
done
wait_all "$JOBIDS"

can now be submitted and waited for much more efficiently with a single submit command:

flux submit --wait --cc=0-719 -n1 --output=result.{cc}.out command input.{cc}.in

The --cc option submits multiples of the same job, with {cc} substituted with the index in each submission.

The --wait option in flux submit waits for all submitted jobs to complete before returning.
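
To make the {cc} substitution concrete, here is a plain-shell sketch that prints the per-index command lines a range such as --cc=0-3 corresponds to. It does not call flux and is not how flux implements the option. The function name expand_cc is hypothetical, and it handles only a simple first-last range, whereas flux accepts more general idsets.

```shell
# Hypothetical illustration: print the command line each job would get
# when {cc} is replaced by its index. Handles only a "first-last" range.
expand_cc() {
    range=$1; shift
    first=${range%-*}
    last=${range#*-}
    i=$first
    while [ "$i" -le "$last" ]; do
        # Replace every literal {cc} in the remaining arguments with the index.
        printf '%s\n' "$*" | sed "s/{cc}/$i/g"
        i=$((i + 1))
    done
}

expand_cc 0-3 'command input.{cc}.in --output=result.{cc}.out'
```

Each printed line corresponds to one job in the single flux submit --cc invocation.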

I'm not suggesting you change your current script if it is working fine now; this is more of an FYI and instructional for others who may stumble across this issue report. flux submit --cc will be many times faster at submitting jobs than multiple flux submit calls in a loop.

Thanks!


grondo avatar grondo commented on September 9, 2024

Assuming this question is answered, I'll go ahead and close the issue. Feel free to reopen if there's anything else.

