
Comments (9)

mej commented on August 20, 2024

Thanks, Brie!

This is, at least in part, a known issue. Parallel/MPI jobs have a similar issue, in that the jobscript only exists on the job's head node.

I've been hesitant to invoke SLURM commands from within NHC since it is often spawned from SLURM itself, and I certainly don't want to risk creating any sort of deadlock situation. I have not, however, had a chance to discuss with @jette or @dannyauble whether or not this is safe; it may be perfectly okay in SLURM.

Do you happen to know if squeue talks to the local slurmd or the master node's slurmctld? Do you run NHC via SLURM or via cron?

from nhc.

jette commented on August 20, 2024

Hi Michael,

squeue only talks with the master node's slurmctld.

There are very few Slurm commands that communicate directly with the local slurmd (e.g. slurm_job_step_get_pids). Also, the slurmd is extensively multi-threaded, and I do not believe that you could create a deadlock situation from NHC (if you did, I would consider that a Slurm bug).



mej commented on August 20, 2024

Okay, great, thanks @jette! That's exactly what I was hoping for.

@bbbbbrie: What this means is basically that your fix is dead-on correct. I'll just need to abstract out the command and parameters into variables, as I always do, to give sites the flexibility of tweaking things if need be (e.g., for sites like ours, where -w localhost gives an error and -w <nodename> must be used instead).
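
A minimal sketch of what that abstraction might look like, not the actual NHC code. The names SQUEUE_CMD, SQUEUE_ARGS, NODE_NAME, and users_on_node are hypothetical and chosen only for illustration; the squeue options used (-h for no header, -o for the output format with %u as the user name, -w for the node list) are standard squeue options.

```shell
#!/usr/bin/env bash
# Hypothetical variable names; a site could override any of them
# in its environment or configuration before this file is sourced.
SQUEUE_CMD="${SQUEUE_CMD:-squeue}"
SQUEUE_ARGS="${SQUEUE_ARGS:--h -o %u}"
# Short host name of this node, used instead of "localhost".
NODE_NAME="${NODE_NAME:-${HOSTNAME%%.*}}"

# Print the users that have jobs on the given node, one per line,
# with duplicates removed.
users_on_node() {
    local node="${1:-$NODE_NAME}"
    # SQUEUE_ARGS is deliberately unquoted so it splits into words.
    $SQUEUE_CMD $SQUEUE_ARGS -w "$node" | sort -u
}
```

Calling `users_on_node` with no argument would query the local node by name; a site where a different option set is needed could export SQUEUE_ARGS without touching the function.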

My only remaining concern would be one of concurrency at scale. Since SLURM (wisely) runs NHC on all nodes simultaneously, we can assume that they would all be running the squeue inquiry at roughly the same time as well. If that causes delays at scale, we might have to pre-generate the user <-> node mapping in advance. (Not that I think it will...I just like to plan ahead!) :-)
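
If pre-generating the mapping ever became necessary, one possible shape is sketched below. This is an assumption-laden illustration, not something the thread settled on: the file path, the function names, and the idea of distributing a flat "node user" file are all hypothetical. It relies on the squeue format fields %u (user name) and %N (allocated node list) and on `scontrol show hostnames` to expand a compressed node list.

```shell
#!/usr/bin/env bash
# Hypothetical location of the pre-generated map: one "node user" pair per line.
NODE_USER_MAP="${NODE_USER_MAP:-/tmp/nhc-node-users.map}"

# Intended to run once on a central host: a single squeue call for the
# whole cluster instead of one call per node.
generate_node_user_map() {
    squeue -h -t running -o '%u %N' | while read -r user nodelist; do
        # Expand e.g. "n[01-04]" into one host name per line.
        scontrol show hostnames "$nodelist" | while read -r node; do
            printf '%s %s\n' "$node" "$user"
        done
    done | sort -u
}

# Intended to run on each node: a local file lookup that generates
# no slurmctld traffic.
users_from_map() {
    local node="$1" map="${2:-$NODE_USER_MAP}"
    awk -v n="$node" '$1 == n { print $2 }' "$map" | sort -u
}
```

How the map file reaches the nodes (shared filesystem, push from the head node, and so on) is left open here, and a stale map would need handling before this could replace the direct squeue query.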

I'll look at implementing this, or if you'd prefer, feel free to go ahead and send a Pull Request against the dev branch. Thanks again for the report! And thanks Moe for your insights!


jette commented on August 20, 2024

The squeue commands will be processed in parallel by the slurmctld (head node) daemon. At some point scalability will become a concern, but we don't see problems today with thousands of nodes. I would recommend adding filter options to squeue if possible (e.g. "--user=mej", "--state=running", etc.), which can reduce the amount of data the slurm daemons and commands need to process and improve scalability.
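
As a rough sketch of that recommendation, the helper below assembles a filtered squeue argument list. The function name squeue_filter_args is hypothetical; the long options shown (--noheader, --states, --nodelist, --format, --user) are the spellings from the squeue man page, with --states standing in for the "--state=running" written above.

```shell
#!/usr/bin/env bash
# Build a filtered squeue argument list for one node and, optionally,
# one user. Hypothetical helper, for illustration only.
squeue_filter_args() {
    local node="$1" user="${2:-}"
    # Restrict the query to running jobs on this node and print only
    # the user name, so less data has to be gathered and transferred.
    local args=(--noheader --states=running --nodelist="$node" --format=%u)
    if [ -n "$user" ]; then
        args+=(--user="$user")
    fi
    printf '%s\n' "${args[*]}"
}
```

The printed arguments could then be passed to squeue, for example `squeue $(squeue_filter_args "$node")`, assuming the node name contains no whitespace.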


bbbbbrie commented on August 20, 2024

Thank you both for your feedback!

@mej I am submitting a Pull Request containing some of the details discussed here. I am using -w nodename, as that syntax works where -w localhost fails (and continues to work where -w localhost succeeds). I have abstracted the squeue command and arguments into variables as you suggested. I will defer to your judgment on style for this.

This has been tested with NHC 1.4.2 on clusters running SLURM 14.11 and 15.08.

Thank you and please let me know if you have any questions!

(Please feel free to close this when appropriate.)


OleHolmNielsen commented on August 20, 2024

We have a new CentOS 7.2 cluster running the Slurm 16.05 batch system. We have experienced the very same problem reported above! I look forward to an updated NHC version, and in the meantime I'll have to comment out any check_ps_unauth_users checks in nhc.conf.


mej commented on August 20, 2024

@OleHolmNielsen: The mej:dev branch has this fix already. Are you able to build from that branch? I'm in the process of changing jobs right now, so I'm not sure when I'll be back in a position to get 1.4.3 into beta. I'm working on getting it figured out, though, and I'll see what I can do if you aren't able to build your own RPMs from the development tree.


OleHolmNielsen commented on August 20, 2024

Hi Michael,

Thanks for the info, and good luck with your new job!

I retrieved the https://github.com/mej/nhc/tree/dev zip file, but I'm
not experienced in building RPMs from GitHub. Could you kindly add a
few instructions to the README file? This is what I figured out:

unzip nhc-dev.zip
mv nhc-dev lbnl-nhc-1.4.3
cd lbnl-nhc-1.4.3
./autogen.sh
cd ..
tar czvf ~/rpmbuild/SOURCES/lbnl-nhc-1.4.3.tar.gz lbnl-nhc-1.4.3
rpmbuild -ta ~/rpmbuild/SOURCES/lbnl-nhc-1.4.3.tar.gz

Would you agree that this sounds correct?

I'll be going to SC16 in Salt Lake City with a group of sysadmins from
Denmark. Are you going to be there?

Best regards,
Ole


mej commented on August 20, 2024

This is fixed in the dev branch which is about to be released. Closing this issue.
