
Comments (10)

ryanday36 avatar ryanday36 commented on July 17, 2024 1

The goal is to add up the resource usage of all of a user's running jobs and prevent them from starting a new job if their resource usage would exceed some limit. The approach that Mark is suggesting sounds like it would work in most cases. In principle a user could exceed the limit by submitting some jobs that only specify nodes and others that only specify cores, but that would probably be rare in practice. Another place where these limits wouldn't work is jobs on node-exclusive clusters that specify a number of cores that isn't an even multiple of the cores per node, since those jobs will effectively reserve more cores than the ncores in the jobspec. Once again, though, I don't know how common that scenario actually is in practice. The most common case is probably -n 1, and we could pick that up easily by always assuming a running job is using at least one node, but I could also see someone submitting a bunch of -n 40 jobs on a cluster with 36 cores per node and getting many more jobs than the limit because those aren't being counted as using 72 cores.
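The node-exclusive rounding scenario above can be made concrete with a small sketch (the helper name is hypothetical; it just rounds a core request up to whole nodes):

```python
import math

def reserved_cores(ncores_requested, cores_per_node):
    """On a node-exclusive cluster a job reserves whole nodes, so the
    effective core usage is the request rounded up to full nodes."""
    nnodes = math.ceil(ncores_requested / cores_per_node)
    return nnodes * cores_per_node

# The scenario above: -n 40 on a cluster with 36 cores per node
print(reserved_cores(40, 36))  # 2 whole nodes reserved -> 72 cores
print(reserved_cores(1, 36))   # even -n 1 reserves a full node -> 36 cores
```

Counting such a job as 40 cores rather than 72 is exactly how a user could end up running more than the limit intends.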

from flux-core.

grondo avatar grondo commented on July 17, 2024 1

Very good points @ryanday36.
@cmoussa1 my apologies, I had lost sight of the overall goal for the accounting limits. I think you're headed in the right direction.


ryanday36 avatar ryanday36 commented on July 17, 2024 1

Just to continue to be a pain here: if we're going to convert things, it probably makes more sense to convert to ncores, because that will work for both node-exclusive and non-exclusive clusters (assuming we can tell if a cluster has a nodex match policy and can properly convert to the actual number of cores reserved for a given -n on nodex clusters).


grondo avatar grondo commented on July 17, 2024 1

To answer your original question, you can get access to the resources in an instance by fetching resource.R from the KVS. You'll have to parse the result yourself, though; we don't currently export an API to do that (though we've talked about it). The format for R is described in RFC 20.
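As a rough illustration of what parsing R could look like: the sample document below is hardcoded (a real one would come from the KVS, e.g. via `flux kvs get resource.R`), and the idset helper handles only the simple `"a-b,c"` form of RFC 22 idsets, not the full syntax:

```python
import json

def idset_count(s):
    """Count members of a simple idset string like "0-3,7" (no brackets,
    no suffix syntax) -- a minimal sketch, not a full RFC 22 parser."""
    total = 0
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            total += int(hi) - int(lo) + 1
        else:
            total += 1
    return total

# Illustrative R document in RFC 20 form (not fetched from a live instance).
R = json.loads("""
{"version": 1, "execution": {"R_lite": [
    {"rank": "0-3", "children": {"core": "0-35"}}
]}}
""")

nnodes = sum(idset_count(entry["rank"]) for entry in R["execution"]["R_lite"])
ncores = sum(idset_count(entry["rank"]) * idset_count(entry["children"]["core"])
             for entry in R["execution"]["R_lite"])
print(nnodes, ncores)  # 4 nodes, 144 cores
```

From totals like these, a plugin could estimate an average cores-per-node for the instance.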


grondo avatar grondo commented on July 17, 2024

I think the most complete solution might be to require both a cores and nodes limit for jobs, and if either is exceeded the job is rejected. This is what we ended up doing with the flux-core policy limits. This is mentioned in a note in flux-config-policy(5):

   NOTE:
     Limit checks take place before the scheduler sees the request, so it
     is possible to bypass a node limit by requesting only cores, or the
     core limit by requesting only nodes (exclusively), since this part of
     the system does not have detailed resource information. Generally
     node and core limits should be configured in tandem to be effective
     on resource sets with uniform cores per node. Flux does not yet
     have a solution for node/core limits on heterogeneous resources.
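The tandem configuration the note recommends might look like the following flux-config-policy(5) fragment (the values are illustrative, chosen for a uniform 36-cores-per-node cluster):

```toml
# Reject any job requesting more than 8 nodes or more than 288 cores
# (8 nodes x 36 cores/node), whichever limit the jobspec makes checkable.
[policy.limits.job-size.max]
nnodes = 8
ncores = 288
```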


cmoussa1 avatar cmoussa1 commented on July 17, 2024

OK, that might be a reasonable start. Are you thinking the limit would be represented like:

resources = nnodes + ncores

or something different? And if a job would exceed a max_resources limit, hold the job?


grondo avatar grondo commented on July 17, 2024

I was thinking you'd check both values and if either exceeded the configured limit then the job is rejected. If you can't tell how much of a resource is in the jobspec, then just skip that test. That way you are always checking at least one limit.
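That check-what-you-can logic could be sketched like this (function and parameter names are hypothetical; `None` stands in for a count that couldn't be determined from the jobspec):

```python
def check_limits(nnodes, ncores, max_nnodes, max_ncores):
    """Reject if any known resource count exceeds its limit; skip the
    test for a resource that could not be determined from the jobspec."""
    if nnodes is not None and nnodes > max_nnodes:
        return False  # reject: node limit exceeded
    if ncores is not None and ncores > max_ncores:
        return False  # reject: core limit exceeded
    return True       # accept: no known count exceeded its limit

print(check_limits(2, None, 4, 144))    # cores unknown, nodes within limit -> True
print(check_limits(None, 200, 4, 144))  # nodes unknown, core limit exceeded -> False
```

Since a jobspec normally specifies at least one of the two, at least one limit gets checked for every job.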


cmoussa1 avatar cmoussa1 commented on July 17, 2024

I think the goal (at least for accounting) here is to be able to enforce a resource limit across all of a user's running jobs. If we go with the above, then if a job would exceed either limit (ncores or nnodes), I believe the job should be held until the user goes back under their limit. @ryanday36 should correct me if I am wrong, however.

But maybe we could just add a max_ncores limit to all user rows and check both like you mentioned?


cmoussa1 avatar cmoussa1 commented on July 17, 2024

No problem @grondo - I probably should've given more background as to why the limit needed to be there in the first place. So it sounds like we should keep separate counts of both ncores and nnodes across a user's set of running jobs?

This is mainly why I asked if there was a function to gather total node/core counts on a system with flux resource info. 🙂 With this, I could at least estimate a cores-per-node count for that system, and when a user only specifies cores, it could be converted to a rough nnodes count. I understand this might not be entirely accurate, especially for systems where there is not a uniform cores-per-node count across all nodes, but perhaps it's an okay start? Sorry if I am still misunderstanding.

(actually, now that I think about it, if the above sounds okay, then I'm not sure keeping track of ncores across a user's set of running jobs is entirely necessary since we would be converting to nnodes)
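The cores-to-nodes conversion described above could be sketched as follows (the helper is hypothetical and assumes a uniform cores-per-node count, which is the acknowledged limitation):

```python
import math

def estimated_nnodes(jobspec_nnodes, jobspec_ncores, cores_per_node):
    """Use nnodes from the jobspec when present; otherwise estimate it
    from ncores by rounding up. None means 'not in the jobspec'."""
    if jobspec_nnodes is not None:
        return jobspec_nnodes
    return math.ceil(jobspec_ncores / cores_per_node)

# Tally estimated node usage across a user's running jobs,
# assuming 36 cores per node.
jobs = [(2, None), (None, 40), (None, 1)]  # (nnodes, ncores) per job
total = sum(estimated_nnodes(n, c, 36) for n, c in jobs)
print(total)  # 2 + 2 + 1 = 5 estimated nodes
```

With everything converted to an nnodes estimate, a single per-user node count could be compared against the limit.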


cmoussa1 avatar cmoussa1 commented on July 17, 2024

Thanks for the advice here. After some time playing around with this, I think I was able to get somewhere. I've opened a PR over in flux-framework/flux-accounting#469 that proposes adding some work during plugin initialization to estimate the cores-per-node on the system it's loaded on by fetching resource.R. This could be a start to actually keeping track of/estimating a job's resources later on. Let me know what you think.

