Comments (11)
Hi Ryan, this is deployed now. Try it out and let me know if you need any adjustments:
- `k8s.resources.use_ephemeral_storage_resource_specs`
- `k8s.resources.ephemeral_storage_offset`
- `k8s.resources.limits.ephemeral_storage_limit_safety_factor`
Note from Rod:

> maxwdir, divided by the corecount for SCORE, is the space for input+output+workdirsize. Input is known per job, but output and workdirsize are measured from the scouts. Then it can vary between jobs, e.g. the high-lumi part of a run has more output, or more debug in stdout. So it can surely exceed the maxwdir, and it is usually harmless where the disk is shared on the node. There are some safety limits applied by the pilot to stop it filling the whole disk, e.g. 2GB stdout, 2*maxwdir total(?).
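A quick worked example of that budgeting, with made-up numbers (my reading of the note, not values from any real queue):

```python
# Illustrative numbers only: how maxwdir is prorated per core.
maxwdir_mb = 80000                      # hypothetical maxwdir for the queue (MB)
corecount = 8                           # cores the budget is spread over

per_core_mb = maxwdir_mb / corecount    # 10000 MB for a single-core (SCORE) job
mcore_job_mb = per_core_mb * 8          # an 8-core job gets the full 80000 MB
```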
@fbarreir
Rather than setting only a `sizeLimit` on the emptyDir, we should apply a holistic ephemeral storage request and limit for the whole pod, with the benefit of allowing flexible disk (over)use with moderate protections.
So I suggest a similar approach for disk as #116, while we're at it.
I suggest two new properties in CRIC, perhaps something like `k8s.resources.ephemeral_storage.base_amount` and `k8s.resources.ephemeral_storage.safety_factor`.
The base amount is for k8s-specific storage requirements (e.g. the size of the pod's container image) or other per-job overhead that is independent of the number of cores and not accounted for in maxwdir (though it may only be a few GB in practice). The safety factor is by how much the limit may exceed the request.
The YAML will look a bit different, e.g.

```yaml
resources:
  requests:
    ephemeral-storage: "100Gi"
  limits:
    ephemeral-storage: "150Gi"
```
where:
- the request is equal to `base_amount + X`
- the limit is equal to `(base_amount + X) * safety_factor`
- `X` is the number Harvester currently uses for the emptyDir volume, based on prorating maxwdir by the number of cores (see the sketch below)
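A minimal sketch of that calculation (the function and names are just illustration, not actual Harvester code):

```python
def ephemeral_storage_specs(base_amount_mb, safety_factor, maxwdir_mb, corecount, job_cores):
    """Sketch of the suggested request/limit formula; all sizes in MB,
    safety_factor as a plain multiplier (e.g. 1.2)."""
    x_mb = maxwdir_mb / corecount * job_cores   # X: maxwdir prorated by cores
    request_mb = base_amount_mb + x_mb          # request = base_amount + X
    limit_mb = request_mb * safety_factor       # limit = (base_amount + X) * safety_factor
    return request_mb, limit_mb

# e.g. 2000 MB base, factor 1.2, maxwdir 80000 MB over 8 cores, 8-core job:
print(ephemeral_storage_specs(2000, 1.2, 80000, 8, 8))  # (82000.0, 98400.0)
```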
OK, this is actually better than the current emptyDir implementation and would be good to apply to all queues.
I wrote the code for this feature (not tested or deployed yet). I'm not fully clear about the formula, so for the moment I've just followed your suggestion.
The fields in CRIC would be:
- `k8s.resources.use_ephemeral_storage_resources`: toggle to turn limits AND requests on/off. Eventually this feature should be used for all queues, but for the transition I prefer to enable queues manually until I'm confident.
- `k8s.resources.ephemeral_storage_offset`: what you called base_amount. Applies to both limits and requests. In MB.
- `k8s.resources.limits.ephemeral_storage_limit_safety_factor`: the safety factor, applied only to limits. In %.
Agree on the above?
How do you want to transition your queues? Are you going to keep the `pilot-dir` emptyDir or reconfigure those settings? If you keep it, I understand I would have to remove the code that sets the `sizeLimit`?
* `k8s.resources.use_ephemeral_storage_resources`: technically we are using ephemeral storage no matter what. Maybe `k8s.resources.use_ephemeral_storage_quota` would be better?

Otherwise, yes, that sounds good.
Note that units are important and should be applied consistently when translating from CRIC/Harvester to k8s.
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory
In Kubernetes, for both memory and storage, M means MB (10^6 bytes) and Mi means MiB (2^20 bytes), and similarly for G and Gi. The difference between a given amount in G(B) and in Gi(B) can be significant for large amounts.
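To get a feel for the gap, a quick check (plain Python, nothing Harvester-specific):

```python
G, Gi = 10**9, 2**30
print(Gi / G)              # 1.073741824 -> Gi is ~7.4% larger than G
print(100 * Gi - 100 * G)  # 7374182400 bytes (~7.4 GB) off at the 100G scale
```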
Before transitioning, the emptyDir volume should look like this (NB the `{}`):

```yaml
volumes:
  - name: pilot-dir
    emptyDir: {}
```
https://kubernetes.io/docs/concepts/storage/volumes/#emptydir-configuration-example
We still need the emptyDir volume to provide the ephemeral storage; it just won't have a `sizeLimit` on it anymore. Then we add the ephemeral-storage request and limit to the `resources` section of the container in the pod.
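Putting the two pieces together, the relevant parts of the pod spec would look roughly like this (a sketch in plain dict form; the field layout follows the k8s pod spec, and the names and amounts are just the examples from above):

```python
# Sketch only: the emptyDir keeps no sizeLimit; the quota moves to the
# container's resources section instead.
pod_spec = {
    'containers': [{
        'name': 'pilot',
        'resources': {
            'requests': {'ephemeral-storage': '100Gi'},
            'limits': {'ephemeral-storage': '150Gi'},
        },
    }],
    'volumes': [
        {'name': 'pilot-dir', 'emptyDir': {}},  # no sizeLimit anymore
    ],
}
```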
Sorry, I was derailed by other time-sensitive stuff.

- I don't like the word quota in `k8s.resources.use_ephemeral_storage_quota`. Is `k8s.resources.use_ephemeral_storage_resource_specs` OK?
- I reviewed the usage of binary (MiB, GiB) vs decimal (MB, GB) notation:
  - Generally PanDA uses the binary notation, so we need to stick to that.
  - The memory resources were used wrongly in the k8s plugins. I changed them to binary notation. This will increase the requests/limits by a couple of percent; do you see any risk for your production queue?
  - I implemented the disk resources in binary notation. Interestingly, I submit [1] and it appears as [2] in `kubectl describe pod`.

I'm ready to deploy, but I'd prefer your green light regarding the memory-increase side effect. Also, the `pilot-dir` `sizeLimit` will disappear once I deploy (the volume stays defined, as you said).
[1]

```
'resources': {'requests': {... 'ephemeral-storage': '7.1Gi'}, 'limits': {... 'ephemeral-storage': '7.81Gi'}}
```

[2]

```
Limits:
  ...
  ephemeral-storage:  8385923645440m
Requests:
  ...
  ephemeral-storage:  7623566950400m
```
Sure, `k8s.resources.use_ephemeral_storage_resource_specs` sounds okay.

There's no problem with changing the memory units; the k8s scheduler will make sure the requested resources are available either way. I can adjust the limits if needed.

Those limits and requests are correct, though it is funny that they get converted to "millibytes". That is probably because those decimal numbers multiplied by 2^30 still have fractional parts; presumably they get rounded to whole bytes at some point afterward.
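The arithmetic checks out exactly; a quick integer-math check (avoiding float rounding):

```python
Gi = 2**30
# 7.81 Gi in millibytes is 7.81 * 2^30 * 1000, computed here as integers:
print(781 * Gi * 10)   # 8385923645440 -> the limit kubectl shows
print(71 * Gi * 100)   # 7623566950400 -> the request (7.1 Gi)
```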
Thanks @fbarreir, it is working!

However, the amounts are indeed difficult to read in this form, e.g.

```
$ kubectl get pods -n harvester -o custom-columns=NAME:.metadata.name,CPU_R:.spec.containers[].resources.requests.cpu,CPU_L:.spec.containers[].resources.limits.cpu,MEM_R:.spec.containers[].resources.requests.memory,MEM_L:.spec.containers[].resources.limits.memory,DISK_R:.spec.containers[].resources.requests.ephemeral-storage,DISK_L:.spec.containers[].resources.limits.ephemeral-storage
NAME                     CPU_R   CPU_L   MEM_R     MEM_L     DISK_R             DISK_L
grid-job-9043323-nxzgw   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043324-6wnvx   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043325-znkdw   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043344-9k869   900m    1       2000Mi    4000Mi    23858543329280m    28625957027840m
grid-job-9043350-55bq2   900m    1       2000Mi    4000Mi    23858543329280m    28625957027840m
grid-job-9043351-klq6c   900m    1       2000Mi    4000Mi    23858543329280m    28625957027840m
grid-job-9043352-nxkvl   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043353-j8656   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043354-mjbv5   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
```
It seems there may not be any solution identified for this yet: kubernetes/kubernetes#94445
Though it could be worked around by converting the disk request and limit from Gi to G. There would just need to be a constant conversion factor of 2^30/10^9 to multiply by in Harvester, if you think it is worth doing.
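A minimal sketch of that workaround (illustrative, not the actual Harvester change; it converts to decimal M rather than G for finer granularity):

```python
import math

GIB = 2**30   # bytes per Gi (binary)
MB = 10**6    # bytes per M (decimal)

def gib_to_decimal_m(amount_gib: float) -> str:
    """Re-express a binary Gi amount as a decimal M string, rounding up
    so the converted request/limit never shrinks."""
    return f"{math.ceil(amount_gib * GIB / MB)}M"

print(gib_to_decimal_m(7.81))  # '8386M' instead of the unreadable 8385923645440m
```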
OK, I converted the ephemeral storage to G. Now it shows up like this. I would prefer not to mix the notations, but it is still better than the `m`. We can still change it in the future if we think of something nicer.

```
Limits:
  cpu:                8
  ephemeral-storage:  191260M
  memory:             20000Mi
Requests:
  cpu:                7200m
  ephemeral-storage:  159380M
  memory:             16000Mi
```
Update: I'm setting `Mi` now. K8s doesn't convert those to `m`, and that way all values are consistent.

```
Limits:
  cpu:                8
  ephemeral-storage:  182400Mi
  memory:             20000Mi
Requests:
  cpu:                7200m
  ephemeral-storage:  152000Mi
  memory:             16000Mi
```
Very nice, thanks!