Comments (11)
Hi Ryan, this is deployed now. Try it out and let me know if you need any adjustments:
- `k8s.resources.use_ephemeral_storage_resource_specs`
- `k8s.resources.ephemeral_storage_offset`
- `k8s.resources.limits.ephemeral_storage_limit_safety_factor`
Note from Rod:

> maxwdir, divided by the corecount for SCORE, is the space for input+output+workdirsize. Input is known per job, but output and workdirsize are measured from the scouts. Then it can vary between jobs, e.g. the high-lumi part of a run has more output, or more debug in stdout. So it can surely exceed the maxwdir, and it is usually harmless where the disk is shared on the node. There are some safety limits applied by the pilot to stop it filling the whole disk, e.g. 2GB stdout, 2*maxwdir total(?).
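A quick worked example of that budgeting, with made-up numbers (my reading of the note, not values from any real queue):

```python
# Illustrative numbers only: how maxwdir is prorated per core.
maxwdir_mb = 80000                      # hypothetical maxwdir for the queue (MB)
corecount = 8                           # cores the budget is spread over

per_core_mb = maxwdir_mb / corecount    # 10000 MB for a single-core (SCORE) job
mcore_job_mb = per_core_mb * 8          # an 8-core job gets the full 80000 MB
```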
@fbarreir
Rather than setting only a `sizeLimit` on the emptyDir, we should apply a holistic ephemeral storage request and limit for the whole pod, with the benefit of allowing flexible disk (over)use with moderate protections.
So I suggest a similar approach for disk as #116, while we're at it.
I suggest two new properties in CRIC, perhaps something like `k8s.resources.ephemeral_storage.base_amount` and `k8s.resources.ephemeral_storage.safety_factor`.
The base amount is for k8s-specific storage requirements (e.g. the size of the pod's container image) or other per-job overhead that is independent of the number of cores and not accounted for in maxwdir (though it may only be a few GB in practice). The safety factor is by how much the limit may exceed the request.
The YAML will look a bit different, e.g.

```yaml
resources:
  requests:
    ephemeral-storage: "100Gi"
  limits:
    ephemeral-storage: "150Gi"
```
where:
- the request is equal to `base_amount + X`
- the limit is equal to `(base_amount + X) * safety_factor`
- `X` is the number Harvester currently uses for the emptyDir volume, based on prorating maxwdir by the number of cores (see the sketch below)
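A minimal sketch of that calculation (the function and names are just illustration, not actual Harvester code):

```python
def ephemeral_storage_specs(base_amount_mb, safety_factor, maxwdir_mb, corecount, job_cores):
    """Sketch of the suggested request/limit formula; all sizes in MB,
    safety_factor as a plain multiplier (e.g. 1.2)."""
    x_mb = maxwdir_mb / corecount * job_cores   # X: maxwdir prorated by cores
    request_mb = base_amount_mb + x_mb          # request = base_amount + X
    limit_mb = request_mb * safety_factor       # limit = (base_amount + X) * safety_factor
    return request_mb, limit_mb

# e.g. 2000 MB base, factor 1.2, maxwdir 80000 MB over 8 cores, 8-core job:
print(ephemeral_storage_specs(2000, 1.2, 80000, 8, 8))  # (82000.0, 98400.0)
```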
OK, this is actually better than the current emptyDir implementation and would be good to apply to all queues.
I wrote the code for this feature (not tested or deployed yet). I'm not fully clear about the formula, so for the moment I've just followed your suggestion.
The fields in CRIC would be:
- `k8s.resources.use_ephemeral_storage_resources`: toggle to turn limits AND requests on/off. Eventually this feature should be used for all queues, but for the transition I prefer to enable queues manually until I'm confident.
- `k8s.resources.ephemeral_storage_offset`: what you called base_amount. Applies to both limits and requests. In MB.
- `k8s.resources.limits.ephemeral_storage_limit_safety_factor`: the safety factor, applied only to limits. In %.
Agree on the above?
How do you want to transition your queues? Are you going to keep the `pilot-dir` emptyDir or reconfigure those settings? If you keep it, I understand I would have to remove the code that sets the `sizeLimit`?
* `k8s.resources.use_ephemeral_storage_resources`: technically we are using ephemeral storage no matter what. Maybe `k8s.resources.use_ephemeral_storage_quota` would be better?

Otherwise, yes, that sounds good.
Note that units are important and should be applied consistently when translating from CRIC/Harvester to k8s.
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory
In Kubernetes, for both memory and storage, M means MB (10^6 bytes) and Mi means MiB (2^20 bytes), and similarly for G and Gi. The difference between a given amount in G(B) and in Gi(B) can be significant for large amounts.
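To get a feel for the gap, a quick check (plain Python, nothing Harvester-specific):

```python
G, Gi = 10**9, 2**30
print(Gi / G)              # 1.073741824 -> Gi is ~7.4% larger than G
print(100 * Gi - 100 * G)  # 7374182400 bytes (~7.4 GB) off at the 100G scale
```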
Before transitioning, the emptyDir volume should look like this (NB the `{}`):

```yaml
volumes:
  - name: pilot-dir
    emptyDir: {}
```
https://kubernetes.io/docs/concepts/storage/volumes/#emptydir-configuration-example
We still need the emptyDir volume to provide the ephemeral storage; it just won't have a `sizeLimit` on it anymore. Then we add the ephemeral-storage request and limit to the `resources` section of the container in the pod.
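Putting the two pieces together, the relevant parts of the pod spec would look roughly like this (a sketch in plain dict form; the field layout follows the k8s pod spec, and the names and amounts are just the examples from above):

```python
# Sketch only: the emptyDir keeps no sizeLimit; the quota moves to the
# container's resources section instead.
pod_spec = {
    'containers': [{
        'name': 'pilot',
        'resources': {
            'requests': {'ephemeral-storage': '100Gi'},
            'limits': {'ephemeral-storage': '150Gi'},
        },
    }],
    'volumes': [
        {'name': 'pilot-dir', 'emptyDir': {}},  # no sizeLimit anymore
    ],
}
```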
Sorry, I was derailed by other time-sensitive stuff.

- I don't like the word quota in `k8s.resources.use_ephemeral_storage_quota`. Is `k8s.resources.use_ephemeral_storage_resource_specs` OK?
- I reviewed the usage of binary (MiB, GiB) vs decimal (MB, GB) notation:
  - Generally PanDA uses the binary notation, so we need to stick to that.
  - The memory resources were used wrongly in the k8s plugins. I changed them to binary notation. This will increase the requests/limits by a couple of percent; do you see any risk for your production queue?
  - I implemented the disk resources in binary notation. Interestingly, I submit [1] and it appears as [2] in `kubectl describe pod`.

I'm ready to deploy, but I'd prefer your green light regarding the memory-increase side effect. Also, the `pilot-dir` `sizeLimit` will disappear once I deploy (the volume stays defined, as you said).
[1]

```
'resources': {'requests': {... 'ephemeral-storage': '7.1Gi'}, 'limits': {... 'ephemeral-storage': '7.81Gi'}}
```

[2]

```
Limits:
  ...
  ephemeral-storage:  8385923645440m
Requests:
  ...
  ephemeral-storage:  7623566950400m
```
Sure, `k8s.resources.use_ephemeral_storage_resource_specs` sounds okay.

There's no problem with changing the memory units; the k8s scheduler will make sure the requested resources are available either way. I can adjust the limits if needed.

Those limits and requests are correct, though it is funny that they get converted to "millibytes". That is probably because those decimal numbers multiplied by 2^30 still have fractional parts; presumably they get rounded to whole bytes at some point afterward.
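The arithmetic checks out exactly; a quick integer-math check (avoiding float rounding):

```python
Gi = 2**30
# 7.81 Gi in millibytes is 7.81 * 2^30 * 1000, computed here as integers:
print(781 * Gi * 10)   # 8385923645440 -> the limit kubectl shows
print(71 * Gi * 100)   # 7623566950400 -> the request (7.1 Gi)
```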
Thanks @fbarreir, it is working!

However, the amounts are indeed difficult to read in this form, e.g.

```
$ kubectl get pods -n harvester -o custom-columns=NAME:.metadata.name,CPU_R:.spec.containers[].resources.requests.cpu,CPU_L:.spec.containers[].resources.limits.cpu,MEM_R:.spec.containers[].resources.requests.memory,MEM_L:.spec.containers[].resources.limits.memory,DISK_R:.spec.containers[].resources.requests.ephemeral-storage,DISK_L:.spec.containers[].resources.limits.ephemeral-storage
NAME                     CPU_R   CPU_L   MEM_R     MEM_L     DISK_R             DISK_L
grid-job-9043323-nxzgw   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043324-6wnvx   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043325-znkdw   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043344-9k869   900m    1       2000Mi    4000Mi    23858543329280m    28625957027840m
grid-job-9043350-55bq2   900m    1       2000Mi    4000Mi    23858543329280m    28625957027840m
grid-job-9043351-klq6c   900m    1       2000Mi    4000Mi    23858543329280m    28625957027840m
grid-job-9043352-nxkvl   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043353-j8656   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
grid-job-9043354-mjbv5   7200m   8       16000Mi   20000Mi   161480032911360m   193778186977280m
```
It seems there may not be any solution identified for this yet: kubernetes/kubernetes#94445
Though it could be worked around by converting the disk request and limit from Gi to G. There would just need to be a constant conversion factor of 2^30/10^9 to multiply by in Harvester, if you think it is worth doing.
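A minimal sketch of that workaround (illustrative, not the actual Harvester change; it converts to decimal M rather than G for finer granularity):

```python
import math

GIB = 2**30   # bytes per Gi (binary)
MB = 10**6    # bytes per M (decimal)

def gib_to_decimal_m(amount_gib: float) -> str:
    """Re-express a binary Gi amount as a decimal M string, rounding up
    so the converted request/limit never shrinks."""
    return f"{math.ceil(amount_gib * GIB / MB)}M"

print(gib_to_decimal_m(7.81))  # '8386M' instead of the unreadable 8385923645440m
```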
OK, I converted the ephemeral storage to G. Now it shows up like this. I would prefer not to mix the notations, but it is still better than the `m`. We can still change it in the future if we think of something nicer.

```
Limits:
  cpu:                8
  ephemeral-storage:  191260M
  memory:             20000Mi
Requests:
  cpu:                7200m
  ephemeral-storage:  159380M
  memory:             16000Mi
```
Update: I'm setting `Mi` now. K8s doesn't convert those to `m`, and that way all values are consistent.

```
Limits:
  cpu:                8
  ephemeral-storage:  182400Mi
  memory:             20000Mi
Requests:
  cpu:                7200m
  ephemeral-storage:  152000Mi
  memory:             16000Mi
```
Very nice, thanks!