
Comments (6)

TomAugspurger avatar TomAugspurger commented on July 17, 2024

Does the idle_timeout configuration option do what you want?
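For reference, a minimal sketch of what setting this option can look like in the dask-gateway server's Python configuration file. The 600-second value is an arbitrary assumption, and the fallback that defines `c` is there only so the snippet runs outside the server, where `c` is normally provided by the configuration loader.

```python
# Sketch of a dask-gateway server config file (e.g. dask_gateway_config.py).
# When the server loads this file, `c` is already defined by its config loader.
# The fallback below is NOT part of a real config; it only lets the snippet
# run standalone for illustration.
try:
    c  # noqa: F821
except NameError:
    from types import SimpleNamespace

    c = SimpleNamespace(ClusterConfig=SimpleNamespace())

# Shut a cluster down after it has been idle for this many seconds.
# 600 is an arbitrary example value; 0 (the default) disables the timeout.
c.ClusterConfig.idle_timeout = 600
```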

from dask-gateway.

JoeJasinski avatar JoeJasinski commented on July 17, 2024

Thanks for posting that link. That looks like it might be what I'm looking for! Does it clean up the Custom Resources when it times out? I'll give it a shot. Thanks again.


JoeJasinski avatar JoeJasinski commented on July 17, 2024

I had a chance to try out idle_timeout and it works well. One thing I noticed is that when the idle_timeout expires, the cluster gets deleted, but the "daskcluster" custom resource still exists. I imagine those could accumulate over time if a lot of clusters were spinning up and down. Is there an easy way to detect the ones that aren't running and clean up the Custom Resources?

Also, I noticed that when the cluster does shut down while a Python session is still connected, the session prints the really confusing error message shown below. I was wondering if there is a way to make that fail more gracefully. I didn't realize this was caused by the timeout until I saw in the (user-inaccessible) logs that the cluster had been shut down.

root@dask-client:/src# python
Python 3.9.16 (main, Feb 11 2023, 02:49:26) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_gateway import Gateway
>>> 
>>> gateway = Gateway("http://traefik-dask-gateway:80")
>>> print(gateway.list_clusters())
[]
>>> cluster = gateway.new_cluster()
>>> client = cluster.get_client()
/usr/local/lib/python3.9/site-packages/distributed/client.py:1361: VersionMismatchWarning: Mismatched versions found

+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2023.2.1       | 2022.12.1      | None    |
| distributed | 2023.2.1       | 2022.12.1      | None    |
| python      | 3.9.16.final.0 | 3.11.1.final.0 | None    |
+-------------+----------------+----------------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
>>> 2023-02-26 05:45:35,766 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})
AttributeError: 'NoneType' object has no attribute 'send'
2023-02-26 05:45:40,766 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})
AttributeError: 'NoneType' object has no attribute 'send'
2023-02-26 05:45:45,767 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
    self.scheduler_comm.send({"op": "heartbeat-client"})

I created a dummy project to help me test it here. This is what I used in this example.
https://github.com/JoeJasinski/dask-gateway-testing
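One way a user could tell an idle shutdown apart from other failures is to ask the gateway whether the cluster is still listed. The following is only a sketch, not something dask-gateway does for you: `cluster_is_listed` and `explain_heartbeat_errors` are hypothetical helpers, and the sketch assumes that `gateway.list_clusters()` returns only active clusters by default and that each entry has a `name` attribute. A stand-in object replaces the real `Gateway` so the example runs without a deployment.

```python
from types import SimpleNamespace


def cluster_is_listed(gateway, cluster_name):
    """Return True if `gateway.list_clusters()` still reports `cluster_name`."""
    return any(report.name == cluster_name for report in gateway.list_clusters())


def explain_heartbeat_errors(gateway, cluster_name):
    """Turn the missing-cluster case into a readable message (hypothetical helper)."""
    if cluster_is_listed(gateway, cluster_name):
        return f"Cluster {cluster_name} is still listed by the gateway."
    return (
        f"Cluster {cluster_name} is no longer listed by the gateway; "
        "it may have been shut down, for example by idle_timeout."
    )


# Stand-in for a real Gateway so the sketch runs without a deployment.
# With dask-gateway this would be Gateway("http://traefik-dask-gateway:80").
fake_gateway = SimpleNamespace(
    list_clusters=lambda: [SimpleNamespace(name="example-cluster-a")]
)

print(explain_heartbeat_errors(fake_gateway, "example-cluster-a"))
print(explain_heartbeat_errors(fake_gateway, "example-cluster-b"))
```

In a real session, the name to check would come from the cluster object created earlier, e.g. `cluster.name`.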


jacobtomlinson avatar jacobtomlinson commented on July 17, 2024

This may be a duplicate of #255


consideRatio avatar consideRatio commented on July 17, 2024

One thing I noticed is that when the idle_timeout expires, the cluster gets deleted, but the "daskcluster" custom resource still exists.

The k8s DaskCluster resource enters a "Stopped" state.

apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
# ...
status:
  completionTime: "2023-10-25T11:43:39Z"
  credentials: dask-credentials-b3a990d302d84720aae27404f6153ade
  ingressroute: dask-b3a990d302d84720aae27404f6153ade
  ingressroutetcp: dask-b3a990d302d84720aae27404f6153ade
  phase: Stopped
  schedulerPod: dask-scheduler-b3a990d302d84720aae27404f6153ade
  service: dask-b3a990d302d84720aae27404f6153ade

The question could pivot to "should a stopped DaskCluster resource get cleaned up immediately, or after some time?".

This is similar to a k8s Job resource creating a Pod to do some work: afterwards, the Pod and the Job are left in a "Completed" state for a while. The Kubernetes documentation has a topic about that.

When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status.

CronJob, a k8s resource that creates Job resources, can clean up the Job resources it creates.

Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.

It appears that in k8s 1.23+ (now probably used by most k8s clusters), there is a controller that reads the k8s Job resource's ttlSecondsAfterFinished and cleans up finished Jobs accordingly. I think it could make sense for dask-gateway's controller for DaskCluster resources to respect a similar setting as well.
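To make the idea concrete, here is a rough sketch of the selection logic such a TTL-based cleanup might use. It is not how dask-gateway's controller works today: `expired_stopped_clusters` and the `ttl_seconds` parameter are hypothetical, the resource names are made up, and the resources are plain dicts shaped like the status shown above. Deleting the selected resources through the k8s API is left out.

```python
from datetime import datetime, timedelta, timezone


def expired_stopped_clusters(resources, ttl_seconds, now=None):
    """Return names of Stopped DaskCluster resources older than the TTL."""
    now = now or datetime.now(timezone.utc)
    expired = []
    for resource in resources:
        status = resource.get("status", {})
        if status.get("phase") != "Stopped":
            continue  # only stopped clusters are candidates for cleanup
        completion = status.get("completionTime")
        if not completion:
            continue  # no timestamp to compare against
        # k8s timestamps look like "2023-10-25T11:43:39Z" (UTC).
        finished = datetime.strptime(completion, "%Y-%m-%dT%H:%M:%SZ")
        finished = finished.replace(tzinfo=timezone.utc)
        if now - finished >= timedelta(seconds=ttl_seconds):
            expired.append(resource["metadata"]["name"])
    return expired


# Made-up resources for illustration; only the fields used above are included.
resources = [
    {"metadata": {"name": "example-old"},
     "status": {"phase": "Stopped", "completionTime": "2023-10-25T11:43:39Z"}},
    {"metadata": {"name": "example-recent"},
     "status": {"phase": "Stopped", "completionTime": "2023-10-25T12:30:00Z"}},
    {"metadata": {"name": "example-running"},
     "status": {"phase": "Running"}},
]
now = datetime(2023, 10, 25, 12, 43, 39, tzinfo=timezone.utc)
print(expired_stopped_clusters(resources, ttl_seconds=1800, now=now))
```

With a 30-minute TTL, only the cluster that stopped an hour before `now` is selected; the recently stopped and the running clusters are left alone.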


consideRatio avatar consideRatio commented on July 17, 2024

I opened #760 about the cleanup part, and I'm closing this issue as resolved by the idle_timeout configuration.

