Comments (6)
Does the idle_timeout configuration option do what you want?
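For reference, a minimal sketch of what setting this option might look like in a dask-gateway server configuration file. The file name and the 30-minute value are assumptions chosen for illustration; `c` is the config object that is made available when the gateway loads the file, so this fragment is not meant to be run on its own.

```python
# dask_gateway_config.py (fragment, not standalone)
# `c` is provided by the gateway's config loader when this file is read.

# Shut down clusters that have been idle for 30 minutes (value in seconds).
# The default of 0 disables the idle timeout.
c.ClusterConfig.idle_timeout = 1800
```

Backend-specific cluster config classes inherit this option, so the same line should apply on the Kubernetes backend as well; check the configuration reference for the deployment method in use.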
from dask-gateway.
Thanks for posting that link. That looks like it might be what I'm looking for! Does it clean up the Custom Resources when it times out? I'll give it a shot. Thanks again.
from dask-gateway.
I had a chance to try out idle_timeout and it works well. One thing I noticed is that when the idle_timeout expires, the cluster gets deleted, but the "daskcluster" custom resource still exists. I imagine those could accumulate over time if a lot of clusters were spinning up and down. Is there an easy way to detect the ones that aren't running and clean up the Custom Resources?
Also, I noticed that when the cluster does shut down while a Python session is still connected, the session prints a really confusing error message (shown below). Is there a way to make that fail more gracefully? I didn't realize this was a timeout until I saw in the (user-inaccessible) logs that the cluster had been shut down.
root@dask-client:/src# python
Python 3.9.16 (main, Feb 11 2023, 02:49:26)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_gateway import Gateway
>>>
>>> gateway = Gateway("http://traefik-dask-gateway:80")
>>> print(gateway.list_clusters())
[]
>>> cluster = gateway.new_cluster()
>>> client = cluster.get_client()
/usr/local/lib/python3.9/site-packages/distributed/client.py:1361: VersionMismatchWarning: Mismatched versions found
+-------------+----------------+----------------+---------+
| Package | Client | Scheduler | Workers |
+-------------+----------------+----------------+---------+
| dask | 2023.2.1 | 2022.12.1 | None |
| distributed | 2023.2.1 | 2022.12.1 | None |
| python | 3.9.16.final.0 | 3.11.1.final.0 | None |
+-------------+----------------+----------------+---------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
>>> 2023-02-26 05:45:35,766 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
val = self.callback()
File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
self.scheduler_comm.send({"op": "heartbeat-client"})
AttributeError: 'NoneType' object has no attribute 'send'
2023-02-26 05:45:40,766 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
val = self.callback()
File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
self.scheduler_comm.send({"op": "heartbeat-client"})
AttributeError: 'NoneType' object has no attribute 'send'
2023-02-26 05:45:45,767 - tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tls://10.244.0.12:8786' processes=0 threads=0, memory=0 B>>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
val = self.callback()
File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1445, in _heartbeat
self.scheduler_comm.send({"op": "heartbeat-client"})
I created a dummy project to help me test it here. This is what I used in this example.
https://github.com/JoeJasinski/dask-gateway-testing
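As a client-side stopgap for the repeated heartbeat tracebacks above, one option is a standard-library logging filter on the `tornado.application` logger (the logger name shown in the output). This is only a sketch: the class name `HeartbeatNoiseFilter` and the replacement message are made up for illustration, it is not part of dask-gateway or distributed, and it hides the symptom rather than fixing the underlying behavior.

```python
import logging


class HeartbeatNoiseFilter(logging.Filter):
    """Hypothetical helper: turn repeated Client._heartbeat errors into one hint."""

    def __init__(self):
        super().__init__()
        self.warned = False

    def filter(self, record):
        # Leave every unrelated log record untouched.
        if "Client._heartbeat" not in record.getMessage():
            return True
        # After the first occurrence, drop the repeats entirely.
        if self.warned:
            return False
        self.warned = True
        # Rewrite the first occurrence into a readable message without a traceback.
        record.msg = (
            "Lost connection to the Dask scheduler; the cluster may have been "
            "shut down (for example by idle_timeout)."
        )
        record.args = ()
        record.exc_info = None
        record.exc_text = None
        return True


# Attach the filter to the logger that emitted the errors in the session above.
logging.getLogger("tornado.application").addFilter(HeartbeatNoiseFilter())
```

Installing this before creating the client should reduce the output to a single line the first time the heartbeat fails.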
from dask-gateway.
This may be a duplicate of #255
from dask-gateway.
One thing I noticed is that when the idle_timeout expires, the cluster gets deleted, but the "daskcluster" custom resource still exists.
The k8s DaskCluster resource enters a "Stopped" state.
apiVersion: gateway.dask.org/v1alpha1
kind: DaskCluster
# ...
status:
  completionTime: "2023-10-25T11:43:39Z"
  credentials: dask-credentials-b3a990d302d84720aae27404f6153ade
  ingressroute: dask-b3a990d302d84720aae27404f6153ade
  ingressroutetcp: dask-b3a990d302d84720aae27404f6153ade
  phase: Stopped
  schedulerPod: dask-scheduler-b3a990d302d84720aae27404f6153ade
  service: dask-b3a990d302d84720aae27404f6153ade
The question could then pivot to "should stopped DaskCluster resources get cleaned up immediately, or after some time?".
This is similar to a k8s Job resource creating a Pod to do some work: afterwards, the Pod and the Job are left in a "Completed" state for a while. The k8s documentation has a section about that.
When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is completed so that you can view its status. It is up to the user to delete old jobs after noting their status.
CronJob, a k8s resource that creates Job resources, can clean up the Job resources it creates.
Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
It appears that in k8s 1.23+ (now probably used by most k8s clusters), there is a controller that reads the k8s Job resource's ttlSecondsAfterFinished field. I think it could make sense for the dask-gateway k8s resource controller to respect a similar configuration as well.
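To make the idea concrete, here is a rough sketch of the selection step such a TTL-based cleanup could perform. It is not how the dask-gateway controller works today. It operates on plain dictionaries shaped like the DaskCluster status shown above (`phase` and `completionTime`), and the function names and the `ttl_seconds` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a UTC timestamp such as "2023-10-25T11:43:39Z"."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def expired_clusters(clusters, ttl_seconds, now=None):
    """Return names of Stopped DaskCluster objects finished at least ttl_seconds ago."""
    now = now or datetime.now(timezone.utc)
    expired = []
    for obj in clusters:
        status = obj.get("status", {})
        completed = status.get("completionTime")
        # Only consider resources that have stopped and recorded a completion time.
        if status.get("phase") != "Stopped" or not completed:
            continue
        if now - parse_k8s_time(completed) >= timedelta(seconds=ttl_seconds):
            expired.append(obj["metadata"]["name"])
    return expired


# Example input mirroring the status fields from the resource above.
sample = [
    {
        "metadata": {"name": "b3a990d302d84720aae27404f6153ade"},
        "status": {"phase": "Stopped", "completionTime": "2023-10-25T11:43:39Z"},
    },
    {"metadata": {"name": "still-running"}, "status": {"phase": "Running"}},
]
one_hour_later = datetime(2023, 10, 25, 12, 43, 39, tzinfo=timezone.utc)
print(expired_clusters(sample, ttl_seconds=3600, now=one_hour_later))
```

The names returned here could then be passed to whatever deletion mechanism is preferred, for example `kubectl delete` or a Kubernetes API client; that part is left out because it depends on the deployment.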
from dask-gateway.
I opened #760 about the cleanup part, and I'm closing this issue as resolved by the idle_timeout configuration.
from dask-gateway.