
Comments (4)

udeet27 commented on July 17, 2024

Hi, I see here that a message is already displayed when idle_timeout is exceeded. Do I need to implement something similar in some other file? Any guidance would be much appreciated, as I'm not deeply familiar with the codebase.

from dask-gateway.

consideRatio commented on July 17, 2024

I no longer see an action point that seems reasonable to go for in this issue. The only option would be to provide a "reason" and propagate that from the scheduler, but that may be a bit too complicated and require touching a lot of things - so I don't think it's worth doing.

I'll go for a close on this issue @udeet27, THANK YOU for initiating an investigation!! I'm sorry it was an issue that didn't turn out resolvable =/


udeet27 commented on July 17, 2024

Ohh wow. It's a lot more complicated than I initially anticipated. Thanks for the detailed explanation. I'll look into the other issues and see if I can contribute to them.


consideRatio commented on July 17, 2024

@udeet27 I don't have a good overview of the code base either, so I had to dig in myself to help. Doing so left me uncertain what to do, because this can't be fixed easily. In brief, there are three parts: the controller, the dask-gateway-server, and the dask-scheduler. The idle_timeout is logged by the scheduler, which then asks the dask-gateway-server to shut the cluster down, and the dask-gateway-server in turn makes the controller do the job. But no information is passed from the scheduler about why the cluster is to be terminated, so there is no way for the dask-gateway-server to convey that to the controller either.


Looking in this search I found this:

def get_scheduler_command(self, namespace, cluster_name, config):
    return config.scheduler_cmd + [
        "--protocol",
        "tls",
        "--host",
        "",
        "--port",
        "8786",
        "--dashboard-address",
        ":8787",
        "--dg-api-address",
        ":8788",
        "--preload",
        "dask_gateway.scheduler_preload",
        "--dg-heartbeat-period",
        "0",
        "--dg-adaptive-period",
        str(config.adaptive_period),
        "--dg-idle-timeout",
        str(config.idle_timeout),
    ]
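
To make it concrete what the scheduler ends up being started with, here is a minimal standalone sketch. FakeClusterConfig and its values are made up for illustration and are not dask-gateway's real config class; the function is the one quoted above with `self` dropped so it runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeClusterConfig:
    # Hypothetical stand-in for the real cluster config; all values are made up.
    scheduler_cmd: list = field(default_factory=lambda: ["dask-scheduler"])
    adaptive_period: float = 3.0
    idle_timeout: float = 1800.0


def get_scheduler_command(namespace, cluster_name, config):
    # Same argument list as the quoted method, written as a plain function.
    return config.scheduler_cmd + [
        "--protocol", "tls",
        "--host", "",
        "--port", "8786",
        "--dashboard-address", ":8787",
        "--dg-api-address", ":8788",
        "--preload", "dask_gateway.scheduler_preload",
        "--dg-heartbeat-period", "0",
        "--dg-adaptive-period", str(config.adaptive_period),
        "--dg-idle-timeout", str(config.idle_timeout),
    ]


cmd = get_scheduler_command("default", "example", FakeClusterConfig())

# The idle timeout reaches the scheduler only as a command-line value.
idle_value = cmd[cmd.index("--dg-idle-timeout") + 1]
print(idle_value)
```

The point is just that idle_timeout is handed to the scheduler process as a flag, so the scheduler is the component that knows when the timeout has been exceeded.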

Okay hmm, it seems that this is how things work:

If a dask cluster is created, it's the dask cluster's scheduler that is responsible for shutting down the cluster. So the scheduler logs that it is terminating the cluster it's part of, and the overall flow looks like this:


  1. A dask-gateway client somewhere asks the dask-gateway server to start a DaskCluster.
  2. A dask cluster is created using a KubeBackend, which creates a k8s DaskCluster resource that is managed by a "controller" watching DaskCluster resources.
  3. The controller sees the DaskCluster resource and creates a dask cluster scheduler.
  4. The dask cluster scheduler monitors its own activity, and asks the dask-gateway server to terminate the cluster it is managing once it has idled for too long. When it does, it doesn't pass a reason or similar for terminating.
  5. The dask-gateway server receives the request to terminate the cluster, but doesn't know it's due to inactivity. The dask-gateway server makes the KubeBackend terminate the cluster, which it does by updating the DaskCluster k8s resource to "Stopped", I think.
  6. The controller sees the status update, and shuts down the scheduler and workers for the dask cluster.
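
To illustrate where the reason gets lost, here is a toy model of steps 4 to 6. All class and method names below are hypothetical and only mimic the flow described above; none of this is dask-gateway's actual code or API.

```python
class Controller:
    """Toy controller: reacts to DaskCluster resources being marked Stopped."""

    def __init__(self):
        self.log = []

    def on_status_change(self, resource):
        if resource["status"] == "Stopped":
            # The resource has no "reason" field, so all we can log is "unknown".
            reason = resource.get("reason", "unknown")
            self.log.append(f"shutting down {resource['name']} (reason: {reason})")


class GatewayServer:
    """Toy gateway server: keeps resources and forwards stop requests."""

    def __init__(self, controller):
        self.controller = controller
        self.resources = {}

    def start_cluster(self, name):
        self.resources[name] = {"name": name, "status": "Running"}

    def stop_cluster(self, name):
        # The stop request carries only the cluster name, never a reason.
        resource = self.resources[name]
        resource["status"] = "Stopped"
        self.controller.on_status_change(resource)


class Scheduler:
    """Toy scheduler: the only component that knows about the idle timeout."""

    def __init__(self, name, gateway, idle_timeout):
        self.name = name
        self.gateway = gateway
        self.idle_timeout = idle_timeout
        self.log = []

    def check_idle(self, idle_seconds):
        if idle_seconds > self.idle_timeout:
            # The reason is logged here, and only here.
            self.log.append("idle timeout exceeded, shutting down cluster")
            self.gateway.stop_cluster(self.name)


controller = Controller()
gateway = GatewayServer(controller)
gateway.start_cluster("demo")
scheduler = Scheduler("demo", gateway, idle_timeout=60)
scheduler.check_idle(idle_seconds=120)

print(scheduler.log)
print(controller.log)
```

In this toy version the scheduler's log mentions the idle timeout while the controller's log can only say the reason is unknown. Fixing that would mean adding a reason to the stop request, to the resource, and to every hop in between, which is the "touching a lot of things" part.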
