Comments (12)
@ericl We did some analysis and noticed it's rather hard to start the monitor while keeping exactly the same pattern as in ray/core. I do think we need some changes to provide a smooth and pluggable experience. Let us add more details in the issue, and we can continue the discussion there.
I wrote a design doc fleshing out the above proposals a bit more:
https://docs.google.com/document/d/1I2CYu2-hTQUJ29wPonMvCZgEiRPs1-KeqT1mzrC6LXY
Please let us know about the direction and any suggestions or improvements you might have :)
It would be great to see support for in-tree autoscaling! Are there any API changes to the in-tree autoscaler or proto APIs that might make this easier to implement / maintain?
(I'm happy to work together on this issue)
Cc @DmitriGekhtman, who maintains the in-tree operator.
@Jeffwan could you say more about why having the autoscaler run in the head pod is preferable for the use-cases you are considering?
If I understand right, you'd also prefer the autoscaler to directly interact with K8s api server, rather than acting on a custom resource and delegating pod management to the operator.
Just curious if there are particular reasons this way of doing things works best for you, besides the fact that the Ray autoscaler is currently set up to favor this deployment strategy.
I guess "in-tree autoscaler" mostly means "monitor.py" from the main Ray project.
One way to make it work is to write a NodeProvider implementation whose "create node" and "terminate node" methods act on the scale fields of the RayCluster CR.
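To make that concrete, here is a minimal, untested sketch of such a NodeProvider. It assumes a single worker group, the `ray.io/v1alpha1` RayCluster CRD, and a `scaleStrategy.workersToDelete` field for scale-down (the CR field names are assumptions based on the v1alpha1 types, not a confirmed implementation). Both scaling methods only patch the CR, so the operator remains the sole owner of the pods:

```python
# Minimal, untested sketch: a Ray NodeProvider whose scaling actions only
# patch the RayCluster CR, leaving all pod management to the operator.
# Assumes a single worker group; the CR field names are assumptions.
from kubernetes import client, config
from ray.autoscaler.node_provider import NodeProvider

GROUP, VERSION, PLURAL = "ray.io", "v1alpha1", "rayclusters"

class RayClusterCRNodeProvider(NodeProvider):
    def __init__(self, provider_config, cluster_name):
        super().__init__(provider_config, cluster_name)
        config.load_incluster_config()  # assumes it runs inside the cluster
        self.api = client.CustomObjectsApi()
        self.namespace = provider_config["namespace"]

    def _get_cr(self):
        return self.api.get_namespaced_custom_object(
            GROUP, VERSION, self.namespace, PLURAL, self.cluster_name)

    def _update_cr(self, cr):
        self.api.patch_namespaced_custom_object(
            GROUP, VERSION, self.namespace, PLURAL, self.cluster_name, cr)

    def create_node(self, node_config, tags, count):
        # "Creating" nodes just bumps replicas; the operator reconciles the
        # CR and launches the worker pods.
        cr = self._get_cr()
        group = cr["spec"]["workerGroupSpecs"][0]
        group["replicas"] = group.get("replicas", 0) + count
        self._update_cr(cr)

    def terminate_node(self, node_id):
        # Scale-down is also expressed on the CR (via a hypothetical
        # scaleStrategy.workersToDelete list), so the operator stays the
        # single owner of pod deletion.
        cr = self._get_cr()
        group = cr["spec"]["workerGroupSpecs"][0]
        group["replicas"] = max(0, group.get("replicas", 0) - 1)
        group.setdefault("scaleStrategy", {}).setdefault(
            "workersToDelete", []).append(node_id)
        self._update_cr(cr)
```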
> @Jeffwan could you say more about why having the autoscaler run in the head pod is preferable for the use-cases you are considering?

@DmitriGekhtman I missed your last comment. We can scope the autoscaler at the cluster level, which matches our expectation. Since the autoscaler may gain different policies etc. in the future, this gives us enough flexibility to customize the autoscaler per cluster for different Ray versions. (We are not end users, and version upgrades take time, so it's common to have multiple Ray versions running at the same time in the cluster.)

> If I understand right, you'd also prefer the autoscaler to directly interact with K8s api server, rather than acting on a custom resource and delegating pod management to the operator.

I actually prefer to have the autoscaler update the Kubernetes CR, so there's always one owner of the pods and the responsibility is clear.
I guess "in-tree autoscaler" mostly means "monitor.py" from the main Ray project.
One way to make it work is to write a NodeProvider implementation whose "create node" and "terminate node" methods act on the scale fields of the RayCluster CR.
That's correct. We did some POC like below to verify the functionality but feel there're some upstream changes to make. Currently, we are not using autoscaling yet in our envs.
- CRD -> a config file autoscaler can recongnize
- operator converts CRD to config and create a ConfigMap and mount to head node
- head node start monitoring process and reads the config.
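A hedged sketch of that operator-side translation step: convert the RayCluster CR into a cluster-launcher-style autoscaler config and publish it as a ConfigMap for the head pod to mount. The CR field names follow the v1alpha1 types and the output keys follow Ray's autoscaler config schema, but the exact mapping here is illustrative, not KubeRay's actual code:

```python
# Illustrative sketch of the POC flow above: translate a RayCluster CR into
# an autoscaler config file and publish it as a ConfigMap. The field mapping
# is a simplified assumption, not KubeRay's actual implementation.
import yaml
from kubernetes import client

def cr_to_autoscaler_config(cr: dict) -> dict:
    groups = cr["spec"]["workerGroupSpecs"]
    return {
        "cluster_name": cr["metadata"]["name"],
        "max_workers": sum(g.get("maxReplicas", 0) for g in groups),
        "provider": {
            "type": "kubernetes",
            "namespace": cr["metadata"]["namespace"],
        },
        "available_node_types": {
            g["groupName"]: {
                "min_workers": g.get("minReplicas", 0),
                "max_workers": g.get("maxReplicas", 0),
                "node_config": g["template"],  # pod template passes through
                "resources": {},
            }
            for g in groups
        },
    }

def publish_autoscaler_config(cr: dict) -> None:
    # The head pod mounts this ConfigMap, and monitor.py reads the file.
    name = cr["metadata"]["name"] + "-autoscaler-config"
    body = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name=name),
        data={"ray_bootstrap_config.yaml":
              yaml.safe_dump(cr_to_autoscaler_config(cr))},
    )
    client.CoreV1Api().create_namespaced_config_map(
        cr["metadata"]["namespace"], body)
```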
All of this makes sense.
I think it might be advantageous to deploy the autoscaler as a separate deployment (scoped to a single Ray cluster). That gives more flexibility. Also, it's better for resource management -- we've observed the autoscaler using up a lot of memory under certain conditions.
Mounting a config map works. Another option is to have the autoscaler read the custom resource and do the translation to a suitable format itself, once per autoscaler iteration. This has the advantage that changes to the CR propagate faster to the autoscaler -- mounted config maps take a while to update.
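For concreteness, the CR-polling alternative could look something like the sketch below, reusing the hypothetical cr_to_autoscaler_config helper from the earlier comment: fetch and translate the CR at the top of each autoscaler iteration rather than waiting for a mounted ConfigMap to refresh.

```python
# Sketch of the CR-polling alternative: fetch and translate the RayCluster CR
# once per autoscaler iteration, so spec changes take effect immediately
# instead of waiting for kubelet to refresh a mounted ConfigMap.
# cr_to_autoscaler_config is the hypothetical helper sketched earlier.
from kubernetes import client, config

def fetch_autoscaler_config(name: str, namespace: str) -> dict:
    config.load_incluster_config()
    cr = client.CustomObjectsApi().get_namespaced_custom_object(
        "ray.io", "v1alpha1", namespace, "rayclusters", name)
    return cr_to_autoscaler_config(cr)
```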
ray-project/ray#21086
ray-project/ray#22348
Ray upstream already has the support. Under the current implementation, the KubeRay operator's work becomes easier: the operator should take action on this field to orchestrate the autoscaler, and the entire process should be transparent to users.
kuberay/ray-operator/api/raycluster/v1alpha1/raycluster_types.go, lines 21 to 22 at ffa7e60
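As a hedged illustration of that transparency: the field name enableInTreeAutoscaling below is my reading of the permalinked lines, and the reconcile actions are stand-ins, not KubeRay's actual logic. The point is that the user-facing surface is a single spec field, and the operator does the rest.

```python
# Hedged illustration of "transparent to users": the user sets one field on
# the RayCluster CR, and the operator orchestrates the autoscaler. The field
# name is an assumption about the permalinked lines; the actions are
# illustrative stand-ins, not KubeRay's actual reconcile logic.
def reconcile_autoscaling(cr: dict) -> list[str]:
    actions = []
    if cr["spec"].get("enableInTreeAutoscaling", False):
        actions.append("inject autoscaler container into the head pod")
        actions.append("create RBAC so the autoscaler can patch this CR")
    return actions

# Usage: the user only toggles one field in the RayCluster spec.
cr = {"spec": {"enableInTreeAutoscaling": True}}
print(reconcile_autoscaling(cr))
```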
That said, version management is still tricky. We should not support the autoscaler for earlier Ray versions.
Yep, I agree that we don't need to support the Ray autoscaler with earlier Ray versions.
The major implementation is done. Let's create separate issues to track future improvements.