
kueue's Issues

Need to improve the readability of the log

1.6451684909657109e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
1.6451684909663508e+09	INFO	setup	starting manager
1.6451684909665146e+09	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
1.645168490966593e+09	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0218 07:14:51.066639       1 leaderelection.go:248] attempting to acquire leader lease kueue-system/c1f6bfd2.gke-internal.googlesource.com...
I0218 07:15:07.705977       1 leaderelection.go:258] successfully acquired lease kueue-system/c1f6bfd2.gke-internal.googlesource.com
1.6451685077060497e+09	DEBUG	events	Normal	{"object": {"kind":"ConfigMap","namespace":"kueue-system","name":"c1f6bfd2.gke-internal.googlesource.com","uid":"e70e4b9b-54f4-4782-a904-e57d3001c8e6","apiVersion":"v1","resourceVersion":"264201"}, "reason": "LeaderElection", "message": "kueue-controller-manager-7ff7b759bf-nszmb_05445f7f-a871-4a4c-83c1-af075b850e49 became leader"}
1.6451685077061899e+09	DEBUG	events	Normal	{"object": {"kind":"Lease","namespace":"kueue-system","name":"c1f6bfd2.gke-internal.googlesource.com","uid":"72b48bf0-20e0-42a4-823b-2a6edcb3288a","apiVersion":"coordination.k8s.io/v1","resourceVersion":"264202"}, "reason": "LeaderElection", "message": "kueue-controller-manager-7ff7b759bf-nszmb_05445f7f-a871-4a4c-83c1-af075b850e49 became leader"}
1.6451685077062488e+09	INFO	controller.queue	Starting EventSource	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue", "source": "kind source: *v1alpha1.Queue"}
1.645168507706281e+09	INFO	controller.queue	Starting Controller	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue"}
1.6451685077062566e+09	INFO	controller.queuedworkload	Starting EventSource	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload", "source": "kind source: *v1alpha1.QueuedWorkload"}
1.6451685077063015e+09	INFO	controller.queuedworkload	Starting Controller	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload"}
1.6451685077062776e+09	INFO	controller.capacity	Starting EventSource	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity", "source": "kind source: *v1alpha1.Capacity"}
1.6451685077063189e+09	INFO	controller.capacity	Starting Controller	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity"}
1.6451685077064047e+09	INFO	controller.job	Starting EventSource	{"reconciler group": "batch", "reconciler kind": "Job", "source": "kind source: *v1.Job"}
1.6451685077064307e+09	INFO	controller.job	Starting EventSource	{"reconciler group": "batch", "reconciler kind": "Job", "source": "kind source: *v1alpha1.QueuedWorkload"}
1.6451685077064393e+09	INFO	controller.job	Starting Controller	{"reconciler group": "batch", "reconciler kind": "Job"}
1.6451685078075259e+09	INFO	controller.queuedworkload	Starting workers	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload", "worker count": 1}
1.6451685078075113e+09	INFO	controller.capacity	Starting workers	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity", "worker count": 1}
1.645168507807566e+09	INFO	controller.queue	Starting workers	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue", "worker count": 1}
1.6451685078076618e+09	INFO	controller.job	Starting workers	{"reconciler group": "batch", "reconciler kind": "Job", "worker count": 1}
1.645168507807886e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"ingress-nginx-admission-create","namespace":"kube-system"}}
1.645168507808418e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"ingress-nginx-admission-patch","namespace":"kube-system"}}
1.6451685078085716e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"kube-eventer-init-v1.6-a92aba6-aliyun","namespace":"kube-system"}}
1.6451706903900485e+09	LEVEL(-2)	capacity-reconciler	Capacity create event	{"capacity": {"name":"cluster-total"}}
1.6451706904384277e+09	LEVEL(-2)	queue-reconciler	Queue create event	{"queue": {"name":"main","namespace":"default"}}
1.6451707150770907e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"sample-job-jjbq2","namespace":"default"}}
1.6451707150895817e+09	LEVEL(-2)	queued-workload-reconciler	QueuedWorkload create event	{"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "queue": "main", "status": "pending"}
1.645170715089716e+09	LEVEL(-2)	scheduler	Workload assumed in the cache	{"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "capacity": "cluster-total"}
1.6451707150901928e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"sample-job-jjbq2","namespace":"default"}}
1.6451707150984285e+09	LEVEL(-2)	scheduler	Successfully assigned capacity and resource flavors to workload	{"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "capacity": "cluster-total"}
1.6451707150985863e+09	LEVEL(-2)	queued-workload-reconciler	QueuedWorkload update event	{"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "queue": "main", "capacity": "cluster-total", "status": "assigned", "prevStatus": "pending", "prevCapacity": ""}
1.6451707150986767e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"sample-job-jjbq2","namespace":"default"}}

We can choose to switch to klog/v2.
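Alternatively, if we keep controller-runtime's zap logger, the epoch-float timestamps already become readable in development mode. A minimal sketch (not the final configuration we'd ship):

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Development mode switches to a console encoder with human-readable
	// ISO 8601 timestamps instead of epoch floats.
	ctrl.SetLogger(zap.New(zap.UseDevMode(true)))

	ctrl.Log.WithName("setup").Info("starting manager")
}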

Enhance Makefile arguments for img building and pushing

The current Makefile doesn't provide a flexible way to customize how the image is built and pushed. Something like the following set of variables would help:

VERSION := $(shell git describe --tags --dirty --always)
# Image URL to use all building/pushing image targets
IMAGE_BUILD_CMD ?= docker build
IMAGE_PUSH_CMD ?= docker push
IMAGE_BUILD_EXTRA_OPTS ?=
IMAGE_REGISTRY ?= k8s.gcr.io/kueue
IMAGE_NAME := controller
IMAGE_TAG_NAME ?= $(VERSION)
IMAGE_EXTRA_TAG_NAMES ?=
IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME)
IMAGE_TAG ?= $(IMAGE_REPO):$(IMAGE_TAG_NAME)
BASE_IMAGE_FULL ?= golang:1.17

Also, in order to be more generic, rename:

  • docker-image to simply image or image-build
  • docker-push to simply push or image-push

This provides more flexibility when developing in a non-docker environment, e.g. with buildah or podman, or even when building the image with a CI tool on Kubernetes itself.
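For example, assuming the renamed targets honor these variables (a hypothetical invocation, not an existing target), a developer without docker could run: make image-build image-push IMAGE_BUILD_CMD="podman build" IMAGE_PUSH_CMD="podman push" IMAGE_REGISTRY=example.com/kueue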

/kind feature

Add info to Queue status

Suggestions:

  • Number of pending jobs
  • Number of started jobs
  • Resources currently used by the queue.
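
A rough sketch of how these could surface in the status; field names are illustrative, not a final API:

package v1alpha1

import corev1 "k8s.io/api/core/v1"

// QueueStatusSketch illustrates the proposed additions; not the actual API.
type QueueStatusSketch struct {
	// PendingWorkloads is the number of workloads in the queue waiting to be admitted.
	PendingWorkloads int32 `json:"pendingWorkloads"`
	// RunningWorkloads is the number of workloads from this queue that were admitted and started.
	RunningWorkloads int32 `json:"runningWorkloads"`
	// UsedResources is the total quota currently consumed by this queue's admitted workloads.
	UsedResources corev1.ResourceList `json:"usedResources"`
}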

/kind feature

Validating that flavors of a resource are different

What if we validate that the flavors of a resource in a capacity have at least one common label key with different values?

This practically forces each flavor to point to a different set of nodes.
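A minimal sketch of the check, assuming the flavors' labels are available as plain maps (the webhook wiring around it is omitted):

package validation

// flavorsDistinguishable reports whether two flavors share at least one
// label key with different values, i.e. they cannot select the same nodes.
func flavorsDistinguishable(a, b map[string]string) bool {
	for key, valA := range a {
		if valB, ok := b[key]; ok && valB != valA {
			return true
		}
	}
	return false
}

The validating webhook would then reject a capacity in which some pair of flavors of the same resource is not distinguishable.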

Consider a different image for testing/samples

I am running into:

  Warning  Failed     11s   kubelet            Failed to pull image "perl": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/perl: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

We might want to consider an image from a different registry to avoid this unfortunate error.

/kind test

Make unit tests run at least 3 times

We should not allow any flakiness in our unit tests. The prow job should run the tests with -race -count 3

Leave the option in the Makefile to run the tests only once (by default), as it's likely useful during development.

/priority important-soon
/kind cleanup

Match workload affinity with capacity labels

During workload scheduling, a workload's node affinities and selectors should be matched against the labels of the resource flavors. This allows a workload to specify which exact flavors to use, or even force a different evaluation order of the flavors than that defined by the capacity.
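A sketch of one possible way to do the matching: treat the flavor's labels as if they were node labels and reuse the upstream node-affinity helpers (the function and its caller are illustrative, not existing kueue code):

package scheduler

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

// flavorMatchesPodSpec reports whether a pod spec's node selector and
// required node affinity are compatible with a flavor's labels.
func flavorMatchesPodSpec(spec *corev1.PodSpec, flavorLabels map[string]string) (bool, error) {
	affinity := nodeaffinity.GetRequiredNodeAffinity(&corev1.Pod{Spec: *spec})
	// Pretend the flavor is a node carrying exactly its labels.
	return affinity.Match(&corev1.Node{ObjectMeta: metav1.ObjectMeta{Labels: flavorLabels}})
}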

/kind feature

Add scheduler integration tests

We have one integration test that covers the job-controller on its own; we need a test that covers all the other controllers together: it should create a queue, a capacity and multiple jobs, and verify that jobs are started as expected.

Match workload tolerations with capacity taints

During workload scheduling, a workload's tolerations should be matched against the taints of the resource flavors. This allows a workload to opt-in to specific flavors.
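A minimal sketch of the check, using the ToleratesTaint helper from k8s.io/api (the call site in the scheduler is assumed):

package scheduler

import corev1 "k8s.io/api/core/v1"

// taintsTolerated reports whether every taint of a flavor is tolerated by
// the workload's tolerations; only then is the flavor eligible.
func taintsTolerated(taints []corev1.Taint, tolerations []corev1.Toleration) bool {
	for i := range taints {
		tolerated := false
		for j := range tolerations {
			if tolerations[j].ToleratesTaint(&taints[i]) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}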

/kind feature
/priority important-soon

Add events that track a workload's status

Two possible locations to issue events:

  • when it is assigned a capacity in the scheduling loop.
  • in the job-controller when a corresponding workload is created.
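
A sketch of what the scheduling-loop event could look like, using a standard client-go EventRecorder (the recorder wiring and reason strings are illustrative):

package scheduler

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// recordAssigned emits a Normal event on the workload object once the
// scheduler assigns it to a capacity.
func recordAssigned(recorder record.EventRecorder, wl runtime.Object, capacity string) {
	recorder.Eventf(wl, corev1.EventTypeNormal, "Assigned",
		"Assigned to capacity %s", capacity)
}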

/kind feature

[Umbrella] ☂️ Requirements for release 0.1.0

Deadline: May 16th (KubeCon EU)

Issues that we need to complete to consider kueue ready for a first release:

  • Match workload affinities with flavors #3
  • Single heap per Capacity #87
  • Consistent flavors in a cohort #59
  • Queue status #5
  • Capacity status #7
  • Event for unschedulable workloads #91
  • Capacity namespace selector #4
  • Efficient requeuing #8
  • User guide #64
  • Publish image #52

Nice to have:

  • Add borrowing weight #62
  • E2E test #61
  • Use kueue.sigs.k8s.io API group #23
  • Support for one custom job #65

Support for workload preemption

Preemption can be useful to reclaim borrowed capacity; however, the obvious tradeoff is interrupting workloads and potentially losing significant progress.

There are two high-level design decisions we need to make, along with whether they should be tunable:

  1. What triggers preemption? Reclaiming borrowed capacity? Workload priority?
  2. What is the scope? Is preemption a cohort knob? A capacity knob? A queue knob?

Support Argo/Tekton workflows

This is lower priority than #65, but it would be good to have an integration with a workflow framework.

Argo supports the suspend flag; the tricky part is that suspend applies to the whole workflow, meaning a QueuedWorkload would need to represent the resources of the whole workflow all at once.

Ideally, Argo would create jobs per sequential step, so that resource reservation happens one step at a time.

Flavors with matching names should have identical labels/taints

A capacity can borrow resources from flavors in its cohort whose names match the ones defined in the capacity. Flavors with matching names should also have identical labels and taints.

One solution is to define a cluster-scoped object API that represents resource flavors that capacities refer to by name when setting a quota. It would look like this:

type ResourceFlavorSpec struct {
	// The object name serves as the flavor name, e.g., nvidia-tesla-k80.

	// Resource is the resource name, e.g., nvidia.com/gpus.
	Resource v1.ResourceName

	// Labels associated with this flavor. These labels are matched against or
	// converted to node affinity constraints on the workload's pods.
	// For example, cloud.provider.com/accelerator: nvidia-tesla-k80.
	Labels map[string]string

	// Taints associated with this flavor that workloads must explicitly
	// "tolerate" to be able to use it.
	// e.g., cloud.provider.com/preemptible="true":NoSchedule
	Taints []v1.Taint
}

This avoids duplicating labels/taints on each capacity, which makes it easier to create a cohort of capacities with similar resources.
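
For illustration, a capacity's quota entry could then reference a flavor purely by name, roughly like this (field names are hypothetical):

package v1alpha1

import "k8s.io/apimachinery/pkg/api/resource"

// FlavorQuotaSketch shows how a capacity could point at a shared
// ResourceFlavor object instead of repeating its labels/taints.
type FlavorQuotaSketch struct {
	// Name of a cluster-scoped ResourceFlavor object, e.g. nvidia-tesla-k80.
	Name string
	// Guaranteed quota of this flavor available to the capacity.
	Guaranteed resource.Quantity
}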

The downside, of course, is that there is now another resource the batch admin needs to deal with. But I expect the number of flavors to typically be small.

Ensure test cases are independent

In an effort to get a binary that "works", we wrote some tests where a test case depends on the state left by previous test cases.

This is problematic for debugging, and it tends to lead to a lot of test changes when there is a behavior change or when you want to insert a case in the middle of the existing ones.

Places that I'm aware of:

And there are similar situations in the following, but it's more like a single complex test case in each:

/priority backlog

Make the GPU a prime citizen in kueue

Hello fellow HPC and batch enthusiasts. I have read your public doc with much interest and have seen that GPUs are mentioned a couple of times. To make kueue and GPUs a success story, I think we need to align the requirements kueue has for scheduling with our k8s stack, which should expose the right information you need to make the right scheduling decisions.

There are dedicated GPUs, MIG slices, and vGPUs (either time-shared or MIG-backed); these are all features that need to be taken into consideration. Going further, if we're doing multi-node with MPI and the like, we also need to think about network topologies and node interconnects: you may prefer nodes with GPUDirect enabled over nodes that have "only" a GPU with a slow Ethernet connection.

I am one of the tech-leads for accelerator enablement on Kubernetes at NVIDIA and I am happy to help to move this forward.

Support for hierarchical ClusterQueues

Systems like YARN allow creating a hierarchy of fair sharing, which makes it possible to model deeper organizational structures.

Kueue currently supports three organizational levels: Cohort (models a business unit), ClusterQueue (models divisions within a business unit), and namespace (models teams within a division). However, fair sharing is only supported at one level: within a cohort.

We opted out of supporting hierarchy from the beginning for two reasons: (1) it adds complexity to both the API and the implementation; (2) it is not clear that in practice customers need more than two levels of sharing, which is what the current model enables and what seems to work for other frameworks like Slurm and HTCondor.

As Kueue evolves we likely need to revisit this decision.

Replace borrowing ceiling with weight

bit.ly/kueue-apis defined a weight to dynamically set a borrowing ceiling for each Capacity, based on the total resources in the Cohort and the capacities that have pending workloads.

We need to implement such behavior and remove the ceiling.
The weights and unused resources should lead to a dynamic ceiling that is recalculated in every scheduling cycle. The exact semantics of this calculation are not fully settled:

  • In a given scheduling cycle, which capacities are considered for splitting the unused resources? Only the ones with pending jobs? What about the ones that are already borrowing but have no more pending jobs?
  • What counts as unused resources once some resources have already been borrowed?

There are probably a few interpretations to these questions that lead to slightly different results. We need to explore them and pick one that sounds more reasonable or is based on existing systems.
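
For example, under one possible interpretation: if the cohort has 90 unused cores and the two capacities with pending workloads have weights 1 and 2, their dynamic ceilings for that cycle would allow borrowing 30 and 60 cores respectively (unused resources split proportionally to weight). Whether capacities that are already borrowing but have no pending workloads keep their share is exactly one of the open questions above.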

controller.kubernetes.io/queue-name annotation not registered

The code in this repo uses an annotation, controller.kubernetes.io/queue-name, that is not registered in https://kubernetes.io/docs/reference/labels-annotations-taints/

We should either:

  • register and document the annotation
  • avoid specifying controller.kubernetes.io as the namespace for that annotation, and instead require specifying it as a command line option to the app. That way, end-users wouldn't assume that any particular namespace is expected.
  • use another namespace that is appropriate for kueue.

Dynamically reclaiming resources

Currently, a job's resources are reclaimed by Kueue only when the whole job finishes; for jobs with multiple pods, this entails waiting until the last pod finishes. This is not efficient, as a parallel job may have laggard pods consuming few resources compared to the overall job.

One solution is to continuously update the Workload object with the number of completed pods so that Kueue can gradually reclaim the resources of those pods.
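
A hypothetical sketch of the kind of per-PodSet progress the job controller could publish (names are illustrative, not a proposed API):

package v1alpha1

// PodSetProgressSketch illustrates the information needed so that quota can
// be released incrementally as pods finish.
type PodSetProgressSketch struct {
	// Name of the PodSet within the workload.
	Name string
	// SucceededCount is the number of pods of this PodSet that have already
	// finished; Kueue could subtract their requests from the capacity usage.
	SucceededCount int32
}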

Support kubeflow's MPIJob

That is kubeflow's mpi-operator. We could have started with other custom jobs, but this one seems important enough for our audience.

They currently don't have a suspend field, so we need to add it. Then, we can model the controller on the existing kueue job-controller.
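
For context, the semantics we need mirror the suspend field of batch/v1 Job; a sketch of the kind of field we'd propose upstream (its name and placement in the MPIJob API are to be decided with the kubeflow maintainers):

package mpi

// RunPolicySketch is illustrative only; not the actual mpi-operator API.
type RunPolicySketch struct {
	// Suspend tells the operator not to create (or to delete) the job's pods.
	// Kueue would set it to false once the corresponding workload is admitted.
	Suspend *bool
}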

/label feature
/size L
/priority important-longterm

Add scalability tests

This is critical to better understand kueue's limits and where its bottlenecks are. We should check whether there is a way to use clusterloader for this.

Support dynamically sized (elastic) jobs

We should have a clear path towards supporting Spark and other dynamically sized jobs. Another example of this is Ray.

One related aspect is supporting dynamic updates to the resource requirements of a workload. We can probably limit that to changing the count of a PodSet in QueuedWorkload (in Spark, the number of workers can change during the runtime of the job, but not the resource requirements of a worker).

One idea is to model it in a way similar to "in-place update to pod resources" [1], but in our case it would be the count that is mutable. The driver pod in Spark would watch the corresponding QueuedWorkload instance and adjust the number of workers when the new count is admitted.

[1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources

Add support for budgets

Currently, ClusterQueue supports usage limits at a specific point in time. A common use case is for batch admins to set up budgets, meaning usage limits over periods of time; for example, x cores over a period of one month.

Graduate API to beta

Currently, this would be very cumbersome due to the lack of support from kubebuilder (kubernetes-sigs/controller-tools#656).

Once the support is added and we are ready to publish a v1beta1, we should consider renaming the API group. Note that this requires an official API review (kubernetes/enhancements#1111).

Summary doc: https://docs.google.com/document/d/1Uu4hfGxux4Wh_laqZMLxXdEVdty06Sb2DwB035hj700/edit?usp=sharing&resourcekey=0-b7mU7mGPCkEfhjyYDsXOBg (join https://groups.google.com/a/kubernetes.io/g/wg-batch to access)

Potential changes when graduating:

  • Move admission from Workload spec into status (from #498)
  • Rename min, max into something easier to understand.
  • Support queue name as a label, in addition to annotation (makes it easier to filter workloads by queue).
  • Add ObjectMeta into each PodSet template.

Brainstorm enhancing UX

We are adding more information to statuses of the various APIs we have (#7 and #5); but I am wondering what other UX-related enhancements we should pursue for the two personas: batch admin and batch user.

UX gets users excited about the system and I think should be a focal point as Kueue evolves.

Add workload priority

This is a placeholder to discuss priority semantics.

We can have it at the workload level or queue level.

Add user guide

/kind feature
/size M

Something more comprehensive than the existing README. Some of the use cases in bit.ly/kueue-apis can be dumped into samples/guides.

If possible, generate some documentation out of the APIs, similar to https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/

Contents (not every item will necessarily be its own page; some could be sections on existing pages):

  • Single CQ setup
  • Multiple flavors
  • Multiple CQ setup (cohorts)
  • Namespace selectors
  • Cohorts
  • Running a Job
  • Configuring RBAC
  • Monitoring usage (kubectl describe)

Fix go lint warnings

After running golangci-lint and gofmt, the following is reported:

pkg/capacity/capacity_test.go:37:19: Error return value is not checked (errcheck)                                      
        kueue.AddToScheme(scheme)                                                                                                                                                                                                             
                         ^                  
pkg/capacity/capacity_test.go:66:23: Error return value of `cache.AddCapacity` is not checked (errcheck)               
                                        cache.AddCapacity(context.Background(), &c)                                                                                                                                                           
                                                         ^                                                                                                                                                                                    
pkg/capacity/capacity_test.go:89:26: Error return value of `cache.UpdateCapacity` is not checked (errcheck)            
                                        cache.UpdateCapacity(&c)                                                       
                                                            ^                                                                                                                                                                                 
pkg/capacity/capacity_test.go:203:19: Error return value is not checked (errcheck)
        kueue.AddToScheme(scheme)                                                                                                                                                                                                             
                         ^                                                                                                                                                                                                                    
pkg/capacity/snapshot_test.go:38:19: Error return value is not checked (errcheck)                                                                                                                                                             
        kueue.AddToScheme(scheme)                                                                                                                                                                                                             
                         ^                                                                                                                                                                                                                    
pkg/capacity/snapshot_test.go:122:20: Error return value of `cache.AddCapacity` is not checked (errcheck)              
                cache.AddCapacity(context.Background(), &cap)                                                          
                                 ^                                                                                     
pkg/queue/manager.go:61:21: Error return value of `qImpl.setProperties` is not checked (errcheck)
        qImpl.setProperties(q)                   
                           ^                                                                                           
pkg/queue/manager_test.go:394:19: Error return value of `manager.AddQueue` is not checked (errcheck)                   
                manager.AddQueue(ctx, &q)                                                                              
                                ^                          
pkg/queue/manager_test.go:452:20: Error return value of `manager.AddQueue` is not checked (errcheck)                   
                        manager.AddQueue(ctx, &q)                                                                      
                                        ^                                                                                                                                                                                                     
pkg/queue/manager_test.go:462:19: Error return value of `manager.AddQueue` is not checked (errcheck)
                manager.AddQueue(ctx, &q)                                                                              
                                ^                   
pkg/scheduler/scheduler.go:200:32: Error return value of `s.capacityCache.AssumeWorkload` is not checked (errcheck)
        s.capacityCache.AssumeWorkload(newWorkload)                                                                                                                                                                                           
                                      ^                                                                                
pkg/capacity/capacity.go:120:2: S1023: redundant `return` statement (gosimple)                                         
        return                                                                                                         
        ^                                                                                                                                                                                                                                     
pkg/capacity/snapshot_test.go:292:4: SA9003: empty branch (staticcheck)
                        if m == nil {                                                                                                                                                                                                         
                        ^
make: *** [Makefile:73: ci-lint] Error 1
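
Most of these are mechanical; for example, the errcheck findings in tests can be fixed by failing the test when the call errors. A sketch of the pattern (client-go's scheme stands in for the kueue scheme to keep it self-contained):

package capacity

import (
	"testing"

	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
)

// TestSchemeSetup illustrates the errcheck fix: check the error returned by
// AddToScheme instead of discarding it. The same pattern applies to the
// cache.AddCapacity, cache.UpdateCapacity, manager.AddQueue and
// AssumeWorkload call sites flagged above.
func TestSchemeSetup(t *testing.T) {
	scheme := runtime.NewScheme()
	if err := clientgoscheme.AddToScheme(scheme); err != nil {
		t.Fatalf("Failed adding client-go types to scheme: %v", err)
	}
}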

Make sure assumed workloads are deleted when the object is deleted

Since the scheduler works on a snapshot, it's possible that a workload is deleted between the time we get it from a queue and when we assume it.

We should check the client cache before Assuming a workload to make sure it still exists.

Also, when a workload is deleted, we should clear the cache even if the workload API object doesn't show an assignment (regardless of DeleteStateUnknown). This is because the workload could be deleted between the time the scheduler Assumes it and the time it updates the assignment in the API.
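
A rough sketch of the pre-assume check; the scheduling loop and the concrete workload type are assumed, so the function takes a generic client.Object:

package scheduler

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// workloadStillExists re-reads the workload from the informer-backed client
// right before Assume, so we don't assume an object that was deleted while
// sitting in the snapshot. The object is refreshed in place.
func workloadStillExists(ctx context.Context, c client.Client, wl client.Object) (bool, error) {
	err := c.Get(ctx, client.ObjectKeyFromObject(wl), wl)
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	return err == nil, err
}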

/kind bug

Publish kueue in GCR

We don't necessarily need to wait for a production-ready version. We can publish alpha/beta builds.

/kind feature

ClusterQueue updates/deletions and running workloads

With regards to CQ deletions, perhaps we can inject finalizers to block the deletion until all running workloads finish, while at the same time stopping the admission of new workloads.

What about CQ updates? One simple solution is to make everything immutable, so that updating a CQ is only possible by recreating it; this reduces an update to a delete, which is handled above. We can relax this a little by allowing the following updates:

  1. an increase to existing quota
  2. adding new resources and/or flavors
  3. setting a cohort only if it was not set before

None of those updates impact running workloads, and they can be done without checking current usage levels.
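
A sketch of the finalizer injection using controller-runtime's controllerutil helpers; the finalizer name is made up for illustration:

package capacity

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer name; not an agreed-upon constant.
const inUseFinalizer = "kueue.x-k8s.io/resources-in-use"

// ensureFinalizer makes a delete request only mark the CQ for deletion; the
// controller would remove the finalizer once all admitted workloads finish.
func ensureFinalizer(ctx context.Context, c client.Client, cq client.Object) error {
	if controllerutil.ContainsFinalizer(cq, inUseFinalizer) {
		return nil
	}
	controllerutil.AddFinalizer(cq, inUseFinalizer)
	return c.Update(ctx, cq)
}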

/kind feature

Set pending condition on QueuedWorkload with message

A queued workload can be pending for several reasons:

  • The Queue doesn't exist
  • The ClusterQueue doesn't exist
  • The QW's namespace is not allowed by the ClusterQueue
  • The workload was attempted for scheduling but it didn't fit.

We need to find a way to set this information.

The first two can probably be handled in the queuedworkload_controller, after every update.
The other two should probably be handled during scheduling.
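
A sketch of setting such a condition, assuming the QueuedWorkload status grows a standard Conditions list (the condition type, reasons and message format are illustrative):

package workload

import (
	"fmt"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setPendingCondition records why the workload is still pending; it could be
// called from the queuedworkload controller or from the scheduler.
func setPendingCondition(conditions *[]metav1.Condition, reason, detail string) {
	apimeta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "Pending",
		Status:  metav1.ConditionTrue,
		Reason:  reason, // e.g. "QueueDoesNotExist" or "Unschedulable"
		Message: fmt.Sprintf("workload is pending: %s", detail),
	})
}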

/kind feature

Rename Capacity to ClusterQueue

Capacity not only defines usage limits for a set of tenants; it is also the level at which ordering is done for workloads submitted to queues sharing a capacity.

Renaming Capacity to ClusterQueue could provide clarity, with Queue being the namespaced equivalent serving two purposes:

  1. discoverability: tenants can simply list the queues that exist in their namespace to find which ones they can submit their workloads to, so it is simply a pointer to the cluster-scoped ClusterQueue.
  2. address the use case where a tenant is running an experiment and wants to define usage limits for that experiment; in this use case an experiment is modeled as a queue, which means tenants should be able to create/delete queues as they see fit.
