
Kueue


Kueue is a set of APIs and controller for job queueing. It is a job-level manager that decides when a job should be admitted to start (allowing its pods to be created) and when it should stop (requiring its active pods to be deleted).

Read the overview to learn more.

Features overview

  • Job management: Support job queueing based on priorities with different strategies: StrictFIFO and BestEffortFIFO.
  • Resource management: Support resource fair sharing and preemption with a variety of policies between different tenants.
  • Dynamic resource reclaim: A mechanism to release quota as the pods of a Job complete.
  • Resource flavor fungibility: Quota borrowing or preemption in ClusterQueue and Cohort.
  • Integrations: Built-in support for popular jobs, e.g. BatchJob, Kubeflow training jobs, RayJob, RayCluster, JobSet, plain Pod.
  • System insight: Built-in Prometheus metrics to help monitor the state of the system, as well as Conditions.
  • AdmissionChecks: A mechanism for internal or external components to influence whether a workload can be admitted.
  • Advanced autoscaling support: Integration with cluster-autoscaler's provisioningRequest via admissionChecks.
  • Sequential admission: A simple implementation of all-or-nothing scheduling.
  • Partial admission: Allows jobs to run with a smaller parallelism, based on available quota, if the application supports it.

Production Readiness status

  • ✔️ API version: v1beta1, respecting the Kubernetes Deprecation Policy

  • ✔️ Up-to-date documentation.

  • ✔️ Test coverage.

  • ✔️ Scalability verification via performance tests.

  • ✔️ Monitoring via metrics.

  • ✔️ Security: RBAC based accessibility.

  • ✔️ Stable release cycle (2-3 months) for new features, bugfixes, cleanups.

  • ✔️ Adopters running on production.

    Based on community feedback, we continue to simplify and evolve the API to address new use cases.

Installation

Requires Kubernetes 1.22 or newer.

To install the latest release of Kueue in your cluster, run the following command:

kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.6.2/manifests.yaml

The controller runs in the kueue-system namespace.

Read the installation guide to learn more.

Usage

A minimal configuration can be set up by applying one of the provided examples:

kubectl apply -f examples/admin/single-clusterqueue-setup.yaml
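Roughly, this example creates a default ResourceFlavor, a ClusterQueue with quota for it, and a LocalQueue in the default namespace pointing at the ClusterQueue; a sketch along these lines (exact names and quota values may differ from the shipped file):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}  # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: user-queue
spec:
  clusterQueue: cluster-queue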

Then you can run a job with:

kubectl create -f examples/jobs/sample-job.yaml
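The sample is an ordinary batch/v1 Job that targets the LocalQueue via the kueue.x-k8s.io/queue-name label and is created in a suspended state; a sketch (image, sizes, and counts are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # targets the LocalQueue above
spec:
  parallelism: 3
  completions: 3
  suspend: true  # Kueue unsuspends the Job once it admits the workload
  template:
    spec:
      containers:
      - name: main
        image: busybox
        args: ["sleep", "30"]
        resources:
          requests:
            cpu: "1"
      restartPolicy: Never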

Learn more about Kueue concepts and common tasks in the documentation.

Architecture

Learn more about the architecture of Kueue in the project's design docs.

Roadmap

This is a high-level overview of the main priorities for 2023, in expected order of release:

  • Cooperative preemption support for workloads that implement checkpointing #477
  • Flavor assignment strategies, e.g. minimizing cost vs minimizing borrowing #312
  • Integration with cluster-autoscaler for guaranteed resource provisioning
  • Integration with common custom workloads #74:
    • Kubeflow (TFJob, MPIJob, etc.)
    • Spark
    • Ray
    • Workflows (Tekton, Argo, etc.)

These are features that we aim to have in the long-term, in no particular order:

  • Budget support #28
  • Dashboard for management and monitoring for administrators
  • Multi-cluster support

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page and the contributor's guide.

You can reach the maintainers of this project via the #wg-batch channel on Kubernetes Slack and the wg-batch mailing list.

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.


kueue's Issues

Add support for budgets

Currently ClusterQueue supports usage limits at a specific point in time. A common use case is for batch admins to set up budgets, meaning usage limits over periods of time; for example, x cores over a period of one month.

Support for hierarchical ClusterQueues

Systems like Yarn allow creating a hierarchy of fair sharing, which allows modeling deeper organizational structures with fair-sharing.

Kueue currently supports three organizational levels: Cohort (models a business unit), ClusterQueue (models divisions within a business unit), and namespace (models teams within a division). However, fair sharing is only supported at one level, within a cohort.

We opted out of supporting hierarchy from the beginning for two reasons: (1) it adds complexity to both the API and the implementation; (2) it is not clear that customers in practice need more than the two levels of sharing that the current model enables, which also seems to work for other frameworks like Slurm and HTCondor.

As Kueue evolves, we will likely need to revisit this decision.

Graduate API to beta

Currently, this would be very cumbersome due to the lack of support from kubebuilder kubernetes-sigs/controller-tools#656

Once the support is added and we are ready to publish a v1beta1, we should consider renaming the api group. Note that this requires an official api-review kubernetes/enhancements#1111

Summary doc: https://docs.google.com/document/d/1Uu4hfGxux4Wh_laqZMLxXdEVdty06Sb2DwB035hj700/edit?usp=sharing&resourcekey=0-b7mU7mGPCkEfhjyYDsXOBg (join https://groups.google.com/a/kubernetes.io/g/wg-batch to access)

Potential changes when graduating:

  • Move admission from Workload spec into status (from #498)
  • Rename min, max to something easier to understand.
  • Support queue name as a label, in addition to annotation (makes it easier to filter workloads by queue).
  • Add ObjectMeta into each PodSet template.

Dynamically reclaiming resources

Currently, a job's resources are reclaimed by Kueue only when the whole job finishes; for jobs with multiple pods, this entails waiting until the last pod finishes. This is not efficient, as a parallel job may have straggler pods that consume few resources compared to the overall job.

One solution is to continuously update the Workload object with the number of completed pods so that Kueue can gradually reclaim the resources of those pods.
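For illustration, the Workload status could carry a per-PodSet count of pods whose resources are no longer needed; the field names below are assumptions of this sketch:

status:
  reclaimablePods:   # assumed field: pods whose quota can be released early
  - name: main       # PodSet name
    count: 2         # pods already finished; their quota can be reclaimed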

Fix go lint warnings

After running golangci-lint and gofmt, the following is shown (i.e., errors returned by calls like kueue.AddToScheme and cache.AddCapacity should be checked or explicitly ignored):

pkg/capacity/capacity_test.go:37:19: Error return value is not checked (errcheck)
        kueue.AddToScheme(scheme)
                         ^
pkg/capacity/capacity_test.go:66:23: Error return value of `cache.AddCapacity` is not checked (errcheck)
                                        cache.AddCapacity(context.Background(), &c)
                                                         ^
pkg/capacity/capacity_test.go:89:26: Error return value of `cache.UpdateCapacity` is not checked (errcheck)
                                        cache.UpdateCapacity(&c)
                                                            ^
pkg/capacity/capacity_test.go:203:19: Error return value is not checked (errcheck)
        kueue.AddToScheme(scheme)
                         ^
pkg/capacity/snapshot_test.go:38:19: Error return value is not checked (errcheck)
        kueue.AddToScheme(scheme)
                         ^
pkg/capacity/snapshot_test.go:122:20: Error return value of `cache.AddCapacity` is not checked (errcheck)
                cache.AddCapacity(context.Background(), &cap)
                                 ^
pkg/queue/manager.go:61:21: Error return value of `qImpl.setProperties` is not checked (errcheck)
        qImpl.setProperties(q)
                           ^
pkg/queue/manager_test.go:394:19: Error return value of `manager.AddQueue` is not checked (errcheck)
                manager.AddQueue(ctx, &q)
                                ^
pkg/queue/manager_test.go:452:20: Error return value of `manager.AddQueue` is not checked (errcheck)
                        manager.AddQueue(ctx, &q)
                                        ^
pkg/queue/manager_test.go:462:19: Error return value of `manager.AddQueue` is not checked (errcheck)
                manager.AddQueue(ctx, &q)
                                ^
pkg/scheduler/scheduler.go:200:32: Error return value of `s.capacityCache.AssumeWorkload` is not checked (errcheck)
        s.capacityCache.AssumeWorkload(newWorkload)
                                      ^
pkg/capacity/capacity.go:120:2: S1023: redundant `return` statement (gosimple)
        return
        ^
pkg/capacity/snapshot_test.go:292:4: SA9003: empty branch (staticcheck)
                        if m == nil {
                        ^
make: *** [Makefile:73: ci-lint] Error 1

Make the GPU a prime citizen in kueue

Hello fellow HPC and batch enthusiasts. I have read your public doc with much interest, and I have seen that the GPU is mentioned a couple of times. To make kueue and GPUs a success story, I think we need to align the requirements that kueue has for scheduling with a k8s stack that exposes the right information for making the right scheduling decisions.

There are dedicated GPUs, MIG slices, and vGPUs (either time-shared or MIG-backed); all of these features need to be taken into consideration. Going further, if we're doing multi-node with MPI and such, we also need to think about network topologies and node interconnects. You may rather use nodes that have GPUDirect enabled than nodes that have "only" a GPU with a slow Ethernet connection.

I am one of the tech-leads for accelerator enablement on Kubernetes at NVIDIA and I am happy to help to move this forward.

Support dynamically sized (elastic) jobs

We should have a clear path towards supporting Spark and other dynamically sized jobs. Another example of this is Ray.

One related aspect is supporting dynamic updates to the resource requirements of a workload. We can probably limit that to changing the count of a PodSet in QueuedWorkload (in Spark, the number of workers can change during the runtime of the job, but not the resource requirements of a worker).

One idea is to model it in a way similar to "in-place update of pod resources" [1], but in our case it would be the count that is mutable. The driver pod in Spark would watch the corresponding QueuedWorkload instance and adjust the number of workers when the new count is admitted.

[1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1287-in-place-update-pod-resources

Set pending condition on QueuedWorkload with message

A queued workload can be pending for several reasons:

  • The Queue doesn't exist
  • The ClusterQueue doesn't exist
  • The QW's namespace is not allowed by the ClusterQueue
  • The workload was attempted for scheduling but it didn't fit.

We need to find a way to set this information.

The first two can probably happen in the queuedworkload_controller, after every update.
The other two should probably be set during scheduling.
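For example, the controllers could surface this as a condition on the QueuedWorkload status; the type, reason, and message below are illustrative:

status:
  conditions:
  - type: Admitted          # illustrative condition type
    status: "False"
    reason: Pending
    message: "ClusterQueue main-queue doesn't exist"  # illustrative message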

/kind feature

Support Argo/Tekton workflows

This is lower priority than #65, but it would be good to have an integration with a workflow framework.

Argo supports the suspend flag, but the tricky part is that suspend applies to the whole workflow, meaning a QueuedWorkload would need to represent the resources of the whole workflow all at once.

Ideally, Argo would create jobs per sequential step, so that resource reservation happens one step at a time.

Match workload affinity with capacity labels

During workload scheduling, a workload's node affinities and selectors should be matched against the labels of the resource flavors. This allows a workload to specify which exact flavors to use, or even force a different evaluation order of the flavors than that defined by the capacity.
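For example, if a flavor carries the label cloud.provider.com/accelerator: nvidia-tesla-k80, a workload could select that flavor with a plain node selector on its pod template; a sketch:

spec:
  template:
    spec:
      nodeSelector:
        cloud.provider.com/accelerator: nvidia-tesla-k80  # matches the flavor's label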

/kind feature

Flavors with matching names should have identical labels/taints

A capacity can borrow resources from flavors matching the names of ones defined in the capacity. Those flavors with matching names should also have identical labels and taints.

One solution is to define a cluster-scoped object API that represents resource flavors that capacities refer to by name when setting a quota. It would look like this:

type ResourceFlavorSpec struct {
  // The object name serves as the flavor name, e.g., nvidia-tesla-k80.

  // Resource is the resource name, e.g., nvidia.com/gpus.
  Resource v1.ResourceName

  // Labels associated with this flavor. These labels are matched against or
  // converted to node affinity constraints on the workload's pods.
  // For example, cloud.provider.com/accelerator: nvidia-tesla-k80.
  Labels map[string]string

  // Taints associated with this flavor that workloads must explicitly
  // "tolerate" to be able to use this flavor.
  // e.g., cloud.provider.com/preemptible="true":NoSchedule
  Taints []Taint
}

This will avoid duplicating labels/taints on each capacity, and so makes it easier to create a cohort of capacities with similar resources.

The downside, of course, is that we now have another resource that the batch admin needs to deal with. But I expect that the number of flavors will typically be small.
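A sketch of what such a cluster-scoped object could look like, assuming the kueue.x-k8s.io API group used elsewhere in the project (the schema itself is what is being proposed here):

apiVersion: kueue.x-k8s.io/v1alpha1
kind: ResourceFlavor
metadata:
  name: nvidia-tesla-k80   # the object name serves as the flavor name
spec:
  resource: nvidia.com/gpus
  labels:
    cloud.provider.com/accelerator: nvidia-tesla-k80
  taints:
  - key: cloud.provider.com/preemptible
    value: "true"
    effect: NoSchedule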

Support for workload preemption

Preemption can be useful to reclaim borrowed capacity, however the obvious tradeoff is interrupting workloads and potentially losing significant progress.

There are two high-level design decisions we need to make, including whether they should be tunable:

  1. What triggers preemption? Reclaiming borrowed capacity? Workload priority?
  2. What is the scope? Is preemption a cohort knob? A capacity knob? A queue knob?

ClusterQueue updates/deletions and running workloads

With regard to CQ deletions, perhaps we can inject finalizers to block the delete until all running workloads finish, while at the same time stopping admission of new workloads.

What about CQ updates? One simple solution is to make everything immutable, so that updating a CQ is only possible by recreating it; this reduces update to a delete, which we already handled above. We can relax this a little by allowing the following updates:

  1. an increase to existing quota
  2. adding new resources and/or flavors
  3. setting a cohort only if it was not set before

All of these updates don't impact running workloads and can be applied without checking current usage levels.

/kind feature

Make unit tests run at least 3 times

We should not allow any flakiness in our unit tests. The prow job should run the tests with -race -count 3

Leave the option in the Makefile to run the tests only once (by default), as it's likely useful during development.
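For example, the CI target could run something like:

go test -race -count=3 ./...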

/priority important-soon
/kind cleanup

Make sure assumed workloads are deleted when the object is deleted

Since the scheduler works on a snapshot, it's possible that a workload is deleted between the time we get it from a queue and when we assume it.

We should check the client cache before Assuming a workload to make sure it still exists.

Also, when a workload is deleted, we should clear the cache even if the workload API object is not assigned (regardless of DeleteStateUnknown). This is because the workload could be deleted between the time the scheduler assumes it and the time it updates the assignment in the API.

/kind bug

Replace borrowing ceiling with weight

bit.ly/kueue-apis defined a weight to dynamically set a borrowing ceiling for each Capacity, based on the total resources in the Cohort and the capacities that have pending workloads.

We need to implement such behavior and remove the ceiling.
The weights and unused resources should lead to a dynamic ceiling that is calculated in every scheduling cycle. The exact semantics of this calculation are not fully understood:
In a given scheduling cycle, which capacities are considered for splitting the unused resources? Only the ones with pending jobs? What about the ones that are already borrowing but have no more pending jobs? What is considered unused resources once some resources have already been borrowed?

There are probably a few interpretations of these questions that lead to slightly different results. We need to explore them and pick one that sounds most reasonable or is grounded in existing systems.

Publish kueue in GCR

We don't necessarily need to wait for a production-ready version; we can publish alpha/beta builds.

/kind feature

Need to improve the readability of the log

1.6451684909657109e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
1.6451684909663508e+09	INFO	setup	starting manager
1.6451684909665146e+09	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
1.645168490966593e+09	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0218 07:14:51.066639       1 leaderelection.go:248] attempting to acquire leader lease kueue-system/c1f6bfd2.gke-internal.googlesource.com...
I0218 07:15:07.705977       1 leaderelection.go:258] successfully acquired lease kueue-system/c1f6bfd2.gke-internal.googlesource.com
1.6451685077060497e+09	DEBUG	events	Normal	{"object": {"kind":"ConfigMap","namespace":"kueue-system","name":"c1f6bfd2.gke-internal.googlesource.com","uid":"e70e4b9b-54f4-4782-a904-e57d3001c8e6","apiVersion":"v1","resourceVersion":"264201"}, "reason": "LeaderElection", "message": "kueue-controller-manager-7ff7b759bf-nszmb_05445f7f-a871-4a4c-83c1-af075b850e49 became leader"}
1.6451685077061899e+09	DEBUG	events	Normal	{"object": {"kind":"Lease","namespace":"kueue-system","name":"c1f6bfd2.gke-internal.googlesource.com","uid":"72b48bf0-20e0-42a4-823b-2a6edcb3288a","apiVersion":"coordination.k8s.io/v1","resourceVersion":"264202"}, "reason": "LeaderElection", "message": "kueue-controller-manager-7ff7b759bf-nszmb_05445f7f-a871-4a4c-83c1-af075b850e49 became leader"}
1.6451685077062488e+09	INFO	controller.queue	Starting EventSource	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue", "source": "kind source: *v1alpha1.Queue"}
1.645168507706281e+09	INFO	controller.queue	Starting Controller	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue"}
1.6451685077062566e+09	INFO	controller.queuedworkload	Starting EventSource	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload", "source": "kind source: *v1alpha1.QueuedWorkload"}
1.6451685077063015e+09	INFO	controller.queuedworkload	Starting Controller	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload"}
1.6451685077062776e+09	INFO	controller.capacity	Starting EventSource	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity", "source": "kind source: *v1alpha1.Capacity"}
1.6451685077063189e+09	INFO	controller.capacity	Starting Controller	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity"}
1.6451685077064047e+09	INFO	controller.job	Starting EventSource	{"reconciler group": "batch", "reconciler kind": "Job", "source": "kind source: *v1.Job"}
1.6451685077064307e+09	INFO	controller.job	Starting EventSource	{"reconciler group": "batch", "reconciler kind": "Job", "source": "kind source: *v1alpha1.QueuedWorkload"}
1.6451685077064393e+09	INFO	controller.job	Starting Controller	{"reconciler group": "batch", "reconciler kind": "Job"}
1.6451685078075259e+09	INFO	controller.queuedworkload	Starting workers	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload", "worker count": 1}
1.6451685078075113e+09	INFO	controller.capacity	Starting workers	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity", "worker count": 1}
1.645168507807566e+09	INFO	controller.queue	Starting workers	{"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue", "worker count": 1}
1.6451685078076618e+09	INFO	controller.job	Starting workers	{"reconciler group": "batch", "reconciler kind": "Job", "worker count": 1}
1.645168507807886e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"ingress-nginx-admission-create","namespace":"kube-system"}}
1.645168507808418e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"ingress-nginx-admission-patch","namespace":"kube-system"}}
1.6451685078085716e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"kube-eventer-init-v1.6-a92aba6-aliyun","namespace":"kube-system"}}
1.6451706903900485e+09	LEVEL(-2)	capacity-reconciler	Capacity create event	{"capacity": {"name":"cluster-total"}}
1.6451706904384277e+09	LEVEL(-2)	queue-reconciler	Queue create event	{"queue": {"name":"main","namespace":"default"}}
1.6451707150770907e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"sample-job-jjbq2","namespace":"default"}}
1.6451707150895817e+09	LEVEL(-2)	queued-workload-reconciler	QueuedWorkload create event	{"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "queue": "main", "status": "pending"}
1.645170715089716e+09	LEVEL(-2)	scheduler	Workload assumed in the cache	{"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "capacity": "cluster-total"}
1.6451707150901928e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"sample-job-jjbq2","namespace":"default"}}
1.6451707150984285e+09	LEVEL(-2)	scheduler	Successfully assigned capacity and resource flavors to workload	{"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "capacity": "cluster-total"}
1.6451707150985863e+09	LEVEL(-2)	queued-workload-reconciler	QueuedWorkload update event	{"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "queue": "main", "capacity": "cluster-total", "status": "assigned", "prevStatus": "pending", "prevCapacity": ""}
1.6451707150986767e+09	LEVEL(-2)	job-reconciler	Job reconcile event	{"job": {"name":"sample-job-jjbq2","namespace":"default"}}

We can choose to switch to klog/v2.

Validating that flavors of a resource are different

What if we validate that the flavors of a resource in a capacity have at least one common label key with different values?

This practically forces each flavor to point to a different set of nodes.
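For illustration, a valid pair of flavors could share the key instance-type with differing values (the shape and names below are illustrative):

flavors:
- name: spot
  labels:
    instance-type: spot        # common key, value "spot"
- name: on-demand
  labels:
    instance-type: on-demand   # common key, value "on-demand"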

Add events that track a workload's status

Two possible locations to issue events:

  • when it is assigned a capacity in the scheduling loop.
  • in the job-controller when a corresponding workload is created.

/kind feature

Match workload tolerations with capacity taints

During workload scheduling, a workload's tolerations should be matched against the taints of the resource flavors. This allows a workload to opt-in to specific flavors.
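For example, to opt in to a flavor tainted with cloud.provider.com/preemptible="true":NoSchedule, the workload's pod template would carry a matching toleration; a sketch:

spec:
  template:
    spec:
      tolerations:
      - key: cloud.provider.com/preemptible
        operator: Equal
        value: "true"
        effect: NoSchedule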

/kind feature
/priority important-soon

Support kubeflow's MPIJob

That is, kubeflow's mpi-operator. We could have started with other custom jobs, but this one seems important enough for our audience.

They currently don't have a suspend field, so we need to add it (see the sketch below). Then we can implement the controller based on the existing kueue job-controller.
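A sketch of what the proposed field could look like on an MPIJob; the field does not exist in mpi-operator yet, and its name and placement under runPolicy are assumptions:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: sample-mpijob
spec:
  runPolicy:
    suspend: true  # proposed field: created suspended, unsuspended on admission
  # mpiReplicaSpecs (Launcher/Worker) omitted for brevity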

/label feature
/size L
/priority important-longterm

Add info to Queue status

Suggestions:

  • Number of pending jobs
  • Number of started jobs
  • Resources currently used by the queue.
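Put together, a hypothetical Queue status could look like the following (all field names are assumptions of this sketch):

status:
  pendingWorkloads: 3    # hypothetical field: jobs waiting to be admitted
  admittedWorkloads: 2   # hypothetical field: jobs started
  usedResources:         # hypothetical field: resources currently in use
    cpu: "10"
    memory: 32Gi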

/kind feature

Ensure test cases are independent

In an effort to get a binary that "works", we wrote some tests where a test case depends on the state left by previous test cases.

This makes debugging problems harder, and it tends to require a lot of test changes when there is a behavior change or when you want to insert a case in the middle of the existing ones.

I'm aware of several places where this happens, and there are similar situations elsewhere that are more like a single complex test case each.

/priority backlog

Brainstorm enhancing UX

We are adding more information to the statuses of the various APIs we have (#7 and #5), but I am wondering what other UX-related enhancements we should pursue for our two personas: batch admin and batch user.

UX gets users excited about a system, and I think it should be a focal point as Kueue evolves.

Add user guide

/kind feature
/size M

Something more comprehensive than the existing README. Some of the use cases in bit.ly/kueue-apis can be turned into samples/guides.

If possible, generate some documentation out of the APIs, similar to https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/

Contents (not necessarily one page each; they could be sections on existing pages):

  • Single CQ setup
  • Multiple flavors
  • Multiple CQ setup (cohorts)
  • Namespace selectors
  • Cohorts
  • Running a Job
  • Configuring RBAC
  • Monitoring usage (kubectl describe)

controller.kubernetes.io/queue-name annotation not registered

The code in this repo uses an annotation, controller.kubernetes.io/queue-name, that is not registered in https://kubernetes.io/docs/reference/labels-annotations-taints/
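For reference, the annotation in question is attached to Job objects along these lines (the queue name is illustrative):

metadata:
  annotations:
    controller.kubernetes.io/queue-name: main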

We should either:

  • register and document the annotation
  • avoid specifying controller.kubernetes.io as the namespace for that annotation, and instead require specifying it as a command line option to the app. That way, end-users wouldn't assume that any particular namespace is expected.
  • use another namespace, that is appropriate for kueue.

Consider a different image for testing/samples

I am running into

  Warning  Failed     11s   kubelet            Failed to pull image "perl": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/perl: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

We might want to consider an image from a different registry to avoid this unfortunate error.

/kind test

Enhance Makefile arguments for img building and pushing

The current Makefile doesn't provide flexible ways to modify how the image is built and pushed. Something like the following would help:

VERSION := $(shell git describe --tags --dirty --always)
# Image URL to use all building/pushing image targets
IMAGE_BUILD_CMD ?= docker build
IMAGE_PUSH_CMD ?= docker push
IMAGE_BUILD_EXTRA_OPTS ?=
IMAGE_REGISTRY ?= k8s.gcr.io/kueue
IMAGE_NAME := controller
IMAGE_TAG_NAME ?= $(VERSION)
IMAGE_EXTRA_TAG_NAMES ?=
IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME)
IMAGE_TAG ?= $(IMAGE_REPO):$(IMAGE_TAG_NAME)
BASE_IMAGE_FULL ?= golang:1.17

Also, in order to be more generic, rename:

  • docker-image to simply image or image-build
  • docker-push to simply push or image-push

This provides more flexibility when developing in a non-Docker environment (e.g., buildah or podman), or even when building the image with a CI tool on Kubernetes itself.
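For example, assuming the renamed targets above (the registry is illustrative), a podman-based workflow could then run:

make image-build image-push IMAGE_BUILD_CMD="podman build" IMAGE_PUSH_CMD="podman push" IMAGE_REGISTRY=registry.example.com/kueue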

/kind feature

Add scheduler integration tests

We have one that covers the job-controller on its own; we need a test that covers all the other controllers together, creating a queue, a capacity, and multiple jobs, and verifying that jobs are started as expected.

Add workload priority

This is a placeholder to discuss priority semantics.

We can have it at the workload level or queue level.

[Umbrella] ☂️ Requirements for release 0.1.0

Deadline: May 16th (KubeCon EU)

Issues that we need to complete to consider kueue ready for a first release:

  • Match workload affinities with flavors #3
  • Single heap per Capacity #87
  • Consistent flavors in a cohort #59
  • Queue status #5
  • Capacity status #7
  • Event for unschedulable workloads #91
  • Capacity namespace selector #4
  • Efficient requeuing #8
  • User guide #64
  • Publish image #52

Nice to have:

  • Add borrowing weight #62
  • E2E test #61
  • Use kueue.sigs.k8s.io API group #23
  • Support for one custom job #65

Rename Capacity to ClusterQueue

Capacity not only defines usage limits for a set of tenants; it is also the level at which ordering is done for workloads submitted to queues sharing a capacity.

Renaming Capacity to ClusterQueue could provide clarity, with Queue being the namespaced equivalent serving two purposes:

  1. Discoverability: tenants can simply list the queues that exist in their namespace to find which ones they can submit their workloads to, so it is simply a pointer to the cluster-scoped ClusterQueue.
  2. Addressing the use case where a tenant is running an experiment and wants to define usage limits for that experiment; in this case an experiment is modeled as a queue, which means tenants should be able to create/delete queues as they see fit.

Add scalability tests

This is critical to better understand kueue's limits and where its bottlenecks are. We should check whether there is a way to use clusterloader for this.
