kubernetes-sigs / kueue
Kubernetes-native Job Queueing
Home Page: https://kueue.sigs.k8s.io
License: Apache License 2.0
1.6451684909657109e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
1.6451684909663508e+09 INFO setup starting manager
1.6451684909665146e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.645168490966593e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0218 07:14:51.066639 1 leaderelection.go:248] attempting to acquire leader lease kueue-system/c1f6bfd2.gke-internal.googlesource.com...
I0218 07:15:07.705977 1 leaderelection.go:258] successfully acquired lease kueue-system/c1f6bfd2.gke-internal.googlesource.com
1.6451685077060497e+09 DEBUG events Normal {"object": {"kind":"ConfigMap","namespace":"kueue-system","name":"c1f6bfd2.gke-internal.googlesource.com","uid":"e70e4b9b-54f4-4782-a904-e57d3001c8e6","apiVersion":"v1","resourceVersion":"264201"}, "reason": "LeaderElection", "message": "kueue-controller-manager-7ff7b759bf-nszmb_05445f7f-a871-4a4c-83c1-af075b850e49 became leader"}
1.6451685077061899e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"kueue-system","name":"c1f6bfd2.gke-internal.googlesource.com","uid":"72b48bf0-20e0-42a4-823b-2a6edcb3288a","apiVersion":"coordination.k8s.io/v1","resourceVersion":"264202"}, "reason": "LeaderElection", "message": "kueue-controller-manager-7ff7b759bf-nszmb_05445f7f-a871-4a4c-83c1-af075b850e49 became leader"}
1.6451685077062488e+09 INFO controller.queue Starting EventSource {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue", "source": "kind source: *v1alpha1.Queue"}
1.645168507706281e+09 INFO controller.queue Starting Controller {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue"}
1.6451685077062566e+09 INFO controller.queuedworkload Starting EventSource {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload", "source": "kind source: *v1alpha1.QueuedWorkload"}
1.6451685077063015e+09 INFO controller.queuedworkload Starting Controller {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload"}
1.6451685077062776e+09 INFO controller.capacity Starting EventSource {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity", "source": "kind source: *v1alpha1.Capacity"}
1.6451685077063189e+09 INFO controller.capacity Starting Controller {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity"}
1.6451685077064047e+09 INFO controller.job Starting EventSource {"reconciler group": "batch", "reconciler kind": "Job", "source": "kind source: *v1.Job"}
1.6451685077064307e+09 INFO controller.job Starting EventSource {"reconciler group": "batch", "reconciler kind": "Job", "source": "kind source: *v1alpha1.QueuedWorkload"}
1.6451685077064393e+09 INFO controller.job Starting Controller {"reconciler group": "batch", "reconciler kind": "Job"}
1.6451685078075259e+09 INFO controller.queuedworkload Starting workers {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "QueuedWorkload", "worker count": 1}
1.6451685078075113e+09 INFO controller.capacity Starting workers {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Capacity", "worker count": 1}
1.645168507807566e+09 INFO controller.queue Starting workers {"reconciler group": "kueue.x-k8s.io", "reconciler kind": "Queue", "worker count": 1}
1.6451685078076618e+09 INFO controller.job Starting workers {"reconciler group": "batch", "reconciler kind": "Job", "worker count": 1}
1.645168507807886e+09 LEVEL(-2) job-reconciler Job reconcile event {"job": {"name":"ingress-nginx-admission-create","namespace":"kube-system"}}
1.645168507808418e+09 LEVEL(-2) job-reconciler Job reconcile event {"job": {"name":"ingress-nginx-admission-patch","namespace":"kube-system"}}
1.6451685078085716e+09 LEVEL(-2) job-reconciler Job reconcile event {"job": {"name":"kube-eventer-init-v1.6-a92aba6-aliyun","namespace":"kube-system"}}
1.6451706903900485e+09 LEVEL(-2) capacity-reconciler Capacity create event {"capacity": {"name":"cluster-total"}}
1.6451706904384277e+09 LEVEL(-2) queue-reconciler Queue create event {"queue": {"name":"main","namespace":"default"}}
1.6451707150770907e+09 LEVEL(-2) job-reconciler Job reconcile event {"job": {"name":"sample-job-jjbq2","namespace":"default"}}
1.6451707150895817e+09 LEVEL(-2) queued-workload-reconciler QueuedWorkload create event {"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "queue": "main", "status": "pending"}
1.645170715089716e+09 LEVEL(-2) scheduler Workload assumed in the cache {"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "capacity": "cluster-total"}
1.6451707150901928e+09 LEVEL(-2) job-reconciler Job reconcile event {"job": {"name":"sample-job-jjbq2","namespace":"default"}}
1.6451707150984285e+09 LEVEL(-2) scheduler Successfully assigned capacity and resource flavors to workload {"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "capacity": "cluster-total"}
1.6451707150985863e+09 LEVEL(-2) queued-workload-reconciler QueuedWorkload update event {"queuedWorkload": {"name":"sample-job-jjbq2","namespace":"default"}, "queue": "main", "capacity": "cluster-total", "status": "assigned", "prevStatus": "pending", "prevCapacity": ""}
1.6451707150986767e+09 LEVEL(-2) job-reconciler Job reconcile event {"job": {"name":"sample-job-jjbq2","namespace":"default"}}
We can choose to switch to klog/v2.
The current Makefile doesn't provide a flexible way to customize how the image is built and pushed.
VERSION := $(shell git describe --tags --dirty --always)
# Image URL to use all building/pushing image targets
IMAGE_BUILD_CMD ?= docker build
IMAGE_PUSH_CMD ?= docker push
IMAGE_BUILD_EXTRA_OPTS ?=
IMAGE_REGISTRY ?= k8s.gcr.io/kueue
IMAGE_NAME := controller
IMAGE_TAG_NAME ?= $(VERSION)
IMAGE_EXTRA_TAG_NAMES ?=
IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME)
IMAGE_TAG ?= $(IMAGE_REPO):$(IMAGE_TAG_NAME)
BASE_IMAGE_FULL ?= golang:1.17
Also, in order to be more generic, rename docker-image to simply image or image-build, and docker-push to simply push or image-push.
This provides more flexibility when developing in a non-Docker environment with tools like buildah or podman, or even when building the image with a CI tool on Kubernetes itself.
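A sketch of what the proposed generic targets could look like, reusing the variables above (target names and recipe details are illustrative, not the actual Makefile):

```make
# Illustrative only: proposed generic targets built on the variables above.
image-build:
	$(IMAGE_BUILD_CMD) -t $(IMAGE_TAG) $(IMAGE_BUILD_EXTRA_OPTS) .

image-push:
	$(IMAGE_PUSH_CMD) $(IMAGE_TAG)
```

With this shape, something like `make image-build IMAGE_BUILD_CMD="podman build" IMAGE_REGISTRY=localhost:5000` would work without editing the Makefile.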
/kind feature
Suggestions:
/kind feature
Just to focus on the event handlers and status updates.
Keeping them independent of the scheduler should allow us to not depend on a specific queuing policy.
The test itself can do the assignments.
/kind cleanup
What if we validate that the flavors of a resource in a capacity have at least one common label key with different values?
This practically forces that each flavor is pointing to different sets of nodes.
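A minimal sketch of that validation (a hypothetical helper, not kueue's actual code): two flavors are considered distinct if they share at least one label key with different values.

```go
package main

import "fmt"

// flavorsDistinct reports whether two flavors share at least one common
// label key with different values, i.e. whether they necessarily point
// to different sets of nodes. (Illustrative helper, not kueue's code.)
func flavorsDistinct(a, b map[string]string) bool {
	for k, va := range a {
		if vb, ok := b[k]; ok && vb != va {
			return true
		}
	}
	return false
}

func main() {
	k80 := map[string]string{"cloud.provider.com/accelerator": "nvidia-tesla-k80"}
	v100 := map[string]string{"cloud.provider.com/accelerator": "nvidia-tesla-v100"}
	fmt.Println(flavorsDistinct(k80, v100)) // true: same key, different values
	fmt.Println(flavorsDistinct(k80, k80))  // false: no key differs
}
```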
I am running into
Warning Failed 11s kubelet Failed to pull image "perl": rpc error: code = Unknown desc = reading manifest latest in docker.io/library/perl: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
We might want to consider an image from a different registry to avoid this unfortunate error.
/kind test
Spin off #89 (comment)
We should consider upgrading ginkgo to v2, as stated in their logs.
For now we can work around it by setting the ACK_GINKGO_DEPRECATIONS=1.16.5 env var when running the integration tests.
We should not allow any flakiness in our unit tests. The prow job should run the tests with -race -count 3
Leave the option in the Makefile to run the tests only once (by default), as it's likely useful during development.
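One way to sketch that Makefile option (variable and target names are illustrative, not the repo's actual Makefile):

```make
# Illustrative: default to a single run locally; the prow job would
# override with GO_TEST_COUNT=3 to shake out flakes.
GO_TEST_COUNT ?= 1
test:
	go test -race -count $(GO_TEST_COUNT) ./...
```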
/priority important-soon
/kind cleanup
During workload scheduling, a workload's node affinities and selectors should be matched against the labels of the resource flavors. This allows a workload to specify which exact flavors to use, or even force a different evaluation order of the flavors than that defined by the capacity.
/kind feature
We have one that covers the job-controller on its own; we need a test that covers all the other controllers together, which includes creating a queue, a capacity and multiple jobs, and inspects that the jobs are started as expected.
During workload scheduling, a workload's tolerations should be matched against the taints of the resource flavors. This allows a workload to opt-in to specific flavors.
/kind feature
/priority important-soon
Two possible locations to issue events:
/kind feature
Currently we relentlessly keep trying to schedule jobs.
We need to do something similar to what we did in the scheduler: re-queue based on capacity/workload/queue events.
/kind feature
Deadline: May 16th Kubecon EU
Issues that we need to complete to consider kueue ready for a first release:
Nice to have:
Preemption can be useful to reclaim borrowed capacity, however the obvious tradeoff is interrupting workloads and potentially losing significant progress.
There are two high-level design decisions we need to make, including whether they should be tunable:
This is lower priority than #65, but it would be good to have an integration with a workflow framework.
Argo supports the suspend flag, the tricky part is that suspend is for the whole workflow, meaning a QueuedWorkload would need to represent the resources of the whole workflow all at once.
Ideally Argo should create jobs per sequential step, and then resource reservation happens one step at a time.
A capacity can borrow resources from flavors matching the names of ones defined in the capacity. Those flavors with matching names should also have identical labels and taints.
One solution is to define a cluster-scoped object API that represents resource flavors that capacities refer to by name when setting a quota. It would look like this:
type ResourceFlavorSpec struct {
	// The object name serves as the flavor name, e.g., nvidia-tesla-k80.
	// Resource is the resource name, e.g., nvidia.com/gpus.
	Resource v1.ResourceName

	// Labels associated with this flavor. These labels are matched against or
	// converted to node affinity constraints on the workload's pods.
	// For example, cloud.provider.com/accelerator: nvidia-tesla-k80.
	Labels map[string]string

	// Taints associated with this flavor that workloads must explicitly
	// "tolerate" to be able to use this flavor.
	// e.g., cloud.provider.com/preemptible="true":NoSchedule
	Taints []Taint
}
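Under that design, a flavor object might look like the following (the API group/version and field casing are assumptions based on the struct above, not a published API):

```yaml
# Hypothetical manifest for the proposed cluster-scoped API.
apiVersion: kueue.x-k8s.io/v1alpha1
kind: ResourceFlavor
metadata:
  name: nvidia-tesla-k80
spec:
  resource: nvidia.com/gpus
  labels:
    cloud.provider.com/accelerator: nvidia-tesla-k80
  taints:
  - key: cloud.provider.com/preemptible
    value: "true"
    effect: NoSchedule
```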
This will avoid duplicating labels/taints on each capacity, and so makes it easier to create a cohort of capacities with similar resources.
The downside, of course, is that we now have another resource that the batch admin needs to deal with. But I expect that the number of flavors will typically be small.
In an effort to get a binary that "works", we wrote some tests where a test case depends on the state left by previous test cases.
This is problematic for debugging problems and it tends to lead to a lot of test changes when there is a behavior change or you want to insert a case in the middle of the existing ones.
Places that I'm aware of:
And there are similar situations in the following, but it's more like a single complex test case in each:
/priority backlog
We can use a kind cluster.
The test should create a basic Capacity, a Queue and a batch/v1 Job, and wait for it to complete.
Suggestions:
/kind feature
Hello fellow HPC and batch enthusiasts, I have read your public doc with much interest and I have seen that the GPU is mentioned a couple of times. To make kueue and GPUs a success story I think we need to align the requirements that kueue needs for scheduling with our k8s stack which should expose the right information that you need to make the right scheduling decisions.
There are dedicated GPUs, MIG slices, vGPU either time shared or MIG backed, those are all features that need to be taken into consideration. Going further if we're doing multi-node with MPI and such, we need to think also about network topologies and node interconnects. You may rather use nodes that have GPUDirect enabled than nodes that have "only" a GPU with a slow ethernet connection.
I am one of the tech-leads for accelerator enablement on Kubernetes at NVIDIA and I am happy to help to move this forward.
Systems like Yarn allow creating a hierarchy of fair sharing, which allows modeling deeper organizational structures with fair-sharing.
Kueue currently supports three organizational levels: Cohort (models a business unit), ClusterQueue (models divisions within a business unit), namespace (models teams within a division). However fair-sharing is only supported at one level, within a cohort.
We opted-out of supporting hierarchy from the beginning for two reasons: (1) it adds complexity to both the API and implementation; (2) it is also not clear that in practice customers need more than two levels of sharing which is what the current model enables and seems to work for other frameworks like Slurm and HTCondor.
As Kueue evolves we likely need to revisit this decision.
bit.ly/kueue-apis defined a weight to dynamically set a borrowing ceiling for each Capacity, based on the total resources in the Cohort and the capacities that have pending workloads.
We need to implement such behavior and remove the ceiling.
The weights and unused resources should lead to a dynamic ceiling that is calculated in every scheduling cycle. The exact semantics of this calculation are not fully understood.
In a given scheduling cycle, which capacities are considered for splitting the unused resources? Only the ones with pending jobs? What about the ones that are already borrowing but have no more pending jobs? What is considered unused resources once some resources have already been borrowed?
There are probably a few interpretations to these questions that lead to slightly different results. We need to explore them and pick one that sounds more reasonable or is based on existing systems.
To prevent users from hijacking a Capacity by creating multiple Queues, we should have a single heap for a Capacity (spin off from #80 (comment))
/kind feature
TotalRequests in workload.Info is currently a map; when iterating over it to assign resources, we will lose the original order of the podsets.
When scheduling, a podset gets assigned flavors depending on the iteration order of this map, and so the assignment will not be deterministic.
This actually caused a flake in the following test:
kueue/pkg/scheduler/scheduler_test.go
Line 406 in b86eeb1
/kind bug
The code in this repo uses an annotation, controller.kubernetes.io/queue-name, that is not registered in https://kubernetes.io/docs/reference/labels-annotations-taints/
We should either register it there, or stop using controller.kubernetes.io as the namespace for that annotation and instead require specifying it as a command line option to the app. That way, end-users wouldn't assume that any particular namespace is expected.
NamespaceSelector in capacity allows controlling which namespaces are allowed to use the capacity.
/kind feature
Currently a job's resources are reclaimed by Kueue only when the whole job finishes; for jobs with multiple pods, this entails waiting until the last pod finishes. This is not efficient as the pods of a parallel job may have laggards consuming little resources compared to the overall job.
One solution is to continuously update the Workload object with the number of completed pods so that Kueue can gradually reclaim the resources of those pods.
We need a simple framework to support different policies or algorithms for every phase of job scheduling.
/kind feature
/cc @ahg-g @alculquicondor
/kind feature
For better organization, also split the integration tests in a dedicated folder.
This should help filter logs by namespace.
Although less important for Capacity (because it's ClusterScoped), I prefer to have everything uniform.
/kind cleanup
That is kubeflow's mpi-operator. We could have started with other custom jobs, but this one seems important enough for our audience.
They currently don't have a suspend field, so we need to add it. Then, we program the controller based on the existing kueue job-controller.
/label feature
/size L
/priority important-longterm
This is critical to better understand kueue's limits and where its bottlenecks are. We should check if there is a way to use clusterloader for this.
We should have a clear path towards support spark and other dynamically sized jobs. Another example of this is Ray.
One related aspect is to support dynamically updating the resource requirements of a workload; we can probably limit that to supporting changes to the count of a PodSet in QueuedWorkload (in Spark, the number of workers could change during the runtime of the job, but not the resource requirements of a worker).
One idea is to model it in a way similar to "in-place update to pod resources" [1], but in our case it would be the count that is mutable. The driver pod in Spark would watch the corresponding QueuedWorkload instance and adjust the number of workers when the new count is admitted.
Currently ClusterQueue supports usage limits at a specific point in time. A common use case is for batch admins to set up budgets, meaning usage limits over periods of time; for example, x cores over a period of one month.
Currently, this would be very cumbersome due to the lack of support from kubebuilder kubernetes-sigs/controller-tools#656
Once the support is added and we are ready to publish a v1beta1, we should consider renaming the api group. Note that this requires an official api-review kubernetes/enhancements#1111
Summary doc: https://docs.google.com/document/d/1Uu4hfGxux4Wh_laqZMLxXdEVdty06Sb2DwB035hj700/edit?usp=sharing&resourcekey=0-b7mU7mGPCkEfhjyYDsXOBg (join https://groups.google.com/a/kubernetes.io/g/wg-batch to access)
Potential changes when graduating:
- Move admission from the Workload spec into the status (from #498)
- Rename min and max into something easier to understand.
- Add ObjectMeta into each PodSet template.
We'd better rename these variables to avoid unnecessary trouble; names such as cap, copy and new shadow Go builtins.
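A small illustration of the trouble with such names: declaring a variable called cap shadows the builtin for the rest of the scope.

```go
package main

import "fmt"

func main() {
	s := []int{1, 2, 3}
	fmt.Println(cap(s)) // builtin cap is still visible here: prints 3

	cap := 10 // shadows the builtin from this point on
	_ = cap
	// fmt.Println(cap(s)) // would no longer compile: cap is an int, not a function
	fmt.Println(len(s)) // prints 3
}
```

The same applies to copy and new; renaming such variables avoids confusing compile errors later.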
/kind cleanup
This can be done in parallel
kueue/pkg/scheduler/scheduler.go
Lines 74 to 76 in f3b25fd
We should check what else can be parallelized and set the number of threads via configuration.
/kind feature
As the title says.
/kind bug
This is a placeholder to discuss priority semantics.
We can have it at the workload level or queue level.
/kind feature
/size M
Something more comprehensive than the existing README. Some of the use cases in bit.ly/kueue-apis can be dumped into samples/guides.
If possible, generate some documentation out of the APIs, similar to https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.23/
Contents (not necessarily each one will be a page, but they could be sections on existing pages).
Probably for the workloads that weren't returned in this loop:
kueue/pkg/scheduler/scheduler.go
Line 142 in 9912f26
After running golangci-lint and gofmt, the following is shown:
pkg/capacity/capacity_test.go:37:19: Error return value is not checked (errcheck)
kueue.AddToScheme(scheme)
^
pkg/capacity/capacity_test.go:66:23: Error return value of `cache.AddCapacity` is not checked (errcheck)
cache.AddCapacity(context.Background(), &c)
^
pkg/capacity/capacity_test.go:89:26: Error return value of `cache.UpdateCapacity` is not checked (errcheck)
cache.UpdateCapacity(&c)
^
pkg/capacity/capacity_test.go:203:19: Error return value is not checked (errcheck)
kueue.AddToScheme(scheme)
^
pkg/capacity/snapshot_test.go:38:19: Error return value is not checked (errcheck)
kueue.AddToScheme(scheme)
^
pkg/capacity/snapshot_test.go:122:20: Error return value of `cache.AddCapacity` is not checked (errcheck)
cache.AddCapacity(context.Background(), &cap)
^
pkg/queue/manager.go:61:21: Error return value of `qImpl.setProperties` is not checked (errcheck)
qImpl.setProperties(q)
^
pkg/queue/manager_test.go:394:19: Error return value of `manager.AddQueue` is not checked (errcheck)
manager.AddQueue(ctx, &q)
^
pkg/queue/manager_test.go:452:20: Error return value of `manager.AddQueue` is not checked (errcheck)
manager.AddQueue(ctx, &q)
^
pkg/queue/manager_test.go:462:19: Error return value of `manager.AddQueue` is not checked (errcheck)
manager.AddQueue(ctx, &q)
^
pkg/scheduler/scheduler.go:200:32: Error return value of `s.capacityCache.AssumeWorkload` is not checked (errcheck)
s.capacityCache.AssumeWorkload(newWorkload)
^
pkg/capacity/capacity.go:120:2: S1023: redundant `return` statement (gosimple)
return
^
pkg/capacity/snapshot_test.go:292:4: SA9003: empty branch (staticcheck)
if m == nil {
^
make: *** [Makefile:73: ci-lint] Error 1
Since the scheduler works on a snapshot, it's possible that a workload is deleted between the time we get it from a queue and when we assume it.
We should check the client cache before Assuming a workload to make sure it still exists.
Also, when a workload is deleted, we should clear the cache even if the workload API object is not assigned (regardless of DeleteStateUnknown). This is because the workload could be deleted between the time the scheduler Assumes a workload and it updates the assignment in the API.
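A toy model of that re-check (all names are illustrative, not kueue's actual cache API): verify the workload still exists immediately before assuming it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errGone = errors.New("workload no longer exists")

// toyCache models only the idea; kueue's real capacity cache differs.
type toyCache struct {
	mu      sync.Mutex
	live    map[string]bool // workloads known to still exist
	assumed map[string]bool
}

// AssumeWorkload re-checks existence under the lock, so a workload
// deleted after the scheduler snapshotted it is not assumed.
func (c *toyCache) AssumeWorkload(name string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.live[name] {
		return errGone
	}
	c.assumed[name] = true
	return nil
}

func main() {
	c := &toyCache{live: map[string]bool{"sample-job": true}, assumed: map[string]bool{}}
	fmt.Println(c.AssumeWorkload("sample-job")) // <nil>
	delete(c.live, "sample-job")                // deleted between snapshot and assume
	fmt.Println(c.AssumeWorkload("sample-job")) // workload no longer exists
}
```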
/kind bug
We don't necessarily need to wait for a production-ready version. We can publish alpha/beta builds
/kind feature
With regards to CQ deletions, perhaps we can inject finalizers to block the delete until all running workloads finish, while at the same time stopping admission of new workloads.
What about CQ updates? One simple solution is to make everything immutable, so that updating a CQ is only possible by recreating it; this reduces an update to a delete, which we already handled above. We can relax this a little by allowing the following updates:
All of those updates don't impact running workloads and can be done without checking for current usage levels.
/kind feature
A queued workload can be pending for several reasons:
We need to find a way to set this information.
Probably the first two can happen in the queuedworkload_controller, after every update.
The other two should probably happen during scheduling.
/kind feature
Capacity not only defines usage limits for a set of tenants, but it is the level at which ordering will be done for workloads submitted to queues sharing a capacity.
Renaming Capacity to ClusterQueue could provide clarity, with Queue being the namespaced equivalent serving two purposes: