kubedl-io / kubedl

Run your deep learning workloads on Kubernetes more easily and efficiently.

Home Page: https://kubedl.io/

License: Apache License 2.0

Dockerfile 0.05% Makefile 0.19% Go 74.69% Shell 0.11% Smarty 0.10% JavaScript 23.24% Less 1.02% TypeScript 0.06% EJS 0.26% Python 0.28%
container deep-learning inference kubernetes machine-learning model scheduling

kubedl's People

Contributors

13241308289, aerok, alibaba-oss, arugal, basit9958, bourbonwang, ccchenjiahuan, cjiee, dependabot[bot], diaozhongpu, fossabot, hegaoyuan, hoaresky, jian-he, kerthcet, qijune, qisikai, sbdtu5498, shikanon, shinytang6, simoncqk, testwill, tzzcfrank, yhalpha, zjchenn


kubedl's Issues

[M+] dashboard for users to manage and manipulate jobs.

KubeDL users can manipulate jobs directly via kubectl or arena in a terminal, but that has a learning curve for Kubernetes beginners. A dashboard would be a more user-friendly way to manage and manipulate jobs.

[M+] Model Management And Version Tracking.

Model lineage and versioning to track the history of a model natively in a CRD: which data and which image each version of the model was trained with, which version is currently running, etc.

Change the signature of the UpdateJobStatus interface?

In the previous ControllerInterface design, the UpdateJobStatus method is as follows:

// UpdateJobStatus updates the job status and job conditions
UpdateJobStatus(job interface{}, replicas map[ReplicaType]*ReplicaSpec, jobStatus *JobStatus) error

When deciding whether a job should restart, we check the pod exit code and phase. The tf-operator implementation passes a restart bool down to updateSingleJobStatus, but with the current interface design we can only decide whether to restart by comparing the restart policy alone, which may cause a bug. Alternatively, we could recompute the restart flag by iterating over all pods again, but that is quite inefficient.

So I propose changing the UpdateJobStatus signature to:

UpdateJobStatus(job interface{}, replicas map[ReplicaType]*ReplicaSpec, jobStatus *JobStatus, restart bool) error

The restarting mechanism would then be required of all workload controllers.
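
A minimal sketch of how a workload controller might implement the proposed signature is shown below; the TFController receiver and the helper names (updateJobConditions, updateReplicaStatus, JobRestarting) are illustrative assumptions, not the actual KubeDL code:

func (tc *TFController) UpdateJobStatus(job interface{}, replicas map[ReplicaType]*ReplicaSpec,
	jobStatus *JobStatus, restart bool) error {
	tfJob, ok := job.(*TFJob)
	if !ok {
		return fmt.Errorf("expected *TFJob, got %T", job)
	}
	for rtype, spec := range replicas {
		// The restart flag is computed once by the caller from pod exit codes and
		// phases, so the controller no longer re-derives it from the RestartPolicy.
		if restart {
			updateJobConditions(jobStatus, JobRestarting,
				fmt.Sprintf("%s replica %s is restarting", tfJob.Name, rtype))
			continue
		}
		updateReplicaStatus(jobStatus, rtype, spec)
	}
	return nil
}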

Can't update CRD status successfully in v1.16+ k8s cluster

It seems that in v1.16+ Kubernetes clusters, KubeDL cannot reconcile jobs successfully. The logs show that Pods and Services are created, but when updating the job status to the api-server, which in code is:

r.Status().Update(context.Background(), jobCpy)

it always throws an error saying the xxx job cannot be found. In a normal v1.14 cluster it works fine. This issue reproduces in an AliCloud v1.16.2 cluster and a minikube v1.17.0 cluster. The root cause may be that in v1.16+, CRDs have moved to GA and the CRD-related components have changed behavior.

ref: https://book-v1.book.kubebuilder.io/basics/status_subresource.html
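
If the root cause is that the job CRDs now need an explicit status subresource (as the kubebuilder reference above describes), a minimal sketch of the fix is to add the standard kubebuilder marker to the job types and regenerate the CRD manifests; the TFJob layout here is illustrative:

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// TFJob is the Schema for the tfjobs API.
type TFJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   TFJobSpec `json:"spec,omitempty"`
	Status JobStatus `json:"status,omitempty"`
}

With the status subresource enabled, r.Status().Update() targets the /status endpoint of the CRD.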

no `BackoffCounts` recorded in any field of JobStatus or other data structure in memory.

In controller-runtime based operator implementations, the work queue is well encapsulated and transparent to operator developers. But in some scenarios we still need information exposed by the work queue, e.g. we rely on its NumRequeues(jobKey) method to retrieve the number of previous retries of a job and determine whether the retry count exceeds the backoff limit.

However, we can't get this information because the work queue is encapsulated inside controller-runtime. I opened kubernetes-sigs/controller-runtime#686 to report the problem, but the situation is still not elegant.

So I propose adding a new field named BackoffCounts to the common JobStatus definition, to record the retry count directly in the status:

type JobStatus struct {
    // ...
    
    // BackoffCounts records the number of retries for this job.
    BackoffCounts    int32

    // ...
}

For now, we create a FakeQueue to adapt the JobController, and it actually does nothing. If the backoff info were exposed by this new field, the FakeQueue could be removed.
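
A minimal sketch of how the controller could maintain the proposed field; the helper below is hypothetical, and the backoff-limit check reads the status instead of the work queue:

// recordRestartAndCheckBackoff bumps the persisted retry count and reports whether
// the job has exceeded its backoff limit (hypothetical helper, not existing code).
func (jc *JobController) recordRestartAndCheckBackoff(jobStatus *JobStatus, backoffLimit int32) bool {
	jobStatus.BackoffCounts++
	return jobStatus.BackoffCounts > backoffLimit
}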

xdljobs worker print log issue

For xdljobs, the logs from each worker are not printed once per step; instead, all of the logs are returned in one batch after the run completes. Has anyone encountered this, and what might be the cause?

[BUG] kubedl manager startup fails

What happened:
The kubedl manager fails to start, as shown below:

I0723 23:28:46.174606 1 controllers.go:42] workload Inference controller has started.
2021-07-23T23:28:46.174Z INFO setup setting up storage backends
F0723 23:28:46.174685 1 client.go:32] get clientMgr fail, clientMgr is nil

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

GetImageConfig should return default configs for the images

Here we could return the default image config instead of returning an empty one:

func (h *KubeDLHandler) GetImageConfig() *ImageConfig {
	cm := &v1.ConfigMap{}
	err := h.client.Get(context.Background(), types.NamespacedName{
		Namespace: constants.KubeDLSystemNamespace,
		Name:      constants.KubeDLConsoleConfig,
	}, cm)
	if err != nil {
		glog.Infof("get image config error: %v", err)
		return &ImageConfig{}
	}
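
A minimal sketch of the suggested change for the error branch only, assuming a hypothetical package-level defaultImageConfig populated at startup with the built-in images:

	if err != nil {
		glog.Infof("get image config error: %v, falling back to defaults", err)
		cfg := defaultImageConfig // hypothetical default, not the existing code
		return &cfg
	}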

TAG-Runtime presentation/discussion

Hello kubeDL team,

I'm one of the co-chairs of the CNCF TAG-Runtime. I'm reaching out because I think it would be great for you to present and discuss the project at one of our meetings, for example a general overview of the project and/or a demo.

Feel free to add it to our agenda or reach out to me (raravena80 at gmail.com)

Thanks!

upgrade the gang-scheduler implementation from kube-batch to volcano

As we all know, the volcano scheduler is developed based on kube-batch v1alpha2 and is now recommended by the community instead of kube-batch; volcano is actively contributed to, while kube-batch has not been updated for quite a long time.
Another main motivation for the upgrade is that kube-batch's dependencies are pinned to k8s v1.13 and have gone stale, conflicting with other, higher-version k8s dependencies; volcano is free from these troubles.

unable to start manager when setting -enable-leader-election=true

When running the manager:

go run ./main.go -enable-leader-election=true

I met this error log:

2020-10-12T13:12:47.578+0800	ERROR	setup	unable to start manager	{"error": "LeaderElectionID must be configured"}
github.com/go-logr/zapr.(*zapLogger).Error
	/Users/qijun/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
main.main
	/Users/qijun/works/kubedl/main.go:77
runtime.main
	/usr/local/Cellar/go/1.14.5/libexec/src/runtime/proc.go:203
exit status 1
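
The likely fix is to supply a LeaderElectionID alongside LeaderElection in the manager options. A minimal sketch against controller-runtime follows; the ID value and the surrounding main.go variables (scheme, enableLeaderElection, setupLog) are assumptions:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:           scheme,
	LeaderElection:   enableLeaderElection,
	LeaderElectionID: "kubedl-manager-election", // any non-empty, cluster-unique ID
})
if err != nil {
	setupLog.Error(err, "unable to start manager")
	os.Exit(1)
}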

controllers fetch the latest job via the client, but the status is stale in case of a slow etcd watch

We found a data-consistency anomaly between the local informer cache and etcd in case of a slow etcd watch. We can reproduce this in a Kubernetes cluster with unstable networking, for example:

  1. one master pod of pytorchjob job-1 fails;
  2. the pytorch controller updates job-1 as Failed and the apiserver responds with success;
  3. another reconcile starts, but the freshly fetched job status is still Running, which is unexpected.

The reason is that READ requests (get/list resources) hit the local informer cache by default, but the cached object is only updated after the update event is broadcast from etcd; if the watch is slow and event broadcasting is delayed, the local cache goes stale.

   Controller               Local-Informer      ApiServer
     failed                    running           running
      
1.   |-------------------------------------->   failed
2.   | <------------------->  running            |
3.   |                        failed  <--------  |
4.   | <------------------>   failed             |
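
One possible mitigation, sketched under the assumption that the reconciler can hold a direct reader from the manager, is to read the job from the API server instead of the informer cache before making status transitions:

// client.Reader backed by the API server rather than the cache.
apiReader := mgr.GetAPIReader()

latest := &PyTorchJob{}
if err := apiReader.Get(context.Background(), types.NamespacedName{
	Namespace: req.Namespace, Name: req.Name,
}, latest); err != nil {
	return ctrl.Result{}, client.IgnoreNotFound(err)
}
// latest.Status now reflects what the API server has, even if the local cache is stale.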

error when running 'make install' or 'make deploy'

/Users/qiukaichen/go/bin/controller-gen "crd" rbac:roleName=manager-role webhook paths="./apis/..." paths="./controllers/..." output:crd:artifacts:config=config/crd/bases
kustomize build config/crd | kubectl apply -f -
customresourcedefinition.apiextensions.k8s.io/xdljobs.xdl.kubedl.io created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.xgboostjob.kubeflow.org created
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "pytorchjobs.kubeflow.org" is invalid: spec.validation.openAPIV3Schema.properties[metadata]: Forbidden: must not specify anything other than name and generateName, but metadata is implicitly specified
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "tfjobs.kubeflow.org" is invalid: spec.validation.openAPIV3Schema.properties[metadata]: Forbidden: must not specify anything other than name and generateName, but metadata is implicitly specified
make: *** [install] Error 1

pytorch job seems to succeed but output data is incomplete

Dear, all

When using KubeDL to manage distributed PyTorch jobs, I found that sometimes the output data of a PyTorch job is not fully saved to the storage system, yet the job still claims itself successful.

To understand this, I checked the detailed logs of each worker.
The logs show that after the master role of the PyTorch job completes, the whole job is terminated regardless of the still-running pods of the other workers. I think this is the reason for the incomplete output data.

It looks like a bug to me. Could you please take a look?

Thanks,
Wencong

[M5/Feature Request] Orchestrating Job Roles in DAG Scheduling Scheme.

In our production environment, we found that a portion of scenarios rely on scheduling job replicas in stages; otherwise there can be severe failures, for example:

  1. for a PyTorchJob, if the Workers enter the Running phase before the Master (the Master pod may be hanging on image pulls or for other reasons), they crash immediately because the Workers cannot reach the Master, and the Job goes to Failed.
  2. for an MPIJob, if the Launcher enters the Running phase before the Workers (Worker pods may be hanging on image pulls), the Launcher exits unexpectedly because the kubectl exec command cannot reach the target container in each Worker; similarly, the Job ultimately fails.

In addition, DAG scheduling can also improve efficiency in certain scenarios, for example scheduling the PS before the Workers for a TFJob to reduce worker-stall time.
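
A purely illustrative sketch (not the actual KubeDL API) of how a per-replica DAG condition could be expressed in the spec, so that e.g. PyTorch Workers are only created after the Master reaches the Running phase:

type DAGCondition struct {
	// UpstreamReplica is the replica type this replica waits for.
	UpstreamReplica ReplicaType `json:"upstreamReplica"`
	// OnPhase is the pod phase the upstream replica must reach (e.g. "Running")
	// before this replica's pods are created.
	OnPhase corev1.PodPhase `json:"onPhase"`
}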
