kubedl-io / kubedl

Run your deep learning workloads on Kubernetes more easily and efficiently.

Home Page: https://kubedl.io/

License: Apache License 2.0

Dockerfile 0.05% Makefile 0.19% Go 74.69% Shell 0.11% Smarty 0.10% JavaScript 23.24% Less 1.02% TypeScript 0.06% EJS 0.26% Python 0.28%
container deep-learning inference kubernetes machine-learning model scheduling

kubedl's People

Contributors

13241308289, aerok, alibaba-oss, arugal, basit9958, bourbonwang, ccchenjiahuan, cjiee, dependabot[bot], diaozhongpu, fossabot, hegaoyuan, hoaresky, jian-he, kerthcet, qijune, qisikai, sbdtu5498, shikanon, shinytang6, simoncqk, testwill, tzzcfrank, yhalpha, zjchenn


kubedl's Issues

[M+] dashboard for users to manage and manipulate jobs.

KubeDL users can manipulate jobs directly via kubectl or arena in a terminal, but that has a learning curve for Kubernetes beginners. A dashboard would be a more user-friendly way to manage and manipulate jobs.

[M+] Model Management And Version Tracking.

Model lineage and versioning to track the history of a model natively in a CRD: which data and which image each version of the model was trained with, which version is currently running, etc.

Change the signature of the UpdateJobStatus interface?

In the previous ControllerInterface design, the UpdateJobStatus method is as follows:

// UpdateJobStatus updates the job status and job conditions
UpdateJobStatus(job interface{}, replicas map[ReplicaType]*ReplicaSpec, jobStatus *JobStatus) error

When deciding whether a job should restart, we check the pod exit code and phase. The tf-operator implementation passes a restart bool down to updateSingleJobStatus, but with the current interface design we can only decide whether to restart by comparing the restart policy alone, which may cause a bug. Alternatively, we could recompute the restart flag by iterating over all pods again, but that is quite inefficient.

So I propose changing the UpdateJobStatus signature to:

UpdateJobStatus(job interface{}, replicas map[ReplicaType]*ReplicaSpec, jobStatus *JobStatus, restart bool) error

The restarting mechanism would then be required of all workload controllers.
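
A minimal sketch of how a workload controller might implement the proposed signature is shown below; the TFController receiver and the helper names (updateJobConditions, updateReplicaStatus, JobRestarting) are illustrative assumptions, not the actual KubeDL code:

func (tc *TFController) UpdateJobStatus(job interface{}, replicas map[ReplicaType]*ReplicaSpec,
	jobStatus *JobStatus, restart bool) error {
	tfJob, ok := job.(*TFJob)
	if !ok {
		return fmt.Errorf("expected *TFJob, got %T", job)
	}
	for rtype, spec := range replicas {
		// The restart flag is computed once by the caller from pod exit codes and
		// phases, so the controller no longer re-derives it from the RestartPolicy.
		if restart {
			updateJobConditions(jobStatus, JobRestarting,
				fmt.Sprintf("%s replica %s is restarting", tfJob.Name, rtype))
			continue
		}
		updateReplicaStatus(jobStatus, rtype, spec)
	}
	return nil
}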

Can't update CRD status successfully in v1.16+ k8s cluster

It seems that in v1.16+ Kubernetes clusters, KubeDL cannot reconcile jobs successfully. The logs show that Pods and Services are created, but when updating the job status to the api-server, which in code is:

r.Status().Update(context.Background(), jobCpy)

it always throws an error saying the xxx job cannot be found. In a normal v1.14 cluster it works fine. This issue reproduces in an AliCloud v1.16.2 cluster and a minikube v1.17.0 cluster. The root cause may be that in v1.16+, CRDs have moved to GA and the CRD-related components have changed behavior.

ref: https://book-v1.book.kubebuilder.io/basics/status_subresource.html
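
If the root cause is that the job CRDs now need an explicit status subresource (as the kubebuilder reference above describes), a minimal sketch of the fix is to add the standard kubebuilder marker to the job types and regenerate the CRD manifests; the TFJob layout here is illustrative:

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// TFJob is the Schema for the tfjobs API.
type TFJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   TFJobSpec `json:"spec,omitempty"`
	Status JobStatus `json:"status,omitempty"`
}

With the status subresource enabled, r.Status().Update() targets the /status endpoint of the CRD.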

no `BackoffCounts` recorded in any field of JobStatus or other data structure in memory.

In controller-runtime based operator implementations, the work queue is well encapsulated and transparent to operator developers. But in some scenarios we still need information exposed by the work queue, e.g. we rely on its NumRequeues(jobKey) method to retrieve the number of previous retries of a job and determine whether the retry count exceeds the backoff limit.

However, we can't get this information because the work queue is encapsulated inside controller-runtime. I opened kubernetes-sigs/controller-runtime#686 to report the problem, but the situation is still not elegant.

So I propose adding a new field named BackoffCounts to the common JobStatus definition, to record the retry count directly in the status:

type JobStatus struct {
    // ...
    
    // BackoffCounts records the number of retries for this job.
    BackoffCounts    int32

    // ...
}

For now, we create a FakeQueue to adapt the JobController, and it actually does nothing. If the backoff info were exposed by this new field, the FakeQueue could be removed.
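
A minimal sketch of how the controller could maintain the proposed field; the helper below is hypothetical, and the backoff-limit check reads the status instead of the work queue:

// recordRestartAndCheckBackoff bumps the persisted retry count and reports whether
// the job has exceeded its backoff limit (hypothetical helper, not existing code).
func (jc *JobController) recordRestartAndCheckBackoff(jobStatus *JobStatus, backoffLimit int32) bool {
	jobStatus.BackoffCounts++
	return jobStatus.BackoffCounts > backoffLimit
}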

xdljobs worker print log issue

For xdljobs, the logs from each worker are not printed once per step; instead, all of the logs are returned in one batch after the run completes. Has anyone encountered this, and what might be the cause?

[BUG] kubedl manager startup fails

What happened:
The kubedl manager fails to start, as shown below:

I0723 23:28:46.174606 1 controllers.go:42] workload Inference controller has started.
2021-07-23T23:28:46.174Z INFO setup setting up storage backends
F0723 23:28:46.174685 1 client.go:32] get clientMgr fail, clientMgr is nil

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

GetImageConfig should return default configs for the images

Here we could return the default image config instead of returning an empty one:

func (h *KubeDLHandler) GetImageConfig() *ImageConfig {
	cm := &v1.ConfigMap{}
	err := h.client.Get(context.Background(), types.NamespacedName{
		Namespace: constants.KubeDLSystemNamespace,
		Name:      constants.KubeDLConsoleConfig,
	}, cm)
	if err != nil {
		glog.Infof("get image config error: %v", err)
		return &ImageConfig{}
	}
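
A minimal sketch of the suggested change for the error branch only, assuming a hypothetical package-level defaultImageConfig populated at startup with the built-in images:

	if err != nil {
		glog.Infof("get image config error: %v, falling back to defaults", err)
		cfg := defaultImageConfig // hypothetical default, not the existing code
		return &cfg
	}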

TAG-Runtime presentation/discussion

Hello kubeDL team,

I'm one of the co-chairs of the CNCF TAG-Runtime. I'm reaching out because I think it would be great for you to present and discuss the project at one of our meetings, for example a general overview of the project and/or a demo.

Feel free to add it to our agenda or reach out to me (raravena80 at gmail.com)

Thanks!

upgrade the gang-scheduler implementation from kube-batch to volcano

As we all know, the volcano scheduler is developed based on kube-batch v1alpha2 and is now recommended by the community instead of kube-batch; volcano is actively contributed to, while kube-batch has not been updated for quite a long time.
Another main motivation for the upgrade is that kube-batch's dependencies are pinned to k8s v1.13 and have gone stale, conflicting with other, higher-version k8s dependencies; volcano is free from these troubles.

unable to start manager when setting -enable-leader-election=true

When running the manager:

go run ./main.go -enable-leader-election=true

I met this error log:

2020-10-12T13:12:47.578+0800	ERROR	setup	unable to start manager	{"error": "LeaderElectionID must be configured"}
github.com/go-logr/zapr.(*zapLogger).Error
	/Users/qijun/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
main.main
	/Users/qijun/works/kubedl/main.go:77
runtime.main
	/usr/local/Cellar/go/1.14.5/libexec/src/runtime/proc.go:203
exit status 1
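
The likely fix is to supply a LeaderElectionID alongside LeaderElection in the manager options. A minimal sketch against controller-runtime follows; the ID value and the surrounding main.go variables (scheme, enableLeaderElection, setupLog) are assumptions:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:           scheme,
	LeaderElection:   enableLeaderElection,
	LeaderElectionID: "kubedl-manager-election", // any non-empty, cluster-unique ID
})
if err != nil {
	setupLog.Error(err, "unable to start manager")
	os.Exit(1)
}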

controllers fetch the latest job via the client, but the status is stale in case of a slow etcd watch

We found a data-consistency anomaly between the local informer cache and etcd in case of a slow etcd watch. We can reproduce this in a Kubernetes cluster with unstable networking, for example:

  1. one master pod of pytorchjob job-1 fails;
  2. the pytorch controller updates job-1 as Failed and the apiserver responds with success;
  3. another reconcile starts, but the freshly fetched job status is still Running, which is unexpected.

The reason is that READ requests (get/list resources) hit the local informer cache by default, but the cached object is only updated after the update event is broadcast from etcd; if the watch is slow and event broadcasting is delayed, the local cache goes stale.

   Controller               Local-Informer      ApiServer
     failed                    running           running
      
1.   |-------------------------------------->   failed
2.   | <------------------->  running            |
3.   |                        failed  <--------  |
4.   | <------------------>   failed             |
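
One possible mitigation, sketched under the assumption that the reconciler can hold a direct reader from the manager, is to read the job from the API server instead of the informer cache before making status transitions:

// client.Reader backed by the API server rather than the cache.
apiReader := mgr.GetAPIReader()

latest := &PyTorchJob{}
if err := apiReader.Get(context.Background(), types.NamespacedName{
	Namespace: req.Namespace, Name: req.Name,
}, latest); err != nil {
	return ctrl.Result{}, client.IgnoreNotFound(err)
}
// latest.Status now reflects what the API server has, even if the local cache is stale.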

error when running 'make install' or 'make deploy'

/Users/qiukaichen/go/bin/controller-gen "crd" rbac:roleName=manager-role webhook paths="./apis/..." paths="./controllers/..." output:crd:artifacts:config=config/crd/bases
kustomize build config/crd | kubectl apply -f -
customresourcedefinition.apiextensions.k8s.io/xdljobs.xdl.kubedl.io created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.xgboostjob.kubeflow.org created
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "pytorchjobs.kubeflow.org" is invalid: spec.validation.openAPIV3Schema.properties[metadata]: Forbidden: must not specify anything other than name and generateName, but metadata is implicitly specified
Error from server (Invalid): error when creating "STDIN": CustomResourceDefinition.apiextensions.k8s.io "tfjobs.kubeflow.org" is invalid: spec.validation.openAPIV3Schema.properties[metadata]: Forbidden: must not specify anything other than name and generateName, but metadata is implicitly specified
make: *** [install] Error 1

pytorch job seems to succeed but output data is incomplete

Dear, all

When using KubeDL to manage distributed PyTorch jobs, I found that sometimes the output data of a PyTorch job is not fully saved to the storage system, yet the job still claims itself successful.

To understand this, I checked the detailed logs of each worker.
The logs show that after the master role of the PyTorch job completes, the whole job is terminated regardless of the still-running pods of the other workers. I think this is the reason for the incomplete output data.

It looks like a bug to me. Could you please take a look?

Thanks,
Wencong

[M5/Feature Request] Orchestrating Job Roles in DAG Scheduling Scheme.

In our production environment, we found that a portion of scenarios rely on scheduling job replicas in stages; otherwise there can be severe failures, for example:

  1. for a PyTorchJob, if the Workers enter the Running phase before the Master (the Master pod may be hanging on image pulls or for other reasons), they crash immediately because the Workers cannot reach the Master, and the Job goes to Failed.
  2. for an MPIJob, if the Launcher enters the Running phase before the Workers (Worker pods may be hanging on image pulls), the Launcher exits unexpectedly because the kubectl exec command cannot reach the target container in each Worker; similarly, the Job ultimately fails.

In addition, DAG scheduling can also improve efficiency in certain scenarios, for example scheduling the PS before the Workers for a TFJob to reduce worker-stall time.
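
A purely illustrative sketch (not the actual KubeDL API) of how a per-replica DAG condition could be expressed in the spec, so that e.g. PyTorch Workers are only created after the Master reaches the Running phase:

type DAGCondition struct {
	// UpstreamReplica is the replica type this replica waits for.
	UpstreamReplica ReplicaType `json:"upstreamReplica"`
	// OnPhase is the pod phase the upstream replica must reach (e.g. "Running")
	// before this replica's pods are created.
	OnPhase corev1.PodPhase `json:"onPhase"`
}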
