
edl's Introduction





Motivation

Computing resources on clouds such as Amazon AWS and Baidu Cloud are multi-tenant, and training and inference of deep learning models with elastic resources will become common there. We propose Elastic Deep Learning (EDL), which makes training and inference of deep learning models on the cloud easier and more efficient.

Now EDL is an incubation-stage project of the LF AI Foundation.

Installation

The EDL package supports Python 2.7/3.6/3.7. You can install it with pip install paddle_edl, but we highly recommend using it inside our Docker image:

docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run -it --name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash

Latest Release (0.3.1)

  • Supports elastic training with inference-type services during training, such as knowledge distillation
  • Inference-type services are automatically registered through service discovery in EDL
  • Knowledge distillation examples in computer vision and natural language processing

Quick start Demo

  • Install Paddle Serving
pip install paddle-serving-server-gpu
cd example/distill/resnet

wget --no-check-certificate https://paddle-edl.bj.bcebos.com/distill_teacher_model/ResNeXt101_32x16d_wsl_model.tar.gz
tar -zxf ResNeXt101_32x16d_wsl_model.tar.gz

python -m paddle_serving_server_gpu.serve \
  --model ResNeXt101_32x16d_wsl_model \
  --mem_optim \
  --port 9898 \
  --gpu_ids 1
  • The student model: ResNet50_vd (i.e. ResNet-D in the paper). Train the student on GPU 0.
python -m paddle.distributed.launch --selected_gpus 0 \
  ./train_with_fleet.py \
  --model=ResNet50_vd \
  --data_dir=./ImageNet \
  --use_distill_service=True \
  --distill_teachers=127.0.0.1:9898
| mode | teacher resource | student resource | total batch size | acc1 | acc5 | speed (img/s) |
| --- | --- | --- | --- | --- | --- | --- |
| pure train | None | 8 * v100 | 256 | 77.1 | 93.5 | 1828 |
| teacher and student on the same GPUs | 8 * v100 | 8 * v100 | 256 | 79.0 | 94.3 | 656 |
| EDL service distill | 40 * P4 | 8 * v100 | 256 | 79.0 | 94.5 | 1514 |

About Knowledge Distillation in EDL

  • Theory: Distilling the Knowledge in a Neural Network
    • Knowledge distillation generally consists of two parts: strong teachers and weak students.
    • The student model learns from the feed-forward results of a teacher or a mixture-of-teachers model to achieve better accuracy.
  • Application scenarios of EDL knowledge distillation
    • Teacher models and student models run on the same GPU devices, so training throughput is not maximized.
    • The offline GPU cluster has limited resources, but some online GPU resources can be used during training.
    • Heterogeneous teacher models can improve the student model's performance but are hard to deploy on a single GPU card due to memory limitations.
    • The computation burden of teacher models and student models is hard to balance in a way that maximizes training throughput.
  • Solution:
    • Deploy teacher models as online inference service through Paddle Serving
    • Online inference services are elastic and are registered to EDL service management modules.
    • Dynamic adaptation of the number of teacher models' online instances to maximize students' training throughput and resource utilization.
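To make the objective concrete, the soft-target loss from "Distilling the Knowledge in a Neural Network" (Hinton et al.) can be sketched with a minimal, framework-free example. This is illustrative only, not EDL's actual training code; the function names and the temperature value are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher T softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between softened teacher and student distributions,
    scaled by T^2 as suggested in the paper."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# A student that matches the teacher exactly incurs zero loss;
# a mismatched student incurs a positive loss.
loss_same = distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_diff = distill_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

In EDL, the teacher's feed-forward results come back from the remote inference service instead of a local forward pass, but the loss computed on the student side follows the same idea.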

Release 0.2.0

Checkpoint based elastic training on multiple GPUs

  • Several training processes run, one per GPU.
  • A master node is responsible for checkpoint saving; all the other nodes are elastic nodes.
  • When elastic nodes join or leave the current training job, training hyper-parameters are adjusted automatically.
  • Newly joined training nodes load the checkpoint from the remote file system automatically.
  • A model checkpoint is saved every several steps, at an interval given by the user.
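The save/resume cycle described above can be sketched as follows. This is a hedged illustration of the pattern, not paddle_edl's actual checkpoint code; the JSON format, file name, and SAVE_EVERY interval are invented for the example (real EDL checkpoints go to a remote file system rather than a local temp directory).

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    """Write the checkpoint atomically so a partial write is never loaded."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

def load_checkpoint(path):
    """Newly joined elastic nodes resume from the latest checkpoint, if any."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

CKPT = os.path.join(tempfile.gettempdir(), "edl_demo_ckpt.json")
SAVE_EVERY = 5  # stand-in for the user-specified checkpoint interval

start_step, state = load_checkpoint(CKPT)
for step in range(start_step, start_step + 20):
    state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
    if (step + 1) % SAVE_EVERY == 0:  # the master node saves periodically
        save_checkpoint(CKPT, step + 1, state)

resumed_step, _ = load_checkpoint(CKPT)
```

A node that crashes or joins late simply calls load_checkpoint and continues from the last saved step instead of restarting the whole job.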

ResNet50 experiments on a single machine in Docker

  • Start a JobServer on one node, which generates the pod-changing scripts.
cd example/demo/collective
node_ips="127.0.0.1"
python -u paddle_edl.demo.collective.job_server_demo \
    --node_ips ${node_ips} \
    --pod_num_of_node 8 \
    --time_interval_to_change 900 \
    --gpu_num_of_node 8
  • Start a JobClient, which controls the worker processes.
# set the ImageNet data path
export PADDLE_EDL_IMAGENET_PATH=<your path>
# set the checkpoint path
export PADDLE_EDL_FLEET_CHECKPOINT_PATH=<your path>

mkdir -p resnet50_pod
unset http_proxy https_proxy

# running under edl
export PADDLE_RUNING_ENV=PADDLE_EDL
export PADDLE_JOB_ID="test_job_id_1234"
export PADDLE_POD_ID="not set"

python -u paddle_edl.demo.collective.job_client_demo \
    --log_level 20 \
    --package_sh ./resnet50/package.sh \
    --pod_path ./resnet50_pod \
    ./train_pretrain.sh
  • Experiment results on a 2-node cluster
| model | dataset | gpu cards | total batch size | acc1 | acc5 |
| --- | --- | --- | --- | --- | --- |
| ResNet50 | ImageNet | 16 * v100 | 1024 | 75.5 | 92.8 |

The whole example is here

Community

FAQ

License

Contribution

edl's People

Contributors

bjjwwang, denkensk, gavin1332, gongweibao, guru4elephant, helinwang, hutuxian, ibrahimhaddad, luotao1, m3ngyang, putcn, qizheng09, seiriosplus, tizhou86, typhoonzero, wanghaoshuang, wangkuiyi, wangxicoding, wopeizl, yancey1989


edl's Issues

[Question]k8s native edl should rely on the capability of PaddlePaddle workload.

To my knowledge, edl should rely on the capability of the PaddlePaddle workload. In other words, PaddlePaddle must have the ability of elastic learning.

If my understanding is correct, could you please direct me to where I can find the design doc of elastic learning for the PaddlePaddle workload? I can understand a bit of edl itself from the source code, but I cannot find anything about the elastic PaddlePaddle workload.

Hyperlinks to our documents are broken

Reported by @Haichao-Zhang :

The hyperlinks in this post on the Baidu Research Blog to our EDL documents are broken.

Could somebody fix this by writing a diff from the current blog post content to the correct one? I can then ask the administrator of the Baidu Research Blog to correct their content.

Thanks!

Run Fluid with EDL

Tasks

  • full fault-tolerant training
    • design doc, PaddlePaddle/Paddle#11625
    • recoverable trainer process without shutting down the whole job
    • recoverable pserver process without shutting down the whole job
    • distributed task queue to manage tasks in etcd
    • distributed reader to fetch record from task queue
    • pserver HA
  • dynamic trainer count on the pserver side, so that we can average gradients according to the current trainer count
  • Upgrade the EDL controller to a CRD so that we can support Kubernetes versions higher than v1.8
  • a tutorial on running a distributed lookup sparse table with EDL
  • update experiment report, https://github.com/PaddlePaddle/cloud/tree/develop/doc/edl/experiment
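The "dynamic trainer count" task in the list above boils down to dividing summed gradients by however many trainers are currently alive, so averages stay correct as trainers join or leave. A minimal sketch of the idea (illustrative only; not PaddlePaddle's actual pserver code, and the function name is an assumption):

```python
def average_gradients(grad_sums, trainer_count):
    """Average summed gradients by the current (dynamic) trainer count.

    With a fixed cluster the divisor can be hard-coded; with elastic
    trainers it must track the live count, or averages drift whenever
    a trainer joins or leaves the job.
    """
    if trainer_count < 1:
        raise ValueError("need at least one trainer")
    return [g / trainer_count for g in grad_sums]

# The same summed gradients, averaged over 4 trainers, then over 2
# after two trainers leave:
avg4 = average_gradients([8.0, 4.0], 4)  # -> [2.0, 1.0]
avg2 = average_gradients([8.0, 4.0], 2)  # -> [4.0, 2.0]
```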

Transfer the edl repo to its own GitHub organization

Please complete the transfer of edl repo to its own GitHub org located in: https://github.com/elasticdeeplearning/.

Here are the steps:

  1. Save the README.md file from the https://github.com/elasticdeeplearning/edl to use it later on in step 4 because it has been updated a little
  2. Delete repo https://github.com/elasticdeeplearning/edl (this was created as a copied repo and not as a transfer)
  3. Follow these instructions: https://help.github.jp/enterprise/2.11/user/articles/transferring-a-repository-owned-by-your-personal-account/ under: "Transferring a repository to another user account or to an organization". This will transfer https://github.com/PaddlePaddle/edl to https://github.com/elasticdeeplearning/edl
  4. Copy the saved README.md (see (1) above) into the newly transferred repo (https://github.com/elasticdeeplearning/edl) so we don't have to update it again
  5. https://github.com/PaddlePaddle/edl should not exist anymore. When you try to go to it, it will automatically forward to https://github.com/elasticdeeplearning/edl

Thank you.
Ibrahim

IP address and Port concat error

When I tried to run distributed CTR training on a Kubernetes cluster, there is a mechanism that collects all pservers' IPs and ports, but the concatenation of IP and port looks wrong.

Initially it is 172.20.1.69:30236, 172.20.1.70:30237, which are pserver 1's and pserver 2's endpoints, but then it displays 172.20.1.69:30236, 172.20.1.70.30237:30236, i.e. pserver 2 has two ports.
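For what it's worth, the symptom looks like IPs and ports being paired incorrectly when the endpoint strings are built. A hedged sketch of a pairing that fails loudly on a length mismatch instead of silently mis-joining (build_endpoints is a hypothetical helper, not EDL's actual code):

```python
def build_endpoints(ips, ports):
    """Pair each pserver IP with exactly one port.

    A length mismatch raises immediately, rather than producing a
    malformed endpoint like '172.20.1.70.30237:30236'.
    """
    if len(ips) != len(ports):
        raise ValueError("each pserver needs exactly one port")
    return ["%s:%d" % (ip, port) for ip, port in zip(ips, ports)]

endpoints = build_endpoints(["172.20.1.69", "172.20.1.70"], [30236, 30237])
# -> ['172.20.1.69:30236', '172.20.1.70:30237']
```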

[request] Please update the links in README

I found that some links in the README return 404, so I think we should update them. For example, I am looking for the docs on Fault-Tolerant Training in PaddlePaddle, but I cannot find them.

I'd appreciate it if anyone could help me.

Thanks 🥂 🍻

Add an EDL tutorial

We need an EDL tutorial to introduce how to use EDL on a Kubernetes cluster.

[Question] Does the edl rely on the capability of PaddlePaddle workload

To my knowledge, edl should rely on the capability of the PaddlePaddle workload. In other words, PaddlePaddle must have the ability of elastic learning.

If my understanding is correct, could you please direct me to where I can find the design doc of elastic learning for the PaddlePaddle workload? I can understand a bit of edl itself from the source code, but I cannot find anything about the elastic PaddlePaddle workload.

Trouble Running Resnet + Imagenet Demo

Hello, I've been trying to run the ResNet + ImageNet demo shown at https://github.com/elasticdeeplearning/edl/tree/develop/example/demo/collective for several days now, with no success. I've tried doing this on my local machine by pip-installing paddle_edl and all associated requirements into a conda environment, in addition to trying the recommended Docker image:

docker pull hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7
nvidia-docker run -name paddle_edl hub.baidubce.com/paddle-edl/paddle_edl:latest-cuda9.0-cudnn7 /bin/bash

I suspect this demo is outdated, and I was wondering if it could be updated or explained in more detail so that I can get it working. If the demo does still work, could someone run it in the expected Docker image and let me know the exact steps to replicate it?

I've tried running the demo using several different combinations of steps but here is what I'm doing in general.

Reproduction Steps:

  1. Enter the recommended docker image and mount my imagenet dataset.
  2. Enter edl/example/demo/collective
  3. Set PADDLE_EDL_IMAGENET_PATH, PADDLE_EDL_FLEET_CHECKPOINT_PATH and PADDLE_JOBSERVER
  4. Run ./start_job_server.sh
  5. Run ./start_job_client.sh
  6. Find failures in either the pod logs, the worker log in each pod directory or client/server logs

Some issues that I have faced so far:

  • I don't know what the specifications are for train.txt, test.txt, or val.txt for the ImageNet dataset, and I get errors using mine with the EDL demo. What preprocessing strategy does EDL expect for its ImageNet dataset, and how is it structured, so I can use my own ImageNet dataset?
  • The line src_dir=../../../collective/resnet50 should be changed to src_dir=../../collective/resnet50. Some directories must have been moved around, as this is not the correct path to the ResNet files.
  • All but one of the created pods fail to establish a connection to their desired endpoints. The failed pods output a message such as:
not ready endpoints:['127.0.0.1:8073', '127.0.0.1:8075', '127.0.0.1:8077', '127.0.0.1:8079', '127.0.0.1:8081', '127.0.0.1:8083', '127.0.0.1:8085']
server not ready, wait 3 sec to retry...

System information:

  • PaddlePaddle version: I have tried with v1.8.5 locally and whatever version is packaged into the docker image
  • EDL version: I have tried with v0.3.1 locally and whatever version is packaged into the docker image
  • GPU: Tesla M60 with CUDA 9.0 and CUDNN 7.0
  • OS Platform: Ubuntu 16.04.6 LTS

Thanks and looking forward to demoing the project!

[bug] Redis client timeout failure

Fault tolerance for timeout failures needs to be added.
Exception in thread Thread-3:
Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/python-2.7.14/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/paddlejob/workspace/env_run/python-2.7.14/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/root/paddlejob/workspace/env_run/python-2.7.14/lib/python2.7/site-packages/paddle_edl/distill/redis/client.py", line 95, in _heartbeat
msg = self._recv_msg()
File "/root/paddlejob/workspace/env_run/python-2.7.14/lib/python2.7/site-packages/paddle_edl/distill/redis/client.py", line 55, in _recv_msg
head = self.client.recv(self.HEAD_SIZE)
error: [Errno 110] Connection timed out

[Question] Does edl rely on the PaddlePaddle elastic learning capability

To my knowledge, edl should rely on the capability of the PaddlePaddle workload. In other words, PaddlePaddle must have the ability of elastic learning.

If my understanding is correct, could you please direct me to where I can find the design doc of elastic learning for the PaddlePaddle workload? I can understand a bit of edl itself from the source code, but I cannot find anything about the elastic PaddlePaddle workload.


Roadmap for supporting different frameworks

  • design doc on implementing generic Python API tools to enable fault-tolerant training. Developers can insert a few lines of code into their training program to enable fault tolerance -- 2 weeks, with a discussion
  • implement this generic Python API for at least 1 framework -- 2 weeks
  • polish the CRD implementation and run tests on real clusters -- 2 weeks
  • implement and test for more frameworks: TensorFlow, Keras, Caffe2 -- 4 weeks

Test edl's functions with the CRD

The tests include:

  • Build images with the Dockerfile;
  • Push the images to Docker Hub;
  • Pull the images built above and run the EDL controller in a k8s cluster;
  • Run a training job in the k8s cluster to test ASGD;
