The Non-IID Data Quagmire of Decentralized Machine Learning

This repo contains the source code for our paper, The Non-IID Data Quagmire of Decentralized Machine Learning (ICML'20), as well as our implementation of Gaia (NSDI'17).

The source code is built on GeePS, a Caffe-based parameter server system.

The following steps assume compatible CUDA and CuDNN are installed. The code is tested on Ubuntu 16.04 with CUDA 10.2 and CuDNN 7.6.5.

If you use Docker, you can start with the nvidia/cuda:10.2-cudnn7-devel-ubuntu16.04 image.

Build the application

Because the following steps install dependencies, we recommend running them only on a clean machine or in a Docker container.

First, clone the project repo and switch into the root directory:

git clone https://github.com/kevinhsieh/non_iid_dml.git
cd non_iid_dml

On Ubuntu 16.04, you can run the following commands to install the dependencies:

./scripts/install-geeps-deps-ubuntu16.sh
./scripts/install-caffe-deps-ubuntu16.sh

Also, please make sure your CUDA library is installed in /usr/local/cuda.
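A quick check before building can catch a missing or misplaced toolkit. The sketch below is only an illustration: the helper name check_cuda_home is hypothetical and not part of this repo, and it assumes that a usable toolkit exposes an executable bin/nvcc under the given directory.

```shell
# Hypothetical helper (not part of this repo): check that a CUDA toolkit
# directory looks usable by testing for an executable bin/nvcc inside it.
# Usage: check_cuda_home [<cuda dir>]   (defaults to /usr/local/cuda)
check_cuda_home() {
    cuda_home="${1:-/usr/local/cuda}"
    if [ -x "${cuda_home}/bin/nvcc" ]; then
        echo "found nvcc in ${cuda_home}"
        return 0
    fi
    echo "nvcc not found in ${cuda_home}" >&2
    return 1
}
```

Calling check_cuda_home with no argument tests the default /usr/local/cuda location; a non-zero exit status means the build is likely to fail to find CUDA.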

Note that these dependencies must be installed on every node used in the experiments.

After installing the dependencies, you can build the application by simply running these commands:

cd apps/caffe
./make_all.sh

If you run this code on a public cloud, you can optionally create a VM image after the above steps. This ensures that all nodes have the same environment and built binaries for the experiments. An example for Azure can be found here.

Run CIFAR-10 on two machines in the IID and Non-IID settings

All commands in this section are executed from the apps/caffe directory:

cd apps/caffe

You will first need to prepare a machine file at examples/cifar10/2parts/machinefile, with one host name (or IP address) per line. Since we use two machines in this example, the machine file should have two lines, such as:

h0
h1
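If you would rather generate the machine file than edit it by hand, a sketch along these lines should work. The helper names write_machinefile and count_machines are hypothetical and not part of this repo, and h0 and h1 are the placeholder host names from the example above.

```shell
# Hypothetical helper (not part of this repo): write one host per line.
# Usage: write_machinefile <path> <host> [<host> ...]
write_machinefile() {
    machinefile_path="$1"
    shift
    mkdir -p "$(dirname "$machinefile_path")"
    printf '%s\n' "$@" > "$machinefile_path"
}

# Hypothetical helper: count the non-empty lines, i.e. the machines listed.
# Usage: count_machines <path>
count_machines() {
    grep -c . "$1"
}
```

For the two-machine example, this would be called as write_machinefile examples/cifar10/2parts/machinefile h0 h1, after which count_machines on the same path should report 2.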

We use pdsh to launch commands on those machines over ssh, so please make sure that you can ssh to them without a password (e.g., using key-based authentication). You can also optionally use this command to suppress SSH host-key prompts and warnings:

export PDSH_SSH_ARGS_APPEND="-q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
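Before launching anything with pdsh, it can help to confirm that non-interactive login works for every host in the machine file. The following is a minimal sketch, not something shipped with this repo: the helper name check_passwordless_ssh and the SSH_BIN override are hypothetical, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Hypothetical helper (not part of this repo): try a non-interactive login to
# every host in the machine file. BatchMode=yes makes ssh fail instead of
# prompting for a password. SSH_BIN can be overridden, e.g. for a dry run.
# Usage: check_passwordless_ssh <machinefile>
check_passwordless_ssh() {
    failed=0
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue
        # -n keeps ssh from consuming the rest of the machine file on stdin.
        if "${SSH_BIN:-ssh}" -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host" >&2
            failed=1
        fi
    done < "$1"
    return "$failed"
}
```

Running check_passwordless_ssh examples/cifar10/2parts/machinefile prints one line per host and returns a non-zero status if any host still requires a password or is unreachable.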

Once your machine file is ready, you can run the following commands to download and prepare the CIFAR-10 dataset:

./data/cifar10/get_cifar10.sh
./examples/cifar10/2parts/create_cifar10_pdsh.sh

Our script partitions the dataset into two sets of partitions. The partitions for the IID setting are in ./examples/cifar10/2parts/shuffled, and the ones for the Non-IID setting are in ./examples/cifar10/2parts/skewed.

You need to copy these newly created data partitions to the other node. For example:

scp -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -r ./examples/cifar10/2parts/shuffled h1:$(pwd)/examples/cifar10/2parts/
scp -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -r ./examples/cifar10/2parts/skewed h1:$(pwd)/examples/cifar10/2parts/
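With more than two machines, the same copy has to be repeated for every remote host. The sketch below generates the scp commands from the machine file instead of typing them by hand. It is only an illustration: the helper name print_partition_copy_cmds is hypothetical, and it assumes that the first host in the machine file is the node where the partitions were created.

```shell
# Hypothetical helper (not part of this repo): print the scp commands needed
# to copy both partition folders to every host except the first one in the
# machine file (ASSUMED to be the node where the partitions were created).
# The commands are only printed; pipe the output to "sh" to execute them.
# Usage: print_partition_copy_cmds <local parts dir> <remote parts dir>
print_partition_copy_cmds() {
    parts_dir="$1"      # e.g. ./examples/cifar10/2parts
    remote_dir="$2"     # destination directory on the remote hosts
    tail -n +2 "${parts_dir}/machinefile" |
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue
        for setting in shuffled skewed; do
            echo "scp -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -r ${parts_dir}/${setting} ${host}:${remote_dir}/"
        done
    done
}
```

For the two-machine example, print_partition_copy_cmds ./examples/cifar10/2parts "$(pwd)/examples/cifar10/2parts" prints the same two scp commands shown above; review the output before piping it to sh.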

Finally, you should set up an output folder that can be accessed by the nodes. For example:

export OUTPUT_DATA_PATH="$(pwd)/outputs"

Once the data partitions are ready, you can execute the following command to run different decentralized learning algorithms and DNNs in the IID and Non-IID settings. For example, this command runs BSP, DeepGradientCompression, Gaia, and FederatedAveraging for GN-LeNet on 2 partitions:

./run_cifar10_exps.sh gnlenet 2

The script run_cifar10_exps.sh contains all the configuration details. You can edit this script to try different hyper-parameters and configurations.

Once the experiment is done, you can get the validation accuracy results by running:

python get_cifar_result_2parts_curve.py <output folder>/output.txt

In the paper, we use 5 partitions for CIFAR-10. To do so, you can repeat the above steps with 5 machines and change the number of partitions from 2 to 5.

The examples folder also contains example code for other datasets and applications, such as ImageNet (imagenet), face recognition (casia), and our Flickr-Mammal dataset (geoanimal).

Reference Papers

If you use our code in your work, we would appreciate a reference to the following papers:

Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip B. Gibbons. The Non-IID Data Quagmire of Decentralized Machine Learning. Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.

Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds. Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2017.
