snuspl / dolphin

Shell 1.13% Java 98.87%

dolphin's Issues

Using Avro for aggregating metrics

Currently, ObjectSerializableCodec is used to send and receive task metrics (run-time information).
For performance reasons, it would be good to use Avro instead of Serializable.
Please leave a comment if there are other parts where we could use Avro.

Add WordCount Example

WordCount is a common example in many frameworks.
I implemented it to practice using Dolphin, but I'm not sure whether it is a good example for Dolphin.
What do you think?
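
To make the discussion concrete, here is a minimal single-process sketch of how WordCount might map onto a Compute Task / Controller Task split: each "compute" step counts the words in its own partition, and the "controller" step merges the partial counts. The class and method names are hypothetical and are not Dolphin APIs; a real version would presumably exchange the partial maps through group communication (Reduce) instead of a plain method call.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Dolphin API: plain Java stand-ins for the
// compute (count per partition) and controller (merge) roles.
class WordCountSketch {

  // "Compute" role: count the words in one partition of lines.
  static Map<String, Integer> countPartition(List<String> lines) {
    Map<String, Integer> counts = new HashMap<>();
    for (String line : lines) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) { // an empty line yields one empty token
          counts.merge(word, 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  // "Controller" role: merge partial counts, as a Reduce would.
  static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
    Map<String, Integer> total = new HashMap<>();
    for (Map<String, Integer> partial : partials) {
      partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
    }
    return total;
  }

  public static void main(String[] args) {
    Map<String, Integer> p1 = countPartition(Arrays.asList("a b a", ""));
    Map<String, Integer> p2 = countPartition(Arrays.asList("b c"));
    System.out.println(mergeCounts(Arrays.asList(p1, p2)));
  }
}
```

Because the merge step is associative, WordCount fits a single stage with one Reduce, which may make it a small but reasonable first example.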

Add explanations

We have to add comments to all public classes and their public methods so that we can run javadoc to generate API documentation automatically.
In particular, we need detailed explanations for the following classes:

ComputeTask
(the steps performed in ComputeTask and how ComputeTask uses group communication, considering the finite state machine that describes Controller Task and Compute Task state changes)
DataParser
(how DataParser works. There are two methods, parse and get. Why are there two?)
Driver
(main logic of the Driver)
Stage
(main logic of a Stage)
UserJobInfo
UserParameters
(what are needed at Driver, and ControllerTask explicitly)
KeyValueStore
(how we use this key value store)
UserComputeTask
(iteration)
Each Machine Learning Algorithm and its important parameters
FlexionConfiguration
FlexionParameters

Interfaces of various DNN components

Before we create a pull request for #66, we'd like to first go over the interfaces of public classes, such as Layer. This is also for minimizing the size of the future pull request.

Add multilayer perceptron algorithm.

Before implementing a convolutional neural network, I suggest that we implement a multilayer perceptron first.

There are some existing implementations of the multilayer perceptron:

C version : Neural Networks for Face Recognition
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html
Python version : a fully connected neural network
https://github.com/jorgenkg/python-neural-network
http://arctrix.com/nas/python/bpnn.py

At present, there is no elegant way to transfer data across partition boundaries for horizontal or vertical partitioning, but we could use group communication with the master/slave model in Dolphin.

GroupComm-style Parameter Server

Before we tackle #68, we'd like to implement a Parameter Server that uses the friendly REEF Group Communication. Although this forces Parameter Server communication to be synchronous, we'd be able to verify that our distributed DNN architecture is working correctly.

Implement K-means using the multi-stage programming model

Objective:

Implement K-means using the new multi-stage programming model

Specification:

K-means algorithm corresponds to a Job consisting of two Stages
In the first stage (preprocess), initial centroids are aggregated to Controller Task
In the second stage (main process), centroids are adjusted so that they can represent clusters in data
In the second stage, Compute Tasks assign each data point to the nearest cluster and Controller Task computes new centroids
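
The second stage described above can be sketched in plain Java as follows. This is only an illustration under assumptions: the class and method names are hypothetical, points are dense double arrays, and a single method call stands in for the aggregation that would otherwise go through group communication. With several Compute Tasks, their partial sum arrays would be added element-wise before the Controller Task divides by the counts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of one main-stage K-means iteration; not Dolphin code.
class KMeansStepSketch {

  // Index of the centroid nearest to the point (squared Euclidean distance).
  static int nearest(double[] point, double[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0;
      for (int d = 0; d < point.length; d++) {
        double diff = point[d] - centroids[c][d];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }

  // "Compute Task" role: per-cluster vector sums. The extra last slot of
  // each array holds the number of points assigned to that cluster.
  static Map<Integer, double[]> partialSums(double[][] points, double[][] centroids) {
    Map<Integer, double[]> sums = new HashMap<>();
    for (double[] p : points) {
      double[] acc = sums.computeIfAbsent(nearest(p, centroids), k -> new double[p.length + 1]);
      for (int d = 0; d < p.length; d++) {
        acc[d] += p[d];
      }
      acc[p.length] += 1;
    }
    return sums;
  }

  // "Controller Task" role: divide aggregated sums by counts.
  // A cluster that received no points keeps its old centroid.
  static double[][] newCentroids(Map<Integer, double[]> sums, double[][] old) {
    double[][] result = new double[old.length][];
    for (int c = 0; c < old.length; c++) {
      double[] acc = sums.get(c);
      if (acc == null) {
        result[c] = old[c].clone();
        continue;
      }
      int dim = old[c].length;
      result[c] = new double[dim];
      for (int d = 0; d < dim; d++) {
        result[c][d] = acc[d] / acc[dim];
      }
    }
    return result;
  }
}
```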

About squashing commits (rebase) into one

Although Apache encourages each pull request to be a single commit using git rebase, squashing is not always the best option. For example, the history and context of a series of commits are lost when they are squashed. There's also the overhead of running git rebase on your code when you could just push it right away. For your information, we currently do not require git rebase, but we can always change this if many think we should follow the Apache convention.

Sending evaluator configurations on heartbeats

Our plan was to send evaluator configurations (results of the optimization plan) from the CtrlTask evaluator to the Driver piggybacked on the heartbeat messages. However, according to evaluator_runtime.proto, there are several kinds of messages used:

ContextMessageProto
ContextControlProto
EvaluatorHeartbeatProto
EvaluatorControlProto

I haven't checked when each message is used, but it seems clear that using the heartbeat to send our own messages isn't our only option. In fact, we could even define our own message. We should discuss this.

Fix if statement in KMeansMainCmpTask

Trivial change about coding style in KMeansMainCmpTask.java

// Compute vector sums for each cluster centroid
-      if (pointSum.containsKey(nearestClusterId) == false) {
-        pointSum.put(nearestClusterId, new VectorSum(vector, 1, true));
-      } else {
+      if (pointSum.containsKey(nearestClusterId)) {
         pointSum.get(nearestClusterId).add(vector);
+      } else {
+        pointSum.put(nearestClusterId, new VectorSum(vector, 1, true));
       }

Examine NetworkServiceTests for data migration

Support Multiple Data Sources

Hi, let me show an example usage of the multiple data sources that I mentioned before. This is just for discussion. :-)

First, an NN training algorithm needs two kinds of data: a prebuilt model and training data. In some algorithms, the training data may be split into two disjoint sets (test and train).

Second, there are many ways to provide training data. The simplest style is label folders containing binary instance files such as JPG, MP3, and MP4. The second most popular style is a DBMS such as LMDB or LevelDB. In this issue, let's focus on the first one.

Now, let me describe a specific example of a desirable structure for an OCR data set, MNIST. (The following paths are on a local file system or HDFS.)

/data/image/mnist/jpg/0
/data/image/mnist/jpg/1
/data/image/mnist/jpg/2
...
/data/image/mnist/jpg/9
/model/mnist/prebuilt-model
/tmp/new-model

The data folder contains all kinds of data. Here, 0, ..., 9 are folder names, and each is used as the class label for the files under that folder. prebuilt-model is optional, but it will be used in most cases during real DNN training. new-model is a temporary folder for new model snapshots. Please note that many algorithms need to save model snapshots every predefined number of epochs.
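
As a rough illustration of the label-folder convention, the class label of an instance file can be read off its parent directory name. The sketch below uses java.nio.file for a local path; the class name, method name, and the file name 00001.jpg are made up for the example, and an HDFS-backed version would need the corresponding Hadoop path type instead.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Dolphin.
class FolderLabelSketch {

  // The class label is the name of the directory that directly contains the file.
  static String labelOf(Path file) {
    Path parent = file.getParent();
    if (parent == null || parent.getFileName() == null) {
      throw new IllegalArgumentException("No label folder for " + file);
    }
    return parent.getFileName().toString();
  }

  public static void main(String[] args) {
    // 00001.jpg is an invented file name used only for illustration.
    System.out.println(labelOf(Paths.get("/data/image/mnist/jpg/7/00001.jpg")));
  }
}
```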

Dolphin supports multiple stages. So, what about letting each stage have its own input directory?

model_loading_stage: need to load /model/mnist/prebuilt-model
data_loading_stage: need to load /data/image/mnist/jpg/*
data_splitting_stage: (the path is not needed since it can be hidden)
training_stage: need to save into /tmp/new-model (We can do this by using Output Service.)

Please leave any comments here.

Renaming Classes

We need to change the names of some classes, including but not limited to:

UserParameter
(it looks like a configuration factory)
UserComputeTask
UserControllerTask

Too many files for one algorithm

Currently, a user needs to create many files (sometimes more than 10) to implement a single example using Dolphin. Although #7 solves this partially, #7 will not be resolved for a while. It'd be better if we can reduce the number of new files a user needs to code without the help of a DSL.

Implement Logistic Regression and Linear Regression using the multi-stage programming model

Related branch: kj-generalize

I will modify Joo Seong's implementation of Stochastic Gradient Descent (SGD) (branch name: js_sgd) so that it runs on the multi-stage programming model.
And then I will implement Logistic Regression and Linear Regression (machine learning algorithms) using SGD (an optimization algorithm).

Unit tests for algorithms

We don't have any unit tests for algorithms. We should add tests so that we can make sure the algorithms are actually performing intended behaviors.

Duplicated throws in DataParser

I think the duplicated 'ParseException' should be removed in DataParser.java

-  public T get() throws ParseException, ParseException;
+  public T get() throws ParseException;

Define controlTask abstraction

We are defining a new abstraction, controlTask, to be used as the controller task of an iterative job.

How about providing a controlTask interface that extends the Task interface?
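
One possible shape for such an interface is sketched below, purely as a starting point for discussion. SketchTask is a local stand-in declared so the example compiles on its own; it is modeled on the assumption that the task interface exposes a single call method that takes and returns a byte array. The names ControlTask, runIteration, and isTerminated are hypothetical.

```java
// Local stand-in for the task interface (an assumption, declared here so
// that the sketch is self-contained).
interface SketchTask {
  byte[] call(byte[] memento) throws Exception;
}

// Hypothetical controller abstraction for iterative jobs: implementors
// provide the per-iteration logic and the termination test.
interface ControlTask extends SketchTask {
  void runIteration(int iteration) throws Exception;

  boolean isTerminated(int iteration);

  // The iteration loop lives in one place instead of in every controller.
  @Override
  default byte[] call(byte[] memento) throws Exception {
    int iteration = 0;
    while (!isTerminated(iteration)) {
      runIteration(iteration);
      iteration++;
    }
    return null; // no result is passed back in this sketch
  }
}

// Example controller that simply stops after a fixed number of iterations.
class FixedIterationControlTask implements ControlTask {
  private final int maxIter;
  int completed = 0;

  FixedIterationControlTask(int maxIter) {
    this.maxIter = maxIter;
  }

  @Override
  public void runIteration(int iteration) {
    completed++;
  }

  @Override
  public boolean isTerminated(int iteration) {
    return iteration >= maxIter;
  }
}
```

Putting the loop in a default method is one design choice; an abstract base class would work as well if controllers need shared state.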

Non-linear stages

Currently, all stages are connected in a linear pipeline fashion. We need to consider a more complex execution graph for other algorithms. Let's take a look at Naiad.

Step 2 - Implement system primitives

Implement system primitives for regulating tasks:

Split : split one task into several tasks
Merge : merge two or more tasks into one
Add : add tasks to job
Delete : remove tasks from job

Compute Tasks should send optimization metric to Driver

Compute Tasks should send optimization metric to Driver, using the heartbeat push mechanism.
e.g. Task completion time, communication overheads

Add AutoEncoder testcase

I want to suggest a simple test case for the fully connected neural network module.

An AutoEncoder can test NN performance without any external data, and it is also used as a pre-training stage for some NNs.

Let us assume a three-layer NN of size 65536 x 16 x 65536. During training, we can choose any number n (0 <= n < 65536) and set the n-th node to 1.0 in both the input and output layers. The hidden layer should eventually saturate to the binary representation of that number.

Here, I just mentioned one hidden layer, but we can use more hidden layers, too.
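
A small data generator for this test could look like the sketch below. The class and method names are hypothetical. binaryCode only shows the plain binary representation of n as a reference; which bit order (or permutation of bits) a trained hidden layer actually settles on is not specified in this issue, so the comparison against it should be treated as an assumption.

```java
// Hypothetical generator for the 65536 x 16 x 65536 AutoEncoder test.
class AutoEncoderDataSketch {
  static final int BITS = 16;
  static final int SIZE = 1 << BITS; // 65536 input and output nodes

  // One-hot vector with node n set to 1.0; used as both input and target.
  static float[] oneHot(int n) {
    if (n < 0 || n >= SIZE) {
      throw new IllegalArgumentException("n must be in [0, " + SIZE + ")");
    }
    float[] v = new float[SIZE];
    v[n] = 1.0f;
    return v;
  }

  // Reference binary code of n, least significant bit first,
  // with one value (0.0 or 1.0) per hidden node.
  static float[] binaryCode(int n) {
    float[] code = new float[BITS];
    for (int b = 0; b < BITS; b++) {
      code[b] = (n >> b) & 1;
    }
    return code;
  }
}
```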

Implement Expectation Maximization (EM) using the multi-stage programming model

Objective:

Implement Expectation Maximization (EM) algorithm using the multi-stage programming model
Specification:

EM algorithm corresponds to a Job consisting of two Stages
In the first stage (preprocess), initial centroids are aggregated to Controller Task
In the second stage (main process), centroids and covariance matrices are adjusted so that they can represent clusters in data
In the second stage, Compute Tasks compute partial statistics of each cluster, and Controller Task computes new centroids and covariance matrices based on aggregated statistics

Allow user configuration of services

I believe we need a way to allow users to configure services directly (i.e., via Tang). This may be something that can be exposed as part of #26

Some examples come to mind:

OutputService: The user may want to use a different DFS -- e.g., Amazon S3 -- or even something entirely different -- e.g., send an email to the user for a very long-running job.
KeyValueStoreService: The user may want to use a different implementation -- e.g., an off-heap key value store.

Add ALS algorithm

Add the Alternating Least Squares algorithm.

Fix ParseException on blank lines

This issue was found by @jsjason and is related to the following existing classes:

ClassificationDataParser
ClusteringDataParser
RegressionDataParser

All of the above classes throw the following exception on blank lines:

edu.snu.reef.dolphin.core.ParseException: Parse failed: each field should be a number
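
One way to address this is to skip blank lines before splitting a line into fields. The sketch below is not the code of the parsers listed above: the class name is hypothetical, the input is assumed to be whitespace-separated numbers, and IllegalArgumentException stands in for Dolphin's ParseException so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of blank-line-tolerant parsing.
class BlankLineParseSketch {

  // Parses whitespace-separated numeric rows, skipping blank lines.
  static List<double[]> parse(String text) {
    List<double[]> rows = new ArrayList<>();
    for (String line : text.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // blank or whitespace-only line: skip instead of failing
      }
      String[] fields = trimmed.split("\\s+");
      double[] row = new double[fields.length];
      for (int i = 0; i < fields.length; i++) {
        try {
          row[i] = Double.parseDouble(fields[i]);
        } catch (NumberFormatException e) {
          // Stand-in for ParseException; same message as the one reported above.
          throw new IllegalArgumentException("Parse failed: each field should be a number", e);
        }
      }
      rows.add(row);
    }
    return rows;
  }
}
```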

Resource constraints

How are we going to adjust the optimal scheme from milestone 1 when resource conditions fluctuate?

Step 1 - Define job state

Define what a state of a job is (job state abstraction).

Import shimoga as a jar

Instead of building both shimoga and dolphin, let's just bring shimogapp into this repository.

Add empty input file handling

If the given input file is empty, the job is terminated by a timeout. Presumably, Dolphin does not start the initial data loading stage properly in this case. I'm looking for the right place to handle this.

./run_kmeans.sh -numCls 4 -convThr 0.01 -maxIter 20 -local true -split 4 -input /dev/null
...
2015-05-14 16:04:42,335 INFO edu.snu.reef.dolphin.core.DolphinLauncher.run main | REEF job completed: FORCE_CLOSED

This issue was derived from #30.

Define optimal job scheme

Define optimal job execution scheme, based on job-completion time.

What actually is an optimal job scheme?
We may need a well-defined definition for "performance".

Group communication class packages

We're still importing group communication classes from com.microsoft packages. This forces users to download and build shimoga, which is unnecessary as of REEF 0.11.0. We need to change such import statements to org.apache.

Performance evaluation of ML algorithms

We've implemented some ML algorithms in #2.

As a starting point for #4, we have to run some performance tests on the ML algorithms.
It would be better to start with K-means, which we think is the most sophisticated implementation among our algorithms.

Measuring factors:

Running time per iteration
Training data set size vs. running time
Number of nodes vs. computation and communication time
Elastic MPI's reconfiguration time (this would soon include Split/Merge system primitives from #2)

Measuring factors can be updated anytime.

We can utilize the newly built logging system from reef's main branch.
apache/reef#8

Domain Specific Language

Define a DSL for users and compile down to MPI.
Refer to other interfaces (such as Spark MLlib).

This may fall outside the scope of this project and belong to cmssnu/reef_ml instead, depending on the schedule.

Step 5 - Apply scheme by using system primitives

Actually apply the scheme constructed from #4 by using system primitives from #1, taking job state into account to optimize job-processing.

Initial optimization vs. progressive optimization (every N iterations)
Subject 1: Communication vs computation overhead
Subject 2: Stragglers
Subject 3: Task Failures
Other subjects: total number of iterations of a job, execution time of each task, CPU cycles, network traffic, kind of failure (heap? other?).

Comply with checkstyle

Now that checkstyle has been imported from REEF, we can work to fix the parts that don't comply.

This could be a good "introductory exercise" for CMSLab's summer interns to get familiar with Dolphin code, REEF style conventions, and GitHub collaboration.

Implement a naive neural network that runs on a single evaluator

We are currently implementing a single-evaluator neural network that uses only fully connected layers. Starting from this issue, we will add more features such as more kinds of layers, asynchronous communication with a parameter server, etc.

Add SVD algorithm

Add an algorithm for Singular Value Decomposition.

Design a multi-stage programming model

Objective:

Design a programming model that is simple but able to express various ML algorithms

Specification:

Each Job consists of one or more Stages
Stages are executed on the same evaluators, which maintain Contexts, stage by stage
Data can be passed among Stages using Key-value Store Service
Each Stage follows a BSP programming model
Each Stage consists of one Controller Task and one or more homogeneous Compute Tasks
The Controller Tasks and Compute Tasks communicate with each other through Group Communication
Group Communication includes Broadcast, Reduce, Gather, and Scatter of arbitrary types of data
Each Compute Task and Controller Task consists of initialize, run, and cleanup steps.
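
The task lifecycle in the specification above can be sketched roughly as follows. The interface and class names are hypothetical, the iteration count passed to run is an assumption, and the sequential inner loop merely stands in for a BSP barrier (every task finishes iteration i before any task starts iteration i + 1). Group communication between the tasks is left out.

```java
import java.util.List;

// Hypothetical lifecycle interface: the method names follow the three steps
// named in the specification, but the signatures are assumptions.
interface StageTaskSketch {
  void initialize();

  void run(int iteration);

  void cleanup();
}

class StageRunnerSketch {

  // Runs one stage: initialize every task, iterate in lockstep, then clean up.
  static void runStage(List<? extends StageTaskSketch> tasks, int iterations) {
    for (StageTaskSketch task : tasks) {
      task.initialize();
    }
    for (int i = 0; i < iterations; i++) {
      for (StageTaskSketch task : tasks) {
        task.run(i); // in a real stage the tasks would run in parallel
      }
    }
    for (StageTaskSketch task : tasks) {
      task.cleanup();
    }
  }
}

// Helper that records the order of lifecycle calls in a shared log.
class RecordingTaskSketch implements StageTaskSketch {
  private final String name;
  private final List<String> log;

  RecordingTaskSketch(String name, List<String> log) {
    this.name = name;
    this.log = log;
  }

  @Override
  public void initialize() {
    log.add(name + ":init");
  }

  @Override
  public void run(int iteration) {
    log.add(name + ":run" + iteration);
  }

  @Override
  public void cleanup() {
    log.add(name + ":cleanup");
  }
}
```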

Add checkstyle

REEF has added checkstyle (run via mvn checkstyle:checkstyle) and is in the process of getting the rules and code to pass. We should apply REEF's latest checkstyle to Dolphin as well. After that, we can work to remove the parts that don't comply.

This could be a good "introductory exercise" for CMSLab's summer interns to get familiar with Dolphin code, REEF style conventions, and GitHub collaboration.

Upgrade to REEF 0.11.0

REEF 0.11.0-incubating has been released to Maven Central. It's time to upgrade to 0.11.0.

Update reef.version in pom.xml
Remove shimoga.version in pom.xml

Fault tolerance

We have to

specify what to do when a fault occurs,
implement code for fault tolerance,
and test whether this implementation covers various fault cases.

I think that implementing fault tolerance will be closely related to implementing EM (Elastic Memory).

Currently, we just log failed tasks (see TaskFailedHandler in FlexionDriver) and update the group communication topology when it changes (see updateTopology in ControllerTask).
These methods should be modified and improved.

Add PageRank example

PageRank is one of the best-known algorithms.
An example would be helpful for newcomers to Dolphin.

Configurable part of Driver?

ControllerTask and ComputeTask provide UserControllerTask and UserComputeTask, respectively.
We don't have a similar class for Driver?

Introduce asynchronous parameter server

There are many ways to support data partitioning of a deep neural network in a distributed environment. One of them is to maintain a Parameter Server that communicates with the networks in an asynchronous fashion, as suggested by DistBelief and Adam. We should add such a component to implement data partitioning.

User specifies the path of the output directory (local path or HDFS path) for each stage
Each task (Control or Compute) composing a stage creates a separate output file under the directory
User writes outputs through an output stream which can be accessed in all the methods of UserCmpTask and UserCtrlTask

Step 3 - Implement ML algorithms

Implement ML algorithms to use as test applications.
This may be resolved by using algorithms from the cmssnu/reef_ml repository.
