dolphin's People

Contributors

bchocho, beomyeol, bgchun, dongjoon-hyun, jsjason, jsryu21, kijungs, swlsw, wynot12, yunseong

Forkers

taehunkim kiminh

dolphin's Issues

Using Avro for aggregating metrics

Currently, the ObjectSerializableCodec is used to send and receive metrics (run-time information) of tasks.
For performance reasons, it would be good to use Avro instead of Java serialization.
Please leave comments, if there are other parts where we can use Avro.
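
To give a rough feel for the overhead in question, here is a minimal sketch that compares Java serialization with a hand-written compact binary encoding. This is a stand-in for a schema-based format, not the Avro API itself, and the class name, method names, and metric keys are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Map;

// Hypothetical sketch: not Dolphin code and not the Avro API.
final class MetricEncodingSketch {
  // Java serialization: the stream carries class descriptors along with the data.
  // The map passed in must be Serializable (e.g. TreeMap or HashMap).
  static byte[] javaSerialize(Map<String, Double> metrics) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(metrics);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Schema-style encoding: entry count, then (name, value) pairs and nothing else.
  static byte[] compactEncode(Map<String, Double> metrics) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(metrics.size());
      for (Map.Entry<String, Double> entry : metrics.entrySet()) {
        out.writeUTF(entry.getKey());
        out.writeDouble(entry.getValue());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

For a small metric map the compact form is a fraction of the serialized size, which is the kind of saving a schema-based codec is expected to bring. Actual numbers with Avro would need to be measured.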

Add WordCount Example

WordCount is a common example in many frameworks.
I have implemented it to practice using Dolphin, but I don't know whether it is a good example for Dolphin.
What do you think about it?
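
For discussion, here is a minimal sketch of how WordCount could map onto the Compute Task / Controller Task split: each compute task counts words in its own partition, and the controller merges the partial counts. It runs locally without REEF or group communication, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain Java, no Dolphin or REEF classes involved.
final class WordCountSketch {
  // Compute-task side: count whitespace-separated words in one data partition.
  static Map<String, Integer> countPartition(List<String> lines) {
    Map<String, Integer> counts = new HashMap<>();
    for (String line : lines) {
      for (String word : line.trim().split("\\s+")) {
        if (!word.isEmpty()) { // a blank line yields one empty token; skip it
          counts.merge(word, 1, Integer::sum);
        }
      }
    }
    return counts;
  }

  // Controller side: reduce the partial counts from all compute tasks into one map.
  static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
    Map<String, Integer> total = new HashMap<>();
    for (Map<String, Integer> partial : partials) {
      partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
    }
    return total;
  }
}
```

Note that WordCount needs only a single reduce and no iteration, so it exercises little of the iterative model.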

Add explanations

We have to add comments for all public classes and their public methods so that we can run javadoc to generate API documentation automatically.
Especially, we need detailed explanation for the following classes:

  • ComputeTask
    (what are the steps done in ComputeTask and how ComputeTask uses group comm, considering finite state machine that describes Controller and Compute Task state changes)
  • DataParser
    (how DataParser works. There are two methods, parse and get. Why are there two?)
  • Driver
    (main logic of the Driver)
  • Stage
    (main logic of a Stage)
  • UserJobInfo
  • UserParameters
    (what are needed at Driver, and ControllerTask explicitly)
  • KeyValueStore
    (how we use this key value store)
  • UserComputeTask
    (iteration)
  • Each Machine Learning Algorithm and its important parameters
  • FlexionConfiguration
  • FlexionParameters

Interfaces of various DNN components

Before we create a pull request for #66, we'd like to first go over the interfaces of public classes, such as Layer. This is also for minimizing the size of the future pull request.

Add multilayer perceptron algorithm.

Before implementing a convolutional neural network, I suggest that we implement a multilayer perceptron first.

There are some existing implementations of the multilayer perceptron.

  1. C version : Neural Networks for Face Recognition
    http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html
  2. Python version : a fully connected neural network
    https://github.com/jorgenkg/python-neural-network
    http://arctrix.com/nas/python/bpnn.py

At present, there is no elegant way to transfer data across partition boundaries for horizontal or vertical partitioning, but we could use group communication with the master/slave model in Dolphin.

GroupComm-style Parameter Server

Before we tackle #68, we'd like to implement a Parameter Server that uses the friendly REEF Group Communication. Although this forces Parameter Server communication to be synchronous, we'd be able to verify that our distributed DNN architecture is working correctly.

Implement K-means using the multi-stage programming model

Objective:

  • Implement K-means using the new multi-stage programming model

Specification:

  • K-means algorithm corresponds to a Job consisting of two Stages
  • In the first stage (preprocess), initial centroids are aggregated to Controller Task
  • In the second stage (main process), centroids are adjusted so that they can represent clusters in data
  • In the second stage, Compute Tasks assign each data point to the nearest cluster and Controller Task computes new centroids
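
The second stage described above can be sketched as a single local iteration: assignment of points to the nearest centroid (the Compute Task part) followed by recomputing each centroid as the mean of its points (the Controller Task part). This is only an illustration under simplifying assumptions: everything runs in one process, and the class and method names are hypothetical.

```java
// Hypothetical sketch: one K-means iteration without group communication.
final class KMeansStepSketch {
  // Compute-task side: index of the centroid nearest to the point (squared Euclidean distance).
  static int nearest(double[] point, double[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0;
      for (int d = 0; d < point.length; d++) {
        double diff = point[d] - centroids[c][d];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }

  // Assign every point, accumulate per-cluster sums, then compute the new centroids.
  static double[][] iterate(double[][] points, double[][] centroids) {
    int k = centroids.length;
    int dim = centroids[0].length;
    double[][] sums = new double[k][dim];
    int[] counts = new int[k];
    for (double[] p : points) {
      int c = nearest(p, centroids);
      counts[c]++;
      for (int d = 0; d < dim; d++) {
        sums[c][d] += p[d];
      }
    }
    double[][] next = new double[k][dim];
    for (int c = 0; c < k; c++) {
      for (int d = 0; d < dim; d++) {
        // An empty cluster keeps its previous centroid.
        next[c][d] = counts[c] == 0 ? centroids[c][d] : sums[c][d] / counts[c];
      }
    }
    return next;
  }
}
```

In the distributed version, the per-cluster sums and counts would be what each Compute Task sends to the Controller Task through a reduce.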

About squashing commits (rebase) into one

Although Apache encourages each pull request to be a single commit using git rebase, squashing is not always the best option. For example, the history and context of a series of commits vanish when they are squashed. There's also the overhead of applying git rebase to your code when you could just push it right away. For your information, we currently don't require git rebase, but we can always change this if many think we should follow the Apache convention.

Sending evaluator configurations on heartbeats

Our plan was to send evaluator configurations (results of the optimization plan) from the CtrlTask evaluator to the Driver piggybacked on the heartbeat messages. However, according to evaluator_runtime.proto, there are several kinds of messages used:

  • ContextMessageProto
  • ContextControlProto
  • EvaluatorHeartbeatProto
  • EvaluatorControlProto

I haven't checked when each message is used, but it clearly seems that using the heartbeat to send our own messages isn't our only option. In fact, we could even define our own message. We should have some talk about this.

Fix if statement in KMeansMainCmpTask

Trivial change about coding style in KMeansMainCmpTask.java

// Compute vector sums for each cluster centroid
-      if (pointSum.containsKey(nearestClusterId) == false) {
-        pointSum.put(nearestClusterId, new VectorSum(vector, 1, true));
-      } else {
+      if (pointSum.containsKey(nearestClusterId)) {
         pointSum.get(nearestClusterId).add(vector);
+      } else {
+        pointSum.put(nearestClusterId, new VectorSum(vector, 1, true));
       }

Support Multiple Data Sources

Hi, let me show you an example of using multiple data sources, which I mentioned before. This is just for discussion. :-)

First of all, an NN training algorithm needs two kinds of data, i.e., a prebuilt model and training data. In some algorithms, the training data might be split into two disjoint sets (test and train).

Second, there are many ways to provide training data. The simplest style is labeled folders containing binary instance files such as JPG, MP3, and MP4. The second most popular style is using a database such as LMDB or LevelDB. In this issue, let's focus on the first one.

Now, let me describe a specific example of a desirable structure for OCR with mnist. (The following paths are on a local file system or HDFS.)

/data/image/mnist/jpg/0
/data/image/mnist/jpg/1
/data/image/mnist/jpg/2
...
/data/image/mnist/jpg/9
/model/mnist/prebuilt-model
/tmp/new-model

The data folder contains all kinds of data. Here, 0, ..., 9 are the folder names and will be used as class labels for the files under those folders. prebuilt-model is optional, but it will be used in most cases during real DNN training. new-model is a temporary folder for new model snapshots. Please note that many algorithms need to save a model snapshot every predefined number of epochs.

Dolphin supports multiple stages. So, how about giving each stage its own input directory?

  • model_loading_stage: need to load /model/mnist/prebuilt-model
  • data_loading_stage: need to load /data/image/mnist/jpg/*
  • data_splitting_stage: (the path is not needed since it can be hidden)
  • training_stage: need to save into /tmp/new-model (We can do this by using Output Service.)

Please leave any comments here.

Renaming Classes

We need to change the names of some classes, including but not limited to:

  • UserParameter
    (it looks like a configuration factory)
  • UserComputeTask
  • UserControllerTask

Too many files for one algorithm

Currently, a user needs to create many files (sometimes more than 10) to implement a single example using Dolphin. Although #7 solves this partially, #7 will not be resolved for a while. It'd be better if we can reduce the number of new files a user needs to code without the help of a DSL.

Unit tests for algorithms

We don't have any unit tests for algorithms. We should add tests so that we can make sure the algorithms are actually performing intended behaviors.

Duplicated throws in DataParser

I think the duplicated 'ParseException' should be removed in DataParser.java

-  public T get() throws ParseException, ParseException;
+  public T get() throws ParseException;

Define controlTask abstraction

We are defining a new abstraction, controlTask, that is used as the controller task of an iterative job.

How about providing a controlTask interface that extends the task interface?
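
As a starting point for discussion, the proposal could look roughly like the sketch below. The Task interface here is a local stand-in that mirrors the shape of REEF's Task (a single call method taking and returning a byte array). The ControlTask methods runIteration and isTerminated are hypothetical names, not an agreed design.

```java
// Local stand-in for REEF's Task interface, so the sketch compiles on its own.
interface Task {
  byte[] call(byte[] memento) throws Exception;
}

// Hypothetical controlTask abstraction for iterative jobs.
interface ControlTask extends Task {
  // Work the controller does in one iteration (e.g. broadcast, reduce, update the model).
  void runIteration(int iteration);

  // Whether the job should stop before running the given iteration.
  boolean isTerminated(int iteration);

  // Default driver loop: run iterations until the termination condition holds.
  @Override
  default byte[] call(byte[] memento) {
    int iteration = 0;
    while (!isTerminated(iteration)) {
      runIteration(iteration);
      iteration++;
    }
    return null; // this sketch returns no result to the Driver
  }
}
```

With this shape, a user would implement only the two iteration methods and leave the loop to the framework.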

Non-linear stages

Currently, all stages are connected in a linear pipeline fashion. We need to consider a more complex execution graph for other algorithms. Let's take a look at Naiad.

Step 2 - Implement system primitives

Implement system primitives for regulating tasks:

  1. Split : split one task into several tasks
  2. Merge : merge two or more tasks into one
  3. Add : add tasks to job
  4. Delete : remove tasks from job

Add AutoEncoder testcase

I want to suggest a simple testcase for fully-connected neural network module.

An AutoEncoder can not only test NN performance without any data, but is also used as a pre-training stage for some NNs.

Let's assume that we use a three-layer NN whose size is 65536 x 16 x 65536. During training, we can choose any number n (0 <= n < 65536) and set the n-th node to 1.0 in both the input and output layers. The hidden layer should eventually saturate to a binary representation of that number.

Here, I just mentioned one hidden layer, but we can use more hidden layers, too.
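
The test data for this case can be generated rather than loaded, as in the minimal sketch below. The class and method names are hypothetical. The binaryCode helper shows the canonical binary pattern only as a reference; a trained network may well settle on a different but equally valid 16-bit code.

```java
// Hypothetical sketch: synthetic data for the AutoEncoder test case.
final class AutoEncoderTestData {
  // One-hot vector of the given size with 1.0 at index n; used as both input and target.
  static double[] oneHot(int n, int size) {
    if (n < 0 || n >= size) {
      throw new IllegalArgumentException("n must be in [0, " + size + ")");
    }
    double[] vector = new double[size];
    vector[n] = 1.0;
    return vector;
  }

  // Canonical binary pattern of n, least significant bit first.
  static double[] binaryCode(int n, int bits) {
    double[] code = new double[bits];
    for (int b = 0; b < bits; b++) {
      code[b] = (n >> b) & 1;
    }
    return code;
  }
}
```

A scaled-down network such as 16 x 4 x 16, with oneHot(n, 16) for n from 0 to 15, would keep the test fast while exercising the same idea.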

Implement Expectation Maximization (EM) using the multi-stage programming model

Objective:

  • Implement the Expectation Maximization (EM) algorithm using the multi-stage programming model

Specification:

  • EM algorithm corresponds to a Job consisting of two Stages
  • In the first stage (preprocess), initial centroids are aggregated to Controller Task
  • In the second stage (main process), centroids and covariance matrices are adjusted so that they can represent clusters in data
  • In the second stage, Compute Tasks compute partial statistics of each cluster, and Controller Task computes new centroids and covariance matrices based on aggregated statistics

Allow user configuration of services

I believe we need a way to allow users to configure services directly (i.e., via Tang). This may be something that can be exposed as part of #26.

Some examples come to mind:

  • OutputService: The user may want to use a different DFS -- e.g., Amazon S3 -- or even something entirely different -- e.g., send an email to the user for a very long-running job.
  • KeyValueStoreService: The user may want to use a different implementation -- e.g., an off-heap key value store.

Fix ParseException on blank lines

This issue was found by @jsjason and affects the following existing classes:

  • ClassificationDataParser
  • ClusteringDataParser
  • RegressionDataParser

All of the above classes throw the following exception on blank lines:

edu.snu.reef.dolphin.core.ParseException: Parse failed: each field should be a number
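
One possible fix is to skip lines that are empty after trimming, before any field is parsed. The sketch below shows the idea on a simplified parser. It assumes whitespace-separated numeric fields and uses IllegalArgumentException as a stand-in for Dolphin's ParseException; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the actual ClassificationDataParser and friends.
final class BlankLineTolerantParser {
  // Parses each non-blank line into an array of doubles; blank lines are ignored.
  static List<double[]> parse(List<String> lines) {
    List<double[]> rows = new ArrayList<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // skip blank lines instead of failing on them
      }
      String[] fields = trimmed.split("\\s+");
      double[] row = new double[fields.length];
      for (int i = 0; i < fields.length; i++) {
        try {
          row[i] = Double.parseDouble(fields[i]);
        } catch (NumberFormatException e) {
          throw new IllegalArgumentException("Parse failed: each field should be a number", e);
        }
      }
      rows.add(row);
    }
    return rows;
  }
}
```

Lines with non-numeric fields still fail as before, so only the blank-line behavior changes.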

Resource constraints

How are we going to adjust the optimal scheme from milestone 1 when resource conditions fluctuate?

Import shimoga as a jar

Instead of building both shimoga and dolphin, let's just bring shimogapp into this repository.

Add empty input file handling

If the given input file is empty, the job is terminated by a timeout. It appears that Dolphin does not start the initial data loading stage properly. I'm tracking down the proper place to handle this.

./run_kmeans.sh -numCls 4 -convThr 0.01 -maxIter 20 -local true -split 4 -input /dev/null
...
2015-05-14 16:04:42,335 INFO edu.snu.reef.dolphin.core.DolphinLauncher.run main | REEF job completed: FORCE_CLOSED

This issue was derived from #30.

Define optimal job scheme

Define optimal job execution scheme, based on job-completion time.

What actually is an optimal job scheme?
We may need a well-defined definition for "performance".

Group communication class packages

We're still using com.microsoft to import group communication classes. This makes users have to download and build shimoga, which is meaningless starting from REEF 0.11.0. We need to change such import statements to org.apache.

Performance evaluation of ML algorithms

We've implemented some ML algorithms in #2.

As a starting point of #4, we have to do some performance test on ML algorithms.
It would be better to start with K-means, which we think is the most sophisticated implementation among our algorithms.

Measuring factors:

  • Running time per iteration
  • Training data set vs running time
  • Number of nodes vs computation and communication time
  • Elastic MPI's reconfiguration time (this would soon include Split/Merge system primitives from #2)

Measuring factors can be updated anytime.

We can utilize the newly built logging system from reef's main branch.
apache/reef#8

Domain Specific Language

Define a DSL for users and compile down to MPI.
Refer to other interfaces (such as Spark MLlib).

May not be included in the scope of this project, but rather cmssnu/reef_ml depending on schedule.

Step 5 - Apply scheme by using system primitives

Actually apply the scheme constructed from #4 by using system primitives from #1, taking job state into account to optimize job-processing.

  • Initial optimization vs. progressive optimization (every N iterations)
  • Subject 1: Communication vs computation overhead
  • Subject 2: Stragglers
  • Subject 3: Task Failures
  • Other subjects: total number of iterations of the job, execution time of each task, CPU cycles, network traffic, kind of failure (heap? other?)

Comply with checkstyle

Now that checkstyle has been imported from REEF, we can work to fix the parts that don't comply.

This could be a good "introductory exercise" for CMSLab's summer interns to get familiar with Dolphin code, REEF style conventions, and GitHub collaboration.

Design a multi-stage programming model

Objective:

  • Design a programming model that is simple but able to express various ML algorithms

Specification:

  • Each Job consists of one or more Stages
  • Stages are executed on the same evaluators, which maintain Contexts, stage by stage
  • Data can be passed among Stages using Key-value Store Service
  • Each Stage follows a BSP programming model
  • Each Stage consists of one Controller Task and one or more homogeneous Compute Tasks
  • The Controller Tasks and Compute Tasks communicate with each other through Group Communication
  • Group Communication includes Broadcast, Reduce, Gather, and Scatter of arbitrary types of data
  • Each Compute Task and Controller Task consists of initialize, run, and cleanup steps
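
The BSP structure of a single Stage can be sketched as the loop below: in each superstep the controller's state is made available to every compute task (broadcast), each task produces a partial result from its own partition, the partial results are combined (reduce), and the controller updates its state. The sketch runs sequentially in one process and all names are hypothetical; it is not the actual Dolphin interface.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

// Hypothetical sketch: a local, sequential simulation of one BSP stage.
final class BspStageSketch {
  // S = controller state, D = one data partition, R = partial result of a compute task.
  static <S, D, R> S runStage(S initial,
                              List<D> partitions,
                              BiFunction<S, D, R> compute,
                              BinaryOperator<R> reduce,
                              BiFunction<S, R, S> update,
                              int iterations) {
    S state = initial;
    for (int i = 0; i < iterations; i++) {
      R reduced = null;
      for (D partition : partitions) {
        // "Broadcast": every compute task sees the current controller state.
        R partial = compute.apply(state, partition);
        // "Reduce": combine partial results as they arrive.
        reduced = (reduced == null) ? partial : reduce.apply(reduced, partial);
      }
      if (reduced != null) {
        state = update.apply(state, reduced); // controller-side update
      }
    }
    return state;
  }
}
```

A multi-stage Job would then be a sequence of such calls, with the result of one stage handed to the next, for example through the Key-value Store Service.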

Add checkstyle

REEF has added checkstyle (run via mvn checkstyle:checkstyle) and is in the process of getting the rules and code to pass. We should apply REEF's latest checkstyle to Dolphin as well. After that, we can work to remove the parts that don't comply.

This could be a good "introductory exercise" for CMSLab's summer interns to get familiar with Dolphin code, REEF style conventions, and GitHub collaboration.

Upgrade to REEF 0.11.0

REEF 0.11.0-incubating has been released on Maven Central. It's time to upgrade to 0.11.0.

  • Update reef.version in pom.xml
  • Remove shimoga.version in pom.xml

Fault tolerance

We have to

  1. specify what to do when a fault occurs,
  2. implement code for fault tolerance,
  3. and test whether this implementation covers various fault cases.

I think that implementing fault tolerance will be closely related to implementing EM (Elastic Memory).

Currently, we just log failed tasks (see TaskFailedHandler in FlexionDriver) and update the group communication topology when it changes (see updateTopology in ControllerTask).
These methods should be modified and improved.

Add PageRank example

PageRank is one of the best-known algorithms.
An example would be helpful for newcomers to Dolphin.

Configurable part of Driver?

ControllerTask and ComputeTask provide UserControllerTask and UserComputeTask, respectively.
Do we not have a similar class for Driver?

Introduce asynchronous parameter server

There are many ways to support data partitioning of a deep neural network in a distributed environment. One of them is to maintain a Parameter Server that communicates with the networks in an asynchronous fashion, as suggested by DistBelief and Adam. We should add such a component to implement data partitioning.
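
To make the idea concrete, here is a minimal in-process sketch of asynchronous pull and push: worker threads fetch the current parameters and push updates whenever they are ready, without waiting for each other. It is not the DistBelief or Adam design, only an illustration. All names are hypothetical, and the constant gradient is a placeholder for what a real worker would compute from the pulled parameters and its own data.

```java
// Hypothetical sketch: threads stand in for distributed workers.
final class AsyncParameterServerSketch {
  private final double[] params;

  AsyncParameterServerSketch(int size) {
    this.params = new double[size];
  }

  // Apply one gradient step; callers never wait for other workers' steps.
  synchronized void push(double[] gradient, double learningRate) {
    for (int i = 0; i < params.length; i++) {
      params[i] -= learningRate * gradient[i];
    }
  }

  // Return a snapshot of the parameters; it may be stale by the time it is used.
  synchronized double[] pull() {
    return params.clone();
  }

  // Runs the given number of workers, each pushing `steps` updates, and returns the final parameters.
  static double[] runWorkers(int workers, int steps) {
    AsyncParameterServerSketch server = new AsyncParameterServerSketch(2);
    Thread[] threads = new Thread[workers];
    for (int w = 0; w < workers; w++) {
      threads[w] = new Thread(() -> {
        for (int s = 0; s < steps; s++) {
          double[] current = server.pull(); // possibly stale snapshot
          double[] gradient = {1.0, -1.0};  // placeholder: would depend on `current` and local data
          server.push(gradient, 0.5);
        }
      });
      threads[w].start();
    }
    for (Thread thread : threads) {
      try {
        thread.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return server.pull();
  }
}
```

The only synchronization is inside the server itself, which contrasts with a Group Communication based design where every worker blocks at each reduce.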

Separate communication groups for different stages

We currently use a separate communication group for each stage. However, the user can't set the number of Tasks differently for each stage, which makes multiple communication groups pretty meaningless. What do you think, @kijungshin?

Rename project

We should rename the current Flexion project.
Please leave comments if you have any ideas.

Tests for dolphin core

We don't have any tests to verify the control flow of dolphin's core. Such tests could use UserControllerTask, UserComputeTask, and group communication mocks to check ControllerTask and ComputeTask are communicating as intended.

Implement a file output service

Objectives:
Implement a file output service.
This service will ultimately be added to the reef-io project.

Specifications:

  • User specifies the path of the output directory (local path or HDFS path) for each stage
  • Each task (Control or Compute) composing a stage creates a separate output file under the directory
  • User writes outputs through an output stream which can be accessed in all the methods of UserCmpTask and UserCtrlTask
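
The directory layout in the specification can be sketched as follows for the local file system case: every task writes to its own file, named after the task, under the stage's output directory. HDFS support and the output stream exposed to UserCmpTask and UserCtrlTask are not covered here, and the class name, method names, and task identifiers are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch: local file system only.
final class TaskOutputSketch {
  // Writes the lines to <outputDir>/<taskId>, creating the directory if needed.
  static Path write(Path outputDir, String taskId, Iterable<String> lines) {
    try {
      Files.createDirectories(outputDir);
      Path file = outputDir.resolve(taskId);
      Files.write(file, lines, StandardCharsets.UTF_8);
      return file;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Reads back a task's output file, e.g. for verification in a test.
  static List<String> read(Path file) {
    try {
      return Files.readAllLines(file, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because each task owns its file, tasks never contend for a shared writer, at the cost of the user having to merge files afterwards if a single output is needed.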
