purine2's Introduction

PURINE2

Purine version 2. This framework is described in the paper Purine: A bi-graph based deep learning framework.

Directory Structure

  • common

Common code used across the project, including abstractions of CUDA, the libuv event loop, etc.

  • caffeine

Code taken from Caffe, mainly math functions and some macros from Caffe's common.hpp.

  • catch

Contains the header file of the CATCH testing framework, which is the unit test framework used in Purine. There is not much unit testing in this code; since the core math functions are based on cuDNN and Caffe, they should be reliable. (During development I did file a bug report against cuDNN; it is fixed in cuDNN v2 RC3.)

  • dispatch

Contains definitions of graph, node, op, blob, etc. A blob wraps a tensor; an op wraps an operation. Unlike Purine version 1, there is no standalone dispatcher; the dispatching code lives inside blob, op, and graph. A graph is constructed by connecting blobs and ops, and the resulting graph is self-dispatchable via graph.run().

  • composite

Contains predefined composite graphs, which can be used to construct larger graphs. For example, all the layers in Caffe can be defined as graphs in Purine; a network can be constructed by further connecting these predefined graphs.

  • operations

Contains operations and the tensor. In this version, a tensor is 4-dimensional (this could be generalized to an ndarray). An operation takes input tensors and generates output tensors; the inputs and outputs of an operation are stored in std::vectors. Operations can take parameters; for example, the parameters of convolution include padding size, stride, etc. The operations folder contains a number of predefined operations.

  • tests

Unit tests of the project.

Tensor and Operation

Tensor

Tensors and operations are the two basic components in Purine. As in Caffe, a tensor in Purine is 4-dimensional (num, channel, height, width), which is convenient for image data. In MPI, rank denotes the process id; in Purine, rank is used as the machine id (which means there is only one process per machine). A tensor can reside on any rank, and the rank of a tensor can be obtained by calling its rank() function. On the same rank, tensors can be on different devices, so there is another function, device(), which returns the device id on which the tensor resides. In Purine, negative device ids are reserved for the CPU; ids greater than or equal to zero denote GPUs.

Operation

The constructor of Operation takes two vectors of tensors, one as input and one as output. For example, the convolution operation takes { bottom, weight } as input and outputs { top }. The constructor checks that the input and output tensors are correct in size, location, etc. The compute_cpu and compute_gpu functions implement the operation on CPU and GPU respectively. They take a const vector<bool>& argument, which has the same size as the outputs and denotes whether each computed result should be written to the output tensor or added to it. Purine has enough built-in operations for everyday deep learning use, wrapping most of the functions in NVIDIA's cuDNN package.

Connection

We can do almost everything by defining a set of tensors and operating on them with different operations sequentially. The computation logic is implemented by connecting operations with tensors, which forms a bipartite graph (operations never connect directly to other operations, nor do tensors).

How to execute the calculation sequence stored in the graph?

We want an operation to run when and only when all of its inputs are ready. This requires a counter per operation: each input that becomes ready triggers a +1 on the counter, and when the counter reaches the number of inputs, the operation emits a ready signal and starts to compute.

The same applies to a tensor: we want the tensor to emit its ready signal only when results have been received from all incoming operations. Thus a counter is also needed for each tensor.

The counter is part of neither the operation nor the tensor, but it is needed when executing the graph. That is why Op and Blob are introduced as wrappers of operation and tensor respectively, so that the counter can be stored in the Op/Blob. In Purine, the computation logic is stored in the bipartite graph consisting of Ops and Blobs.

Connection types:

  1. { tensor } >> Op

  2. Op >> { tensor }

  3. { tensor } >> Connectable

  4. Connectable >> { tensor }

  5. Connectable >> Connectable

The >> operator works by calling the set_input and set_output functions of Connectable.

Example: constructing a graph.

First, construct a Runnable:

Runnable run;

Create nodes in the Runnable:

Blob* bottom = run.create("bottom_name", Size{128, 10, 1, 1});
Blob* weight = run.create("weight_name", Size{16, 10, 1, 1});

Op<Inner>* inner_prod = run.create<Inner>("inner_prod_name", "main", Inner::param_tuple());
// param_tuple is typedefed in the class `Inner` as a tuple<...>.
// It lists the arguments needed when constructing the operation.
// In this case the `Inner` operation does not take any arguments.

Blob* top = run.create("top_name", Size{128, 16, 1, 1});
// connect them
vector<Blob*>{ bottom, weight } >> *inner_prod >> vector<Blob*>{ top };

// call run
run.run();
// The graph will be executed from sources to sinks. In a real case you
// would of course set initial values on the Blobs first.

Examples

There are two examples under the examples folder.

  1. Network in Network on the CIFAR-10 dataset, which achieves a 10.4% error rate.

  2. GoogLeNet. We run GoogLeNet on 12 GPUs using data parallelism; it converges in 20 hours. (If your GPUs are high-end ones, which are more stable in temperature, this could be reduced to 17 hours.) The error rate is 12.7%. (Higher performance may require some tuning, as the batch size is quite large compared to Caffe's setting.)

Data parallelism is used in both examples above because the fully connected layers are replaced by a global pooling layer; thus the number of parameters is small, which makes data parallelism suitable.

License

Purine is released under the BSD 2-Clause license.

purine2's People

Contributors

mavenlin, purine

purine2's Issues

Log problem

hi,

I am trying the googlenet example. I found that it runs, but there is no log output, and I don't know why. I also added some test lines in the main function:

LOG(INFO)<< "test";
MPI_LOG(<<"test";);

There is no output either.

Can't compile test_layers.

It seems that test_layers.cpp uses header files from Caffe. But after I pasted those files, the test still can't compile. There are many redefinition errors; some things are already defined in the common folder. So which files do I need to paste?

Important question about InceptionLayer!

In file Inception.hpp

bottom_ >> *one_;
bottom_ >> *three_reduce_ >> *three_;
bottom_ >> *five_reduce_ >> *five_;
bottom_ >> *max_pool_ >> *pool_proj_;

When computing the backward pass to bottom, I cannot see any merge of the gradients from the 4 branches, just an overwrite. Why? Comparing with how other frameworks build GoogLeNet (e.g. the cxxnet example), they have a split layer after bottom.

invalid conversion from "const void *" to "void *"

Building mpi.cpp fails with an error like this:

/home/yaocx/workspace/purine2/operations/src/mpi.cpp:14:44: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]

I just added a (void*) cast in line 14 of mpi.cpp to make it build:

MPI_CHECK(MPI_Isend((void*)inputs_[0]->cpu_data(), inputs_[0]->size().count(),
MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &request));

Or maybe you can change the makefile to add the -fpermissive flag.

Is there any better advice?

example nin_cifar does not work

2.411124 0.132812
2.542028 0.132812
2.302772 0.125000
2.312779 0.109375
2.310245 0.117188
2.302449 0.101562
2.306294 0.085938
2.301918 0.148438
2.302899 0.031250
2.302605 0.132812
2.302575 0.132812
2.302580 0.109375
2.302587 0.093750
2.302583 0.117188
2.302583 0.054688
2.302583 0.093750
2.302583 0.070312
2.302583 0.109375
2.302583 0.078125
2.302583 0.125000
2.302583 0.109375
2.302583 0.062500
2.302583 0.109375
2.302583 0.093750
2.302583 0.125000
2.302583 0.109375
2.302583 0.070312
2.302583 0.093750

I printed the loss and accuracy and found that the loss becomes constant after several iterations and the net does not converge. I think it is a bug.

By the way, the googlenet example works.

Question about the conv layer

When I moved batch normalization from a branch of Caffe to Purine, I found that the conv layer creates blobs for temporary data that share the top tensors, in conv_layer.hpp:

Blob* tmp_data = create("before_act", top_[0]->shared_tensor());
Blob* tmp_diff = create("before_act_diff", top_[1]->shared_tensor());

and binds the activation:

B{ tmp_data, tmp_diff } >> *act >> top_;

Because of the shared tensor, the memory of tmp_data is top_[0]'s memory. After the forward pass, the activation's bottom[0] will be overwritten, which can lead to errors in activation_down.

I implemented my batch normalization in Purine. When I shared data just like ConvLayer does, my training result (GoogLeNet with batch normalization) was bad, while using independent temporary data works fine. So is the conv layer correct?

MPI errors when compiling examples.

I post my question here, hoping more people will look at it.
I am using OpenMPI 1.6.5.

(pyenv)[zxx@ga85 examples]$ make googlenet
[ 2%] Built target googlenet.o
[ 10%] Built target purine_cu
[ 15%] Built target proto
[100%] Built target purine
Linking CXX executable ../test/googlenet
CMakeFiles/googlenet.o.dir/googlenet.cpp.o: In function `MPI::Intracomm::Intracomm()':
googlenet.cpp:(.text._ZN3MPI9IntracommC2Ev[_ZN3MPI9IntracommC5Ev]+0x14): undefined reference to `MPI::Comm::Comm()'
CMakeFiles/googlenet.o.dir/googlenet.cpp.o: In function `MPI::Intracomm::Intracomm(ompi_communicator_t*)':
googlenet.cpp:(.text._ZN3MPI9IntracommC2EP19ompi_communicator_t[_ZN3MPI9IntracommC5EP19ompi_communicator_t]+0x19): undefined reference to `MPI::Comm::Comm()'
CMakeFiles/googlenet.o.dir/googlenet.cpp.o: In function `MPI::Op::Init(void (*)(void const*, void*, int, MPI::Datatype const&), bool)':
googlenet.cpp:(.text._ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb[_ZN3MPI2Op4InitEPFvPKvPviRKNS_8DatatypeEEb]+0x24): undefined reference to `ompi_mpi_cxx_op_intercept'
CMakeFiles/googlenet.o.dir/googlenet.cpp.o:(.rodata._ZTVN3MPI3WinE[_ZTVN3MPI3WinE]+0x48): undefined reference to `MPI::Win::Free()'
CMakeFiles/googlenet.o.dir/googlenet.cpp.o:(.rodata._ZTVN3MPI8DatatypeE[_ZTVN3MPI8DatatypeE]+0x78): undefined reference to `MPI::Datatype::Free()'
collect2: error: ld returned 1 exit status
make[3]: *** [test/googlenet] Error 1
make[2]: *** [examples/CMakeFiles/googlenet.dir/all] Error 2
make[1]: *** [examples/CMakeFiles/googlenet.dir/rule] Error 2
make: *** [examples/CMakeFiles/googlenet.dir/rule] Error 2

How to compile example nin_cifar10.cpp

Does anyone know how to compile the examples? After "make all" at ${purine}, it built purine.a at ${purine}. But when I type "make nin_cifar10", it reports an error. Am I doing it correctly? Please help.

Randomly stop training in nin_cifar10

Hi,
After struggling through all the compile problems and a quick lesson about MPI, I can finally run the nin_cifar10 example, but I find that the training process may hang and GPU utilization falls to 0% at any time during training (sometimes after hundreds of iterations, sometimes more). Could this be a problem with async()/sync()?
Thanks
