
iga-adi-giraph's Introduction

IGA-ADI Giraph Solver

Build Status

Prerequisites

You need JDK 11 installed to compile the project; you might want to use SDKMAN to manage it. You also need Maven 3.5.3 installed on your system. For processing the results you need Excel and Node 12.10.0; nvm (Node Version Manager) can help you manage the latter.
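
For example, with SDKMAN and nvm installed, the toolchain can be set up roughly as follows (the exact SDKMAN version identifiers change over time, so treat them as placeholders):

sdk install java 11.0.2-open   # any JDK 11 distribution will do
sdk install maven 3.5.3
nvm install 12.10.0
nvm use 12.10.0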

How to run

This solver can be run in the cloud in a matter of minutes. The scripts in this repository are prepared for Google Cloud Platform (GCP), although the process would work in a similar way in any cloud. In fact, it has been tested on Azure and AWS.

First, you have to create an appropriate Hadoop cluster. On GCP this service is called Dataproc.

Modify one of the two scripts to match your needs:

  • bin/local/create.cluster.sh, good for running experiments

  • bin/local/create.singlenode.cluster.sh, good for testing the setup

The most important options there are (put together in the sketch after this list):

  • --master-machine-type=n1-standard-4, which selects the node type for the master

  • --worker-machine-type=n1-standard-8, which selects the node type for the workers

  • --master-min-cpu-platform="Intel Skylake", which selects the minimum CPU platform for the master

  • --worker-min-cpu-platform="Intel Skylake", which selects the minimum CPU platform for the workers

  • --num-workers=4, which selects the number of workers
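
Put together, the relevant part of such a script is essentially a single gcloud invocation; a minimal sketch, assuming a cluster named iga-adi and that your project and region are already configured:

gcloud dataproc clusters create iga-adi \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-8 \
  --master-min-cpu-platform="Intel Skylake" \
  --worker-min-cpu-platform="Intel Skylake" \
  --num-workers=4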

Once you have modified the script to your liking, execute it and wait for the cluster to be created. Next, you can issue the command that packages the solver and publishes it, along with all necessary scripts, to the master node of your newly created cluster.

./bin/local/publish.cloud.sh <your master instance number>

where <your master instance number> defaults to iga-adi-m
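
Internally, publishing presumably amounts to a Maven build followed by a copy to the master node; a minimal sketch, assuming the gcloud CLI (the jar path and the set of copied scripts are assumptions):

MASTER="${1:-iga-adi-m}"
mvn package                                                   # build the solver jar
gcloud compute scp target/*.jar bin/cluster/* "${MASTER}:~"   # hypothetical artefact paths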

Then you need to connect to the instance.

./bin/local/connect.sh <your master instance number>

where <your master instance number> defaults to iga-adi-m

Running the experiments

Once you are connected, you have multiple ways of running the experiments. You can run a series of experiments, one for each value of a modified parameter, as in the following example:

IGA_PROBLEM_SIZE=3072 \
IGA_WORKERS=4 \
IGA_WORKER_MEMORY=8 \
IGA_WORKER_CORES=8 \
IGA_STEPS=2 \
RUNS=1 \
SUITE_NAME=my-experiment-name \
IGA_CONTAINER_JVM_OPTIONS="-XX:+PrintFlagsFinal -XX:+UnlockDiagnosticVMOptions -XX:+UseParallelGC -XX:+UseParallelOldGC" \
./run.suite.sh IGA_WORKERS 4 2 1

Here, all values passed as environment variables stay fixed for the whole suite, while the number of workers changes: the variable IGA_WORKERS is set first to 4, then 2, and finally 1. You can sweep any variable this way; just fix the ones you are not sweeping (for instance, fix the number of workers if you sweep something else).
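
In other words, run.suite.sh takes the name of the variable to sweep followed by the values to sweep over. A minimal sketch of the loop it presumably performs (run.sh as the single-experiment entry point is a hypothetical name):

SWEEP_VAR=$1; shift
for value in "$@"; do
  export "$SWEEP_VAR=$value"
  for run in $(seq 1 "${RUNS:-1}"); do
    ./run.sh   # hypothetical single-experiment entry point
  done
done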

You may want to change the value of SUITE_NAME to keep your result files in order, as they will be catalogued under this name.

You can also define your own test suites and keep them in the repository. See bin/suites for the details. For instance, bin/cluster/suites/03-explicit-configs-to-run.sh runs a list of explicit configurations in sequence.

./suites/<your suite>.sh
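
A suite is just a shell script that fixes some variables and sweeps another. For example, a hypothetical worker-scaling suite could look like this:

#!/usr/bin/env bash
# Hypothetical suite: sweep the worker count with everything else fixed.
IGA_PROBLEM_SIZE=3072 \
IGA_WORKER_MEMORY=8 \
IGA_WORKER_CORES=8 \
IGA_STEPS=2 \
RUNS=3 \
SUITE_NAME=worker-scaling \
./run.suite.sh IGA_WORKERS 8 4 2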

Once the experiment is complete, make sure to retrieve the results to your local machine before you delete the cluster. Issue the following command from your local machine.

./bin/local/retrieve.cloud.sh <your master instance number>

where <your master instance number> defaults to iga-adi-m
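
Under the hood this presumably copies the logs back over SSH; a minimal sketch, assuming the gcloud CLI and that the results live in a logs directory on the master (the remote path is an assumption):

MASTER="${1:-iga-adi-m}"
gcloud compute scp --recurse "${MASTER}:~/logs" .   # hypothetical remote path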

Processing the results

This repository contains a number of scripts for computing various statistics from the results generated in the experiments and for visualising their properties.

Once you have retrieved your results, look into the logs directory. There will be a separate directory for each run in your experiment suite, named with the suite name followed by a unique application identifier.

To process the results, aggregate them into a structure similar to the one in the results-sample directory, that is, grouped by problem size.
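
For example, with two problem sizes the grouping might look like this (directory names and globs are placeholders for your actual run directories):

mkdir -p my-suite/1536 my-suite/3072
mv logs/my-experiment-name-*1536* my-suite/1536/   # adjust the globs to your run names
mv logs/my-experiment-name-*3072* my-suite/3072/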

Once you do this, you should be able to run

./results-external-extraction/extract-suite.sh <the directory of your suite>

where <the directory of your suite> is the base directory containing the per-problem-size directories. This prints a CSV file to the console. Copy it into the template Excel file located under results-external-extraction/scalability_template.xlsx; you might need to use the regular "text to columns" functionality to fill individual cells correctly. The template calculates speedup and other global metrics which are necessary for some visualisations. Save the result as a separate file.
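
Instead of copying the CSV from the console, you can also redirect it to a file (the suite directory name here is hypothetical):

./results-external-extraction/extract-suite.sh my-suite > my-suite.csv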

Most of the time you will also want to inspect the internals of your experiments, that is, what was happening in the cluster. For that, run:

node build/main/index.js -i <the path to your simulations directory> -o <the path to the output excel file>

This should produce an Excel file with many rows and columns, each row describing a particular superstep across all experiments.

Finally, using these two Excel files, you can regenerate images. Do this by executing:

./results-charts/regenerate-images.sh <SCALABILITY_XLSX> <SUPERSTEPS_XLSX>

This will take some time depending on the number of your experiments (sometimes even an hour). The images will be generated continuously under the results-charts/out directory.

iga-adi-giraph's Issues

Allow odd worker count

For now, odd worker counts break partitioning. Judging by the trace below, the partitioning strategy takes log2 of the worker count with rounding mode UNNECESSARY, which throws unless the argument is a power of two, so any worker count that is not a power of two fails:

Exception in thread "org.apache.giraph.master.MasterThread" java.lang.IllegalStateException: java.lang.ArithmeticException: mode was UNNECESSARY, but rounding was necessary
	at org.apache.giraph.master.MasterThread.run(MasterThread.java:201)
Caused by: java.lang.ArithmeticException: mode was UNNECESSARY, but rounding was necessary
	at com.google.common.math.MathPreconditions.checkRoundingUnnecessary(MathPreconditions.java:81)
	at com.google.common.math.IntMath.log2(IntMath.java:91)
	at edu.agh.iga.adi.giraph.direction.PartitioningStrategy.partitioningStrategy(PartitioningStrategy.java:40)
	at edu.agh.iga.adi.giraph.direction.io.IgaTreeSplitter.allSplitsFor(IgaTreeSplitter.java:25)
	at edu.agh.iga.adi.giraph.direction.io.InMemoryStepInputFormat.getSplits(InMemoryStepInputFormat.java:77)
	at org.apache.giraph.io.internal.WrappedVertexInputFormat.getSplits(WrappedVertexInputFormat.java:72)
	at org.apache.giraph.master.BspServiceMaster.generateInputSplits(BspServiceMaster.java:329)
	at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:624)
	at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:668)
	at org.apache.giraph.master.MasterThread.run(MasterThread.java:113)

Ensure OutputFormat thread safety

If VERTEX_OUTPUT_FORMAT_THREAD_SAFE is set to true and NUM_COMPUTE_THREADS is set to more than one thread (which, by default, carries over to NUM_OUTPUT_THREADS as well), then we get:

2019-09-22 09:04:53,987 ERROR [org.apache.giraph.utils.LogStacktraceCallable] - Execution of callable failed
java.lang.IllegalStateException: getVertexWriter: IOException occurred
	at org.apache.giraph.io.superstep_output.MultiThreadedSuperstepOutput.getVertexWriter(MultiThreadedSuperstepOutput.java:89)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to CREATE_FILE /user/kbhit/1569143060/_temporary/1/_temporary/attempt_1569140710858_0005_m_000001_1/step-0/part-m-00001 for DFSClient_NONMAPREDUCE_-1931685888_1 on 10.164.0.19 because DFSClient_NONMAPREDUCE_-1931685888_1 is already the current lease holder.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2412)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:357)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2309)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2230)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:745)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1507)
	at org.apache.hadoop.ipc.Client.call(Client.java:1453)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy10.create(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy11.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:267)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1206)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1148)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:480)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:477)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:477)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:418)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1067)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1048)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:937)
	at org.apache.giraph.io.formats.GiraphTextOutputFormat.getRecordWriter(GiraphTextOutputFormat.java:67)
	at org.apache.giraph.io.formats.TextVertexOutputFormat$TextVertexWriter.createLineRecordWriter(TextVertexOutputFormat.java:116)
	at org.apache.giraph.io.formats.TextVertexOutputFormat$TextVertexWriter.initialize(TextVertexOutputFormat.java:97)
	at edu.agh.iga.adi.giraph.direction.io.StepVertexOutputFormat$IdWithValueVertexWriter.initialize(StepVertexOutputFormat.java:80)
	at org.apache.giraph.io.internal.WrappedVertexOutputFormat$1.initialize(WrappedVertexOutputFormat.java:82)
	at org.apache.giraph.io.superstep_output.MultiThreadedSuperstepOutput.getVertexWriter(MultiThreadedSuperstepOutput.java:87)
	... 7 more
2019-09-22 09:04:53,997 ERROR [org.apache.giraph.worker.BspServiceWorker] - unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1569140710858_0005/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/iga-adi-w-1.europe-west4-a.c.charismatic-cab-252315.internal_1 on superstep 0
2019-09-22 09:04:54,000 ERROR [org.apache.giraph.yarn.GiraphYarnTask] - GiraphYarnTask threw a top-level exception, failing task
java.lang.RuntimeException: run: Caught an unrecoverable exception Exception occurred
	at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:106)
	at org.apache.giraph.yarn.GiraphYarnTask.main(GiraphYarnTask.java:184)
Caused by: java.lang.IllegalStateException: Exception occurred
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:274)
	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:813)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:361)
	at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:93)
	... 1 more
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: getVertexWriter: IOException occurred
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:271)
	... 4 more
Caused by: java.lang.IllegalStateException: getVertexWriter: IOException occurred
	at org.apache.giraph.io.superstep_output.MultiThreadedSuperstepOutput.getVertexWriter(MultiThreadedSuperstepOutput.java:89)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to CREATE_FILE /user/kbhit/1569143060/_temporary/1/_temporary/attempt_1569140710858_0005_m_000001_1/step-0/part-m-00001 for DFSClient_NONMAPREDUCE_-1931685888_1 on 10.164.0.19 because DFSClient_NONMAPREDUCE_-1931685888_1 is already the current lease holder.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2412)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:357)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2309)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2230)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:745)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1507)
	at org.apache.hadoop.ipc.Client.call(Client.java:1453)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy10.create(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy11.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:267)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1206)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1148)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:480)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:477)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:477)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:418)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1067)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1048)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:937)
	at org.apache.giraph.io.formats.GiraphTextOutputFormat.getRecordWriter(GiraphTextOutputFormat.java:67)
	at org.apache.giraph.io.formats.TextVertexOutputFormat$TextVertexWriter.createLineRecordWriter(TextVertexOutputFormat.java:116)
	at org.apache.giraph.io.formats.TextVertexOutputFormat$TextVertexWriter.initialize(TextVertexOutputFormat.java:97)
	at edu.agh.iga.adi.giraph.direction.io.StepVertexOutputFormat$IdWithValueVertexWriter.initialize(StepVertexOutputFormat.java:80)
	at org.apache.giraph.io.internal.WrappedVertexOutputFormat$1.initialize(WrappedVertexOutputFormat.java:82)
	at org.apache.giraph.io.superstep_output.MultiThreadedSuperstepOutput.getVertexWriter(MultiThreadedSuperstepOutput.java:87)
	... 7 more

Remove all forms of boxing

Right now some streams call .boxed(), which boxes every element into a wrapper object and is unacceptable for memory-efficiency reasons. Make sure to remove all of these.

Don't store matrices needlessly

The big X matrices are first needed only when we perform backward substitution. At any given point in time we only need the parent-level and child-level matrices, so we can create new ones as we go while deleting the old ones. This can provide a substantial memory optimisation.

Shrink matrices at the leaves and elsewhere

Right now we use matrices of the same size everywhere. This does not make sense: at the leaves the system is only 3x3, not 6x6 (and the waste is even worse for the X and B matrices, which tend to have a lot of columns).
