
iga-adi-giraph's Introduction

IGA-ADI Giraph Solver

Build Status

Prerequisites

You need JDK 11 installed to compile the project; you might want to use SDKMAN to manage it. You also need Maven 3.5.3 installed on your system. For processing the results you need Excel and Node 12.10.0; nvm (Node Version Manager) can help you manage the latter.
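
For example, with SDKMAN and nvm installed, the toolchain can be set up roughly as follows (the exact SDKMAN version identifiers change over time, so treat them as placeholders):

sdk install java 11.0.2-open   # any JDK 11 distribution will do
sdk install maven 3.5.3
nvm install 12.10.0
nvm use 12.10.0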

How to run

This solver can be run in the cloud in a matter of minutes. The scripts in this repository are prepared for Google Cloud Platform (GCP), although the process would work in a similar way in any cloud. In fact, it has been tested on Azure and AWS.

First, you have to create an appropriate Hadoop cluster. On GCP this service is called Dataproc.

Modify one of the two scripts to match your needs:

  • bin/local/create.cluster.sh, good for running experiments

  • bin/local/create.singlenode.cluster.sh, good for testing the setup

The most important options there are (put together in the sketch after this list):

  • --master-machine-type=n1-standard-4, which selects the node type for the master

  • --worker-machine-type=n1-standard-8, which selects the node type for the workers

  • --master-min-cpu-platform="Intel Skylake", which selects the minimum CPU platform for the master

  • --worker-min-cpu-platform="Intel Skylake", which selects the minimum CPU platform for the workers

  • --num-workers=4, which selects the number of workers
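
Put together, the relevant part of such a script is essentially a single gcloud invocation; a minimal sketch, assuming a cluster named iga-adi and that your project and region are already configured:

gcloud dataproc clusters create iga-adi \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-8 \
  --master-min-cpu-platform="Intel Skylake" \
  --worker-min-cpu-platform="Intel Skylake" \
  --num-workers=4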

Once you have modified the script to your liking, execute it and wait for the cluster to be created. Next, you can issue the command that packages the solver and publishes it, along with all necessary scripts, to the master node of your newly created cluster.

./bin/local/publish.cloud.sh <your master instance number>

where <your master instance number> defaults to iga-adi-m
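
Internally, publishing presumably amounts to a Maven build followed by a copy to the master node; a minimal sketch, assuming the gcloud CLI (the jar path and the set of copied scripts are assumptions):

MASTER="${1:-iga-adi-m}"
mvn package                                                   # build the solver jar
gcloud compute scp target/*.jar bin/cluster/* "${MASTER}:~"   # hypothetical artefact paths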

Then you need to connect to the instance.

./bin/local/connect.sh <your master instance number>

where <your master instance number> defaults to iga-adi-m

Running the experiments

Once you are connected, you have multiple ways of running the experiments. You can run a series of experiments, one for each value of a modified parameter, as in the following example:

IGA_PROBLEM_SIZE=3072 \
IGA_WORKERS=4 \
IGA_WORKER_MEMORY=8 \
IGA_WORKER_CORES=8 \
IGA_STEPS=2 \
RUNS=1 \
SUITE_NAME=my-experiment-name \
IGA_CONTAINER_JVM_OPTIONS="-XX:+PrintFlagsFinal -XX:+UnlockDiagnosticVMOptions -XX:+UseParallelGC -XX:+UseParallelOldGC" \
./run.suite.sh IGA_WORKERS 4 2 1

Here, all values passed as environment variables stay fixed for the whole suite, while the number of workers changes: the variable IGA_WORKERS is set first to 4, then 2, and finally 1. You can sweep any variable this way; just fix the ones you are not sweeping (for instance, fix the number of workers if you sweep something else).
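
In other words, run.suite.sh takes the name of the variable to sweep followed by the values to sweep over. A minimal sketch of the loop it presumably performs (run.sh as the single-experiment entry point is a hypothetical name):

SWEEP_VAR=$1; shift
for value in "$@"; do
  export "$SWEEP_VAR=$value"
  for run in $(seq 1 "${RUNS:-1}"); do
    ./run.sh   # hypothetical single-experiment entry point
  done
done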

You may want to change the value of SUITE_NAME to keep your result files in order, as they will be catalogued under this name.

You can also define your own test suites and keep them in the repository. See bin/suites for the details. For instance, bin/cluster/suites/03-explicit-configs-to-run.sh runs a list of explicit configurations in sequence.

./suites/<your suite>.sh
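
A suite is just a shell script that fixes some variables and sweeps another. For example, a hypothetical worker-scaling suite could look like this:

#!/usr/bin/env bash
# Hypothetical suite: sweep the worker count with everything else fixed.
IGA_PROBLEM_SIZE=3072 \
IGA_WORKER_MEMORY=8 \
IGA_WORKER_CORES=8 \
IGA_STEPS=2 \
RUNS=3 \
SUITE_NAME=worker-scaling \
./run.suite.sh IGA_WORKERS 8 4 2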

Once the experiment is complete, make sure to retrieve the results to your local machine before you delete the cluster. Issue the following command from your local machine.

./bin/local/retrieve.cloud.sh <your master instance number>

where <your master instance number> defaults to iga-adi-m
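
Under the hood this presumably copies the logs back over SSH; a minimal sketch, assuming the gcloud CLI and that the results live in a logs directory on the master (the remote path is an assumption):

MASTER="${1:-iga-adi-m}"
gcloud compute scp --recurse "${MASTER}:~/logs" .   # hypothetical remote path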

Processing the results

This repository contains a number of scripts for computing various statistics from the results generated in the experiments and for visualising their properties.

Once you have retrieved your results, look into the logs directory. There will be a separate directory for each run in your experiment suite, named with the suite name followed by a unique application identifier.

To process the results, aggregate them into a structure similar to the one in the results-sample directory, that is, grouped by problem size.
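
For example, with two problem sizes the grouping might look like this (directory names and globs are placeholders for your actual run directories):

mkdir -p my-suite/1536 my-suite/3072
mv logs/my-experiment-name-*1536* my-suite/1536/   # adjust the globs to your run names
mv logs/my-experiment-name-*3072* my-suite/3072/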

Once you do this, you should be able to run

./results-external-extraction/extract-suite.sh <the directory of your suite>

where <the directory of your suite> is the base directory containing the per-problem-size directories. This prints a CSV file to the console. Copy it into the template Excel file located under results-external-extraction/scalability_template.xlsx; you might need to use the regular "text to columns" functionality to fill individual cells correctly. The template calculates speedup and other global metrics which are necessary for some visualisations. Save the result as a separate file.
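
Instead of copying the CSV from the console, you can also redirect it to a file (the suite directory name here is hypothetical):

./results-external-extraction/extract-suite.sh my-suite > my-suite.csv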

Most of the time you will also want to inspect the internals of your experiments, that is, what was happening in the cluster. For that, run:

node build/main/index.js -i <the path to your simulations directory> -o <the path to the output excel file>

This should produce an Excel file with many rows and columns, each row describing a particular superstep across all experiments.

Finally, using these two Excel files, you can regenerate images. Do this by executing:

./results-charts/regenerate-images.sh <SCALABILITY_XLSX> <SUPERSTEPS_XLSX>

This will take some time depending on the number of your experiments (sometimes even an hour). The images will be generated continuously under the results-charts/out directory.

iga-adi-giraph's Issues

Allow odd worker count

For now, odd worker counts break partitioning. Judging by the trace below, the partitioning strategy takes log2 of the worker count with rounding mode UNNECESSARY, which throws unless the argument is a power of two, so any worker count that is not a power of two fails:

Exception in thread "org.apache.giraph.master.MasterThread" java.lang.IllegalStateException: java.lang.ArithmeticException: mode was UNNECESSARY, but rounding was necessary
	at org.apache.giraph.master.MasterThread.run(MasterThread.java:201)
Caused by: java.lang.ArithmeticException: mode was UNNECESSARY, but rounding was necessary
	at com.google.common.math.MathPreconditions.checkRoundingUnnecessary(MathPreconditions.java:81)
	at com.google.common.math.IntMath.log2(IntMath.java:91)
	at edu.agh.iga.adi.giraph.direction.PartitioningStrategy.partitioningStrategy(PartitioningStrategy.java:40)
	at edu.agh.iga.adi.giraph.direction.io.IgaTreeSplitter.allSplitsFor(IgaTreeSplitter.java:25)
	at edu.agh.iga.adi.giraph.direction.io.InMemoryStepInputFormat.getSplits(InMemoryStepInputFormat.java:77)
	at org.apache.giraph.io.internal.WrappedVertexInputFormat.getSplits(WrappedVertexInputFormat.java:72)
	at org.apache.giraph.master.BspServiceMaster.generateInputSplits(BspServiceMaster.java:329)
	at org.apache.giraph.master.BspServiceMaster.createInputSplits(BspServiceMaster.java:624)
	at org.apache.giraph.master.BspServiceMaster.createVertexInputSplits(BspServiceMaster.java:668)
	at org.apache.giraph.master.MasterThread.run(MasterThread.java:113)

Ensure OutputFormat thread safety

If VERTEX_OUTPUT_FORMAT_THREAD_SAFE is set to true and NUM_COMPUTE_THREADS is set to more than one thread (which, by default, carries over to NUM_OUTPUT_THREADS as well), then we get:

2019-09-22 09:04:53,987 ERROR [org.apache.giraph.utils.LogStacktraceCallable] - Execution of callable failed
java.lang.IllegalStateException: getVertexWriter: IOException occurred
	at org.apache.giraph.io.superstep_output.MultiThreadedSuperstepOutput.getVertexWriter(MultiThreadedSuperstepOutput.java:89)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to CREATE_FILE /user/kbhit/1569143060/_temporary/1/_temporary/attempt_1569140710858_0005_m_000001_1/step-0/part-m-00001 for DFSClient_NONMAPREDUCE_-1931685888_1 on 10.164.0.19 because DFSClient_NONMAPREDUCE_-1931685888_1 is already the current lease holder.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2412)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:357)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2309)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2230)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:745)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1507)
	at org.apache.hadoop.ipc.Client.call(Client.java:1453)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy10.create(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy11.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:267)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1206)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1148)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:480)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:477)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:477)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:418)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1067)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1048)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:937)
	at org.apache.giraph.io.formats.GiraphTextOutputFormat.getRecordWriter(GiraphTextOutputFormat.java:67)
	at org.apache.giraph.io.formats.TextVertexOutputFormat$TextVertexWriter.createLineRecordWriter(TextVertexOutputFormat.java:116)
	at org.apache.giraph.io.formats.TextVertexOutputFormat$TextVertexWriter.initialize(TextVertexOutputFormat.java:97)
	at edu.agh.iga.adi.giraph.direction.io.StepVertexOutputFormat$IdWithValueVertexWriter.initialize(StepVertexOutputFormat.java:80)
	at org.apache.giraph.io.internal.WrappedVertexOutputFormat$1.initialize(WrappedVertexOutputFormat.java:82)
	at org.apache.giraph.io.superstep_output.MultiThreadedSuperstepOutput.getVertexWriter(MultiThreadedSuperstepOutput.java:87)
	... 7 more
2019-09-22 09:04:53,997 ERROR [org.apache.giraph.worker.BspServiceWorker] - unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1569140710858_0005/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/iga-adi-w-1.europe-west4-a.c.charismatic-cab-252315.internal_1 on superstep 0
2019-09-22 09:04:54,000 ERROR [org.apache.giraph.yarn.GiraphYarnTask] - GiraphYarnTask threw a top-level exception, failing task
java.lang.RuntimeException: run: Caught an unrecoverable exception Exception occurred
	at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:106)
	at org.apache.giraph.yarn.GiraphYarnTask.main(GiraphYarnTask.java:184)
Caused by: java.lang.IllegalStateException: Exception occurred
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:274)
	at org.apache.giraph.graph.GraphTaskManager.processGraphPartitions(GraphTaskManager.java:813)
	at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:361)
	at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:93)
	... 1 more
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: getVertexWriter: IOException occurred
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
	at org.apache.giraph.utils.ProgressableUtils.getResultsWithNCallables(ProgressableUtils.java:271)
	... 4 more
Caused by: java.lang.IllegalStateException: getVertexWriter: IOException occurred
	at org.apache.giraph.io.superstep_output.MultiThreadedSuperstepOutput.getVertexWriter(MultiThreadedSuperstepOutput.java:89)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:153)
	at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
	at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to CREATE_FILE /user/kbhit/1569143060/_temporary/1/_temporary/attempt_1569140710858_0005_m_000001_1/step-0/part-m-00001 for DFSClient_NONMAPREDUCE_-1931685888_1 on 10.164.0.19 because DFSClient_NONMAPREDUCE_-1931685888_1 is already the current lease holder.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2412)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:357)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2309)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2230)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:745)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1507)
	at org.apache.hadoop.ipc.Client.call(Client.java:1453)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy10.create(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
	at com.sun.proxy.$Proxy11.create(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:267)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1206)
	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1148)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:480)
	at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:477)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:477)
	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:418)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1067)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1048)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:937)
	at org.apache.giraph.io.formats.GiraphTextOutputFormat.getRecordWriter(GiraphTextOutputFormat.java:67)
	at org.apache.giraph.io.formats.TextVertexOutputFormat$TextVertexWriter.createLineRecordWriter(TextVertexOutputFormat.java:116)
	at org.apache.giraph.io.formats.TextVertexOutputFormat$TextVertexWriter.initialize(TextVertexOutputFormat.java:97)
	at edu.agh.iga.adi.giraph.direction.io.StepVertexOutputFormat$IdWithValueVertexWriter.initialize(StepVertexOutputFormat.java:80)
	at org.apache.giraph.io.internal.WrappedVertexOutputFormat$1.initialize(WrappedVertexOutputFormat.java:82)
	at org.apache.giraph.io.superstep_output.MultiThreadedSuperstepOutput.getVertexWriter(MultiThreadedSuperstepOutput.java:87)
	... 7 more

Remove all forms of boxing

Right now some streams call .boxed(), which boxes every element into a wrapper object and is unacceptable for memory-efficiency reasons. Make sure to remove all of these.

Don't store matrices needlessly

The big X matrices are first needed only when we perform backward substitution. At any given point in time we only need the parent-level and child-level matrices, so we can create new ones as we go while deleting the old ones. This can provide a substantial memory optimisation.

Shrink matrices at the leaves and elsewhere

Right now we use matrices of the same size everywhere. This does not make sense: at the leaves the system is only 3x3, not 6x6 (and the waste is even worse for the X and B matrices, which tend to have a lot of columns).
