darma-tasking / lb-analysis-framework

Analysis framework for exploring, testing, and comparing load balancing strategies

License: Other

Python 96.36% Perl 0.38% Dockerfile 0.35% Shell 0.08% CMake 0.03% Fortran 2.81%

concurrency distributed-computing hpc load-balancing parallelism

lb-analysis-framework's Issues

Integrate communication graph to data and execution models of LBS

This follows #4 in particular.
@lifflander

Make exporting of Exodus files optional

Currently, if ParaView (PV) is not installed on the system, NodeGossiper fails.

Deprecate VTK direct dependency

Switch to Paraview-embedded VTK library.
Protect Paraview import against non-availability.

Ensure that gossiping round never results in self-informing

This was not explicitly disallowed in the original paper, but we think self-informing is a waste of time.
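
A minimal sketch of how self-informing could be ruled out when a rank picks its gossip targets. The function name and parameters (`pick_gossip_targets`, `fanout`) are hypothetical and are not taken from the LBAF code base:

```python
import random

def pick_gossip_targets(sender, n_ranks, fanout, rng=random):
    """Pick up to `fanout` distinct ranks to inform, never the sender itself."""
    # Exclude the sender up front so that it can never be drawn.
    candidates = [r for r in range(n_ranks) if r != sender]
    # Sampling without replacement also avoids informing a rank twice.
    return rng.sample(candidates, min(fanout, len(candidates)))
```

Removing the sender from the candidate list before sampling avoids having to reject and redraw afterwards.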

Some class(es) have missing color preambles

At least lbsLoadReaderVT but maybe others too (to be verified).

Ensure compatibility of AVI generator with ParaView 9

It seems that there are some backward-compatibility problems.

Make sure NodeGossiper does not crash when ParaView is not found

We can safely assume that VTK proper remains a requirement, because the VTK graph visualization features we need are not part of the ParaView distributed by Kitware.
However, we can also assume that not everyone will have ParaView on their system, and in this case NodeGossiper should still run. It should just not generate the ParaView visualizations.
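
One possible way to protect the ParaView import is sketched below. The helper names (`optional_import`, `save_visualization`) are hypothetical; only the `paraview.simple` module name comes from ParaView itself:

```python
import importlib

def optional_import(module_name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

# paraview.simple is the module used by ParaView Python scripts.
pv = optional_import("paraview.simple")

def save_visualization(file_name):
    """Hypothetical output step: skipped, not fatal, without ParaView."""
    if pv is None:
        print("ParaView not found: skipping visualization of", file_name)
        return False
    # ParaView-specific rendering calls would go here.
    return True
```

With this pattern the rest of the run proceeds normally, and only the visualization step is skipped.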

Change factory methods to be static members of base classes

This will result in the removal of the lbsCriterion helper class and its replacement by a call to the factory method of the base class in lbsRuntime.
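
A rough sketch of the pattern, with the factory as a static method of the base class. The class names and the `is_satisfied` signature are illustrative assumptions, not the actual LBAF classes:

```python
class CriterionBase:
    """Base class that doubles as the factory for its subclasses."""

    @staticmethod
    def factory(name, *args, **kwargs):
        """Instantiate the direct subclass whose class name matches `name`."""
        for cls in CriterionBase.__subclasses__():
            if cls.__name__ == name:
                return cls(*args, **kwargs)
        raise ValueError("unknown criterion: {}".format(name))

    def is_satisfied(self, object_load, sender_load, receiver_load):
        raise NotImplementedError


class ExampleLoadCriterion(CriterionBase):
    """Hypothetical criterion: accept if the receiver stays below the sender."""

    def is_satisfied(self, object_load, sender_load, receiver_load):
        return receiver_load + object_load < sender_load
```

The runtime would then call `CriterionBase.factory("ExampleLoadCriterion")` directly, with no separate helper class.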

Extend VT load reader to include communications

Currently, this reader skips communication lines in VT traces.

Add modeling between LB stats input and strategy evaluation

The vt runtime has a layer of load modeling between the raw instrumented data about each object's workload and the values used in the LB strategy implementations.

We expect this to become more critical with the strongly disparate subphase structure of execution and the load imbalances therein in EMPIRE. The load models are being used for computing a scalar load value to feed into the strategies from the vector of per-subphase loads, and we want to be able to experiment with how this ought to be done.
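
As an illustration only, a load model can be thought of as a function that maps the vector of per-subphase loads to one scalar. The two models below ("sum" and "max") are assumptions chosen for the example, not the models implemented in vt:

```python
def scalar_load(subphase_loads, model="sum"):
    """Collapse a vector of per-subphase loads into one scalar load."""
    if not subphase_loads:
        return 0.0
    if model == "sum":
        # Total time spent by the object over all subphases.
        return float(sum(subphase_loads))
    if model == "max":
        # Load of the most expensive subphase only.
        return float(max(subphase_loads))
    raise ValueError("unknown load model: {}".format(model))
```

Keeping the model behind a single function makes it straightforward to swap in other reductions when experimenting.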

Implement LBS reader for VT traces

The goal of this issue is to add a reader to the LBS (in its IO directory) that can ingest VT traces and populate an initial lbsEpoch with them.

Current capability is limited to populating the initial lbsEpoch with pseudo-random sources of objects and processor assignments (uniform or log-normal).

Study ordering of objects traversed during transfer phase

Check why weights seem to not be working in current version

[Statistics] Descriptive statistics of communication weights:
	cardinality: 0  sum: nan  imbalance: nan
	minimum: nan  mean: nan  maximum: nan
	standard deviation: nan  variance: nan
	skewness: nan  kurtosis excess: nan

Write test benchmarks/mini-apps in VT for distributed LB

Compare to existing LBs (HierarchicalLB, GreedyLB) in VT

Define measure of persistence

The main goal of this issue is to determine when the "persistence" assumption (needed for statistically-based distributed LB) is satisfied, so that such LB can be performed efficiently.

Create ParaView python script to automatically generate PNG images of LB results

The goal of this issue is to create an automatic visualizer for NodeGossiper outputs producing visualizations similar to the images shown below:

Add support for subphases from vt applications

See DARMA-tasking/vt#708 for details of why this is desired, and the code that produces the stats files containing per-subphase timing data.

Rebase develop at commit 1427fc731e525c30ef69e06ba1df21bfa85192e3

This is to be done AFTER the subsequent commits have been moved to a WIP branch

Outputting zoomed in PNG

python ./src/Applications/NodeGossiper.py -o 128 -x 8 -y 8 -z 1 -t uniform,1.0,10.0 -w uniform,1.0,10.0 -k 5 -f 2 -i 3 -p 10 -c 2 -d 3 -e

diff --git a/src/Applications/AnimationViewer.py b/src/Applications/AnimationViewer.py
index ba0dd03..3860279 100644
--- a/src/Applications/AnimationViewer.py
+++ b/src/Applications/AnimationViewer.py
@@ -50,7 +50,7 @@ class AnimationViewer(ParaviewViewer):
         super(AnimationViewer, self).__init__(exodus, file_name, viewer_type)

     ###########################################################################
-    def saveView(self, reader):
+    def saveView(self, reader, view):
         """Save animation
         """

@@ -67,11 +67,14 @@ class AnimationViewer(ParaviewViewer):
             + "[AnimationViewer] "
             + bcolors.END
             + "###  Generating AVI animation...")
-        pv.WriteAnimation(self.file_name+".avi",
-                       Magnification=1,
-                       Quality = 2,
-                       FrameRate=1.0,
-                       Compression=True)
+        filename = "{}.avi".format(self.file_name)
+        pv.SaveAnimation(filename)
+        # pv.WriteAnimation(self.file_name+".avi",
+        #                Magnification=1,
+        #                Quality = 2,
+        #                viewOrLayout=view,
+        #                FrameRate=1.0,
+        #                Compression=True)

Update all local information on object migration

When an object is migrated from a sending processor to a receiving one, the sender should update all its information about known underloaded potential targets and their respective under-loads. This should drastically improve the picking of targets, especially when cached loads are used.
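
A minimal sketch of such an update on the sending side. It assumes that the sender keeps a dictionary from known underloaded ranks to their last known loads, and that a rank is dropped once it reaches the average load; both the data structure and the removal rule are assumptions made for this example:

```python
def update_known_targets(known_loads, target, object_load, average_load):
    """Update the sender's view of `target` after sending it an object.

    `known_loads` maps each known underloaded rank to its last known load.
    """
    # The target now carries the migrated object's load as well.
    known_loads[target] += object_load
    if known_loads[target] >= average_load:
        # The target is no longer underloaded: stop considering it.
        del known_loads[target]
    return known_loads
```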

Add option to save results elsewhere than in local execution directory

Ideally a new subdirectory would be created that would contain all outputs:

VT object maps (VOM files)
Exodus file
PNG images
AVI animation

Fix incorrect empirical CMF computation under new criterion

Current code may result in negative CMF values as underloaded ranks that become overloaded with the improved transfer criterion are not removed from the list of possible targets.

We may want to keep them however so they can still offer useful targets with criterion "6 prime". But in that case the CMF computation becomes incorrect.

Thanks @nlslatt for the catch!
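
To illustrate how negative CMF values can arise and be avoided, here is a hedged sketch. It assumes that the selection weight of a rank is 1 - load / average_load, which is negative for any rank above the average; this weighting is an assumption for the example and is not confirmed by the issue text. Filtering out such ranks before accumulating keeps the CMF non-negative and non-decreasing:

```python
def empirical_cmf(known_loads, average_load):
    """Return (ranks, cmf) over ranks whose known load is below average.

    Assumes the selection weight of a rank is 1 - load / average_load.
    """
    # Keep only ranks that are still underloaded.
    ranks = [r for r, load in sorted(known_loads.items()) if load < average_load]
    weights = [1.0 - known_loads[r] / average_load for r in ranks]
    total = sum(weights)
    cmf, running = [], 0.0
    for w in weights:
        # Normalize each weight and accumulate it.
        running += w / total
        cmf.append(running)
    return ranks, cmf
```

If formerly underloaded ranks must be kept as targets, the weights would need a different definition (for example, clamping at zero) rather than this filter.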

Compare relative effects of number of rounds vs number of iterations

Time to solution is almost identical in both cases (~52s) and iteration indices were renormalized on this basis. Ceteris paribus, 10 LB iterations with 2 gossip rounds each yield a better outcome than 2 LB iterations with 10 gossip rounds each.

Port LBAF to Python3

Reference: https://docs.python.org/3/howto/pyporting.html

Modify MoveCountsViewer to take parameters as command line arguments

The goal of this issue is to replace all hard-coded parameter settings in this utility, such as:

# Number of processors
n_p = 8

file_name = "NodeGossiper-n8-lstats-i5-k4-f4-t1_0.0.{}.vom".format(i)

with command-line arguments (e.g., -i <input-VOM-prefix> -p <number-of-processors>).
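
A possible shape for this, using the standard `argparse` module. The flag letters follow the suggestion above; the destination names and the assumption that `i` ranges over the processors are illustrative:

```python
import argparse

def parse_arguments(argv=None):
    """Parse the two options suggested in the issue text."""
    parser = argparse.ArgumentParser(description="MoveCountsViewer options")
    parser.add_argument("-i", dest="input_prefix", required=True,
                        help="prefix of the input VOM files")
    parser.add_argument("-p", dest="n_processors", type=int, required=True,
                        help="number of processors")
    return parser.parse_args(argv)

def vom_file_names(args):
    """Build one file name per processor (assumed to be the index used)."""
    return ["{}.{}.vom".format(args.input_prefix, i)
            for i in range(args.n_processors)]
```

For instance, `parse_arguments(["-i", "run", "-p", "2"])` followed by `vom_file_names(args)` gives `run.0.vom` and `run.1.vom`.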

Ensure that NodeGossiper does not fail when no edges are present in real data

Currently, the absence of edges causes NodeGossiper to fail.

Create a new command-line flag to specify the suffix/extension of vt trace files

Currently we only let the user specify the prefix, and we assume that the extension is always ".vom". However vt outputs ".out" stats, which forces us to do file manipulation prior to running LBAF on those.

To automate the process further, LBAF should therefore understand one additional, optional flag such as -e <extension>, defaulting to '' (because there could also be no extension at all). Note that in this setting we would need to pass, e.g., -e ".out" and not just -e out.

NB: this means that the current -e (for Exodus outputs) must be changed to something else: I suggest -m (for Mesh outputs).
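
A small sketch of the proposed behavior, assuming file names of the form `<prefix>.<rank><extension>`. The helper name `trace_file_name` and the destination names are hypothetical:

```python
import argparse

def build_parser():
    """Parser with a prefix flag and the proposed optional extension flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-l", dest="prefix",
                        help="prefix of the vt trace files")
    parser.add_argument("-e", dest="extension", default="",
                        help='full suffix including the dot, e.g. ".out"')
    return parser

def trace_file_name(prefix, rank, extension=""):
    """Assemble the per-rank file name; the extension is appended verbatim."""
    return "{}.{}{}".format(prefix, rank, extension)
```

Because the extension is appended verbatim, passing `.out` yields `stats.0.out`, whereas passing `out` would yield `stats.0out`, which is why the leading dot must be part of the argument.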

Rewrite lost `StrictLocalizingCriterion` script

The StrictLocalizingCriterion script has been lost, probably because it was not staged during a commit.

Support extended VOM format

Example in dev210112TS-gossiptrials-100n4-gossip-full-stats

Add communication graph to data structures

Setup CI using GitHub Actions for analysis framework

We should set up some docker files to build containers for testing. Then we can launch those containers in GitHub Actions.

There are several levels of testing to accomplish:

Obtain all deps (VTK, etc.) for Python scripts
Run LB simulator with inputs and check it runs to end without assertion failures
Verify simulator correctness with input decks evaluating the quality of distributions produced by LB

Add new data to experiment and compare LBAF vs native VT LB

This is a generic issue to be used when pushing new experimental data

Move all subphase developments from Braden to WIP branch

All commits made onto develop AFTER commit 1427fc7 (last commit from @ppebay to be included) must be moved to a new WIP branch called "subphase-support" for later inspection and revisions.

Use "overloaded viewers" to bias migration candidates sampling

The goal of this issue is to use the computation, previously implemented in #50, of the "viewers" of underloaded processors from overloaded ones to better steer the picking of candidates for object migration (currently based only on under-load values).

Fix import bug

Steps to reproduce:

$ python NodeGossiper.py -l ../../data/dev210112TS4-gossiptrials-printstatslboff-100n4-gossip-full-stats-0/stats  -x 20 -y 20 -z 1 -s 1 -k 2 -f 400 -i 8 -c 1

Traceback (most recent call last):
  File "NodeGossiper.py", line 55, in <module>
    from src.Model.lbsPhase import Phase
ModuleNotFoundError: No module named 'src'

Odd NodeGossiper behavior when VTK missing

When trying to run NodeGossiper without VTK, the following error message is displayed:

*  ERROR: Could not write to ExodusII file by lack of VTK

However, execution still continues, which makes LBAF crash:

Traceback (most recent call last):
  File "NodeGossiper.py", line 580, in <module>
    params.verbose)
  File "C:\dev\git\LBAF\src\IO\lbsWriterExodusII.py", line 118, in write
    n_p = len(self.phase.processors)
AttributeError: WriterExodusII instance has no attribute 'phase'

Such a case should be handled properly, without crashing.

Write docker-compose for interactive running

Make extension of VOM files a command line option

By default it should be .vom

Ensure complete compatibility with Python 3

Also clean up legacy Python 2 imports etc.

Implement "Work" in objective function

This is to integrate in the same concept both notions of Load and Communications and provide a unified framework.

@lifflander
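
Purely as an illustration of how load and communications could be folded into one quantity, the sketch below uses an affine combination. The formula and the coefficient names `alpha` and `beta` are assumptions for this example; the issue does not specify how "Work" is defined:

```python
def work(load, communication_volume, alpha=1.0, beta=0.0):
    """Hypothetical work: weighted sum of load and communication volume."""
    # With beta = 0 this reduces to a load-only objective.
    return alpha * load + beta * communication_volume
```

With the default coefficients, the objective function is unchanged from a load-only one, which would allow comparing results with and without communications.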

Create option to cache aggregated per-processor load

This is to allow for a better simulation of asynchronous LB, where aggregated loads are not updated in real-time during the migration phase.
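
A minimal sketch of what such a cache might look like. The `Processor` class and its method names are hypothetical and do not reflect the actual LBAF model classes:

```python
class Processor:
    """Hypothetical processor whose aggregated load can be cached."""

    def __init__(self, object_loads):
        self.object_loads = list(object_loads)
        self.cached_load = sum(self.object_loads)

    def add_object(self, load):
        # The true load changes immediately; the cached value does not.
        self.object_loads.append(load)

    def get_load(self, use_cache=False):
        # Return either the stale cached value or the up-to-date sum.
        return self.cached_load if use_cache else sum(self.object_loads)

    def refresh_cache(self):
        # Intended to be called once the migration phase is over.
        self.cached_load = sum(self.object_loads)
```

During the migration phase, other ranks would read the load with `use_cache=True` and therefore see the value from before any transfer, as they would in an asynchronous setting.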

Use colors to make standard output more legible

All line preambles (contained between square brackets, e.g. [lbsStatistics]) should appear in a color different than that of the subsequent text.
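
One way to do this is with ANSI escape sequences, sketched below. The `bcolors` class here is a stand-in written for this example; the particular color chosen and the helper name `with_preamble` are assumptions:

```python
class bcolors:
    """ANSI escape sequences understood by most terminals."""
    HEADER = "\033[95m"  # bright magenta
    END = "\033[0m"      # reset to the default color

def with_preamble(class_name, message, color=bcolors.HEADER):
    """Return `message` preceded by a colored [class_name] preamble."""
    return "{}[{}] {}{}".format(color, class_name, bcolors.END, message)

print(with_preamble("lbsStatistics", "computing descriptive statistics"))
```

Only the bracketed preamble is colored; the reset sequence restores the default color before the message text.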

Implement baseline gossip load balancer in VT

Add this as a new strategy to the LB suite in VT.

Write efficient comm-aware criterion and hybrid load/comm optimizer

The goal of this issue is two-fold:
(1) replace the naive, first implementation of a communication-only criterion (StrictLocalizer) with one that allows the transfer of locally-communicating objects iff this results in better locality on the target processor;
(2) extend the main optimizer loop logic to take into account communication costs (and not only loads).

@lifflander @nlslatt