usc-cloud / goffish

USC GoFFish Graph Analytics Framework


goffish's Introduction

Overview

Sensors and online instruments performing high-fidelity observations contribute significantly to the growing big data analytics challenge. These datasets are unique in that they represent events, observations and activities that are related to each other while being recorded by independent data streams. GoFFish (Graph-Oriented Framework for Foresight and Insight using Scalable Heuristics) is a scalable graph-oriented analytics framework well suited for processing reservoirs of interconnected distributed data fed by event data generators. It minimizes communication overhead by grouping tightly bound data together.

A printable executive summary can be found here

Objectives

Efficiently store interconnected data by using a specialized graph-oriented file system:

Support for widely used graph formats such as GML
Take advantage of the various graph information and layout to enable efficient data loading
Facilitate storing temporal data to enable analytics on evolving graphs
Enable distributed information storage in-line with current trends observed in cloud systems

Efficiently compose graph analytics on large datasets by grouping data together:

Adopt a sub-graph centric approach which performs more computations locally and reduces communication overhead
Offer high level programming constructs at sub-graph level that hide low level graph details from the developer
Customizable output format that can be mapped to any format needed by visualization tools
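
To make the sub-graph centric idea concrete, here is a minimal Java sketch. It is not GoFFish code: the class and method names are hypothetical, and it assumes an undirected edge list plus a precomputed vertex-to-partition assignment. Edges whose endpoints share a partition are merged locally with union-find, and only the remaining remote edges would require messages between hosts.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: none of these names come from the GoFFish API. */
final class SubgraphCentricSketch {

    /** Returns {0, 1, ..., n-1}: every vertex starts as its own component. */
    static int[] identity(int n) {
        int[] parent = new int[n];
        for (int v = 0; v < n; v++) parent[v] = v;
        return parent;
    }

    /** Union-find "find" with path halving. */
    static int find(int[] parent, int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }

    /**
     * Merges vertices joined by local edges (both endpoints in the same
     * partition) and returns the number of remote edges, the only ones
     * that would need communication.
     */
    static int mergeLocalEdges(int[][] edges, int[] partitionOf, int[] parent) {
        int remoteEdges = 0;
        for (int[] e : edges) {
            if (partitionOf[e[0]] == partitionOf[e[1]]) {
                parent[find(parent, e[0])] = find(parent, e[1]);
            } else {
                remoteEdges++;
            }
        }
        return remoteEdges;
    }

    /** Counts distinct roots, i.e. the number of sub-graphs over all partitions. */
    static int countSubgraphs(int[] parent) {
        Set<Integer> roots = new HashSet<>();
        for (int v = 0; v < parent.length; v++) roots.add(find(parent, v));
        return roots.size();
    }

    public static void main(String[] args) {
        // A six-vertex path split into two partitions of three vertices each.
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}};
        int[] partitionOf = {0, 0, 0, 1, 1, 1};
        int[] parent = identity(6);
        int remote = mergeLocalEdges(edges, partitionOf, parent);
        System.out.println("sub-graphs: " + countSubgraphs(parent)
                + ", remote edges: " + remote);
    }
}
```

In this toy input, four of the five edges are handled locally, which is the effect the sub-graph centric approach aims for on well-partitioned graphs.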

Enable fast execution for certain classes of analytics and graphs:

Performance improvement for analytics that require sub-graph knowledge, e.g., community analytics, high impact nodes, shortest paths
Suited for graphs that can be partitioned evenly across processors with minimal communication between them, e.g., sparse graphs (road networks), Internet/network graphs

Benefits

GoFFish offers several key benefits to developers…

Conversion pipeline from widely used graph formats such as GML
Specialized graph-oriented file system to easily store interconnected evolving data
High level API for composing graph analytics on static and evolving graphs
Easily customizable output for interfacing with various visualization tools

…and to analysts:

Support for multiple graphs and analytics
Drive analytics on evolving graphs
Ability to incorporate new analytics

Specifications

GoFFish is designed as a layered architecture with two main components: the GoFS graph file system and the Gopher analytics abstraction on top of it.

GoFS

A conversion pipeline from the GraphML format to the GoFS file format allows easy conversion of any graph. The conversion process relies on a graph partitioning stage at which point the graph is split into sub-graphs to balance its subsequent execution on distributed cloud infrastructures.

GoFS is very versatile with respect to the graphs it can integrate. Numerous graphs ranging from road networks to social networks have been successfully converted.

Gopher

GoFFish uses a high level API to intuitively and rapidly compose graph and event analytical models. The composed application enhances data parallel analytics beyond the traditional MapReduce models using a novel distributed data partitioning approach based on edge distance heuristics. This allows unprecedented insight from the reservoirs of evolving data for commanders to perform causal graph analysis and strategic planning.

Measures of effectiveness

We evaluated our platform on a configuration representative of both clouds and commodity clusters. The system comprised a cluster of 12 nodes, each with an 8-core Intel Xeon CPU, 16 GB RAM and a 1 TB SATA HDD, connected by Gigabit Ethernet. We compared the platform against its main competitor, Apache Giraph. Both systems were installed using the Java 7 JRE for 64-bit Ubuntu Linux. The datasets consisted of three real-world graphs: the California road network (1.6M nodes, 2.7M edges), a network trace route (20M nodes, 23M edges), and the LiveJournal social network (5M nodes, 65M edges). Different graph analytics such as connected components, shortest path and PageRank were run on them and their speed-ups were measured. An average improvement of 10x was observed.

Required Skill Sets

To use the framework on a given data set:
Required
- Familiarity with Linux
- Manipulating graph data (potentially to convert the given data to GML format)
- Java programming skills
Good to have
- Familiarity with Virtual Box and virtualization to quickly deploy using the quick guide
- Experience with the BSP programming model
To set up the environment on a cluster:
- Cluster administration knowledge
- Linux cluster administration skills

How to get it

The framework is hosted in the GitHub repository at https://github.com/usc-cloud/goffish

Clone the repository using:

git clone https://github.com/usc-cloud/goffish.git

Note: You may need to install a git client to download the repository.

Current source code is located in the goffish/goffish-trunk directory

Installation

A quick start guide can be found here together with a precompiled VM to help you get started.

GoFFish consists of a Graph File System (GoFS) which can be used as a standalone product. Detailed deployment documentation can be found here

Gopher is the distributed subgraph-centric programming framework of GoFFish. An overview of the Gopher programming API and an example can be found here. Deployment details for Gopher can be found [here](goffish-trunk/gopher/docs/GopherdeploymentGuide.pdf).

User and Development Discussions

All GoFFish development and user discussions have moved to an open forum. If you have any questions regarding GoFFish or want to contribute, please join our discussion forum [here](http://groups.google.com/d/forum/usc-cloud).

Future enhancements

Numerous enhancements are in progress, most of them related to detecting online events that occur in evolving streaming graphs. The loop between insight and foresight will be closed by coupling event patterns mined from historical stream reservoirs with graph analytics based on real-time event streams from sensors.

goffish's Issues

NameNode implementation

Define what INameNode responsibilities actually are, and implement them. Convert METIS format to GML after partitioning. Send request to partition daemon to pull GML partition files from name node. Create hash for vertices with remote edges.

print partition utility

User & Design Documentation - Gopher

Work with Alok.

Applications on XData Datasets - Gopher

Work with Nam and Charith
Perf Comparison of 1-3 external applications, e.g. Vector Corr, Community Detection?

Lookup local remote node subgraphs in reasonable time

Resolve dependencies between writeTemplate and writeInstances

GML files should be handled in a streaming fashion whenever possible to reduce memory requirements

Serialization Implementations (Protobuf)

Allow IInstanceSerializablePartition to group instances for serialization

We should push the problem of the slice manager trying to group horizontally sliced instances (from GML, etc) and vertically slice them into each property to IInstanceSerializablePartition. Allow IInstanceSerializablePartition to return groups of instances at a time, rather than just one instance, so that the partition implementation is responsible for returning instances that A) can be grouped together B) will all fit in memory together.
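
One way to picture "groups of instances that fit in memory together" is a lazy iterator that batches items against a size budget. The sketch below is an assumption-laden illustration, not the IInstanceSerializablePartition interface: the method name `groups`, the `sizeOf` estimator and the byte budget are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.ToLongFunction;

/** Hypothetical helper; not part of the GoFS API. */
final class InstanceGroupingSketch {

    /**
     * Lazily batches items so that each group's estimated size stays within
     * budgetBytes. An item larger than the budget forms a group of its own.
     */
    static <T> Iterator<List<T>> groups(Iterator<T> source,
                                        ToLongFunction<T> sizeOf,
                                        long budgetBytes) {
        return new Iterator<List<T>>() {
            private T pending;          // read from source but not yet emitted
            private boolean hasPending;

            @Override
            public boolean hasNext() {
                return hasPending || source.hasNext();
            }

            @Override
            public List<T> next() {
                if (!hasNext()) throw new NoSuchElementException();
                List<T> group = new ArrayList<>();
                long used = 0;
                while (hasPending || source.hasNext()) {
                    T item = hasPending ? pending : source.next();
                    hasPending = false;
                    long size = sizeOf.applyAsLong(item);
                    if (!group.isEmpty() && used + size > budgetBytes) {
                        pending = item;     // keep it for the next group
                        hasPending = true;
                        break;
                    }
                    group.add(item);
                    used += size;
                }
                return group;
            }
        };
    }

    public static void main(String[] args) {
        // Integers stand in for instances; each value is its own "size".
        Iterator<List<Integer>> it =
                groups(Arrays.asList(4, 4, 4, 10, 1).iterator(), Integer::longValue, 8);
        while (it.hasNext()) System.out.println(it.next());
    }
}
```

Only one group is materialized at a time, so the caller controls peak memory through the budget rather than through the total number of instances.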

Rewrite GML parser

The GML parser ignores newlines and is prone to deep stack exceptions on malformed GML files. A rewrite of the parser is necessary. It may be worth looking into parser generators, but this is likely overkill for now.
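
As a rough illustration of how a rewrite could avoid deep stack exceptions, the sketch below parses GML key/value lists with an explicit stack instead of recursion. It is not the GoFS parser: the class name is hypothetical, values are kept as raw strings (quoted strings retain their quotes), and comments and the GoFS-specific GML extensions are not handled.

```java
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the GoFS GML parser. */
final class GmlSketch {

    /** Splits GML text into '[', ']', quoted strings and bare words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                    // newlines are just whitespace
            } else if (c == '[' || c == ']') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '"') {
                int end = text.indexOf('"', i + 1);
                if (end < 0) throw new IllegalArgumentException("unterminated string at " + i);
                tokens.add(text.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))
                        && text.charAt(i) != '[' && text.charAt(i) != ']') i++;
                tokens.add(text.substring(start, i));
            }
        }
        return tokens;
    }

    /** Parses into nested key/value lists; nesting depth is bounded by heap, not call stack. */
    static List<Map.Entry<String, Object>> parse(String text) {
        List<Map.Entry<String, Object>> root = new ArrayList<>();
        Deque<List<Map.Entry<String, Object>>> open = new ArrayDeque<>();
        open.push(root);
        Iterator<String> it = tokenize(text).iterator();
        while (it.hasNext()) {
            String key = it.next();
            if (key.equals("]")) {
                if (open.size() == 1) throw new IllegalArgumentException("unbalanced ']'");
                open.pop();
                continue;
            }
            if (key.equals("[") || !it.hasNext()) {
                throw new IllegalArgumentException("malformed entry near '" + key + "'");
            }
            String value = it.next();
            if (value.equals("]")) {
                throw new IllegalArgumentException("key '" + key + "' has no value");
            }
            if (value.equals("[")) {
                List<Map.Entry<String, Object>> child = new ArrayList<>();
                open.peek().add(new AbstractMap.SimpleEntry<String, Object>(key, child));
                open.push(child);
            } else {
                open.peek().add(new AbstractMap.SimpleEntry<String, Object>(key, value));
            }
        }
        if (open.size() != 1) throw new IllegalArgumentException("missing ']'");
        return root;
    }

    public static void main(String[] args) {
        String gml = "graph [\n node [ id 1 label \"a b\" ]\n node [ id 2 ]\n"
                + " edge [ source 1 target 2 ]\n]";
        List<Map.Entry<String, Object>> top = parse(gml);
        List<?> body = (List<?>) top.get(0).getValue();
        System.out.println(top.get(0).getKey() + " has " + body.size() + " entries");
    }
}
```

Malformed input such as a missing closing bracket is rejected with an exception rather than exhausting the call stack. A hand-written loop like this is one alternative to the parser generators mentioned above.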

Gopher end-to-end with GoFS

User and Design Documentation - GoFS

Work with Charith, Alok.
Javadoc API, architecture, decision rationale. Docs on tools, command line, deployment, prereqs, quick start.

Document GML extensions

Document our extensions to GML, and how we parse our custom GML files.

distributed stream partitioning

Gopher v0.9 Build & Packaging

GoFS v0.1 Build & Packaging

Integration test for instance data roundtrip

We need an integration test for instance roundtripping to/from disk.
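
The shape of such a test might look like the following sketch. It makes several assumptions: the `Instance` class is a hypothetical stand-in for a subgraph instance, and plain Java serialization to a temporary file stands in for the real GoFS slice format and serializers, which a real test would exercise instead.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical round-trip check; does not use the GoFS slice manager. */
final class RoundTripSketch {

    /** Stand-in for a subgraph instance: a timestamp plus per-vertex values. */
    static final class Instance implements Serializable {
        private static final long serialVersionUID = 1L;
        final long timestamp;
        final TreeMap<Long, String> vertexProperties;

        Instance(long timestamp, Map<Long, String> vertexProperties) {
            this.timestamp = timestamp;
            this.vertexProperties = new TreeMap<>(vertexProperties);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Instance)) return false;
            Instance other = (Instance) o;
            return timestamp == other.timestamp
                    && vertexProperties.equals(other.vertexProperties);
        }

        @Override
        public int hashCode() {
            return 31 * Long.hashCode(timestamp) + vertexProperties.hashCode();
        }
    }

    /** Writes the instance to a temporary file, reads it back, and returns the copy. */
    static Instance roundTrip(Instance original) {
        Path file = null;
        try {
            file = Files.createTempFile("instance", ".bin");
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                return (Instance) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        } finally {
            if (file != null) file.toFile().delete();   // best-effort cleanup
        }
    }

    public static void main(String[] args) {
        Map<Long, String> props = new HashMap<>();
        props.put(1L, "plate=ABC123");
        props.put(2L, "plate=XYZ789");
        Instance original = new Instance(1000L, props);
        System.out.println("round trip equal: " + roundTrip(original).equals(original));
    }
}
```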

Unit Tests - GoFS

Run GoFS on XDATA Cluster and upload Datasets

Follow up with Sotera about the process to integrate GoFS with the XDATA cluster.

Jython Library to GoFS

To close, finish:

documentation - Jython integration (dependencies, compile etc.),
Sample code (including Python wrappers for GMLPartition and Slice manager)

Hookup Stream Partitioner

Hsuan-Yi to help with this

Review subgraph APIs (especially read instances)

Review timeseries subgraph APIs. Provide wrappers (utility functions) to allow access to vertices and edges for individual instances. Work with Charith.

Multiple Graphs on the systems

Graph namespace
logical partition name
host name for partition
daemon port number

Sample application and benchmarks using Giraph

GoFS v0.9 Build & Packaging

Kryo deserialization from java.lang.Integer to primitive type int workaround

https://code.google.com/p/kryo/issues/detail?id=113

Write DIMACS parser for partitions

Support DIMACS graph format as well as GML.
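
For reference, a line-oriented reader for DIMACS files could be sketched as follows. The class and field names are hypothetical and this is not GoFS code. It accepts the common line types: `c` comments, a `p` problem line giving vertex and edge counts, `a u v w` weighted arcs (shortest-path files) and `e u v` edges (edge-list files). Vertex ids in DIMACS files are 1-based, and the sketch keeps them as given.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of a streaming DIMACS reader; not part of GoFS. */
final class DimacsSketch {
    int vertexCount = -1;
    int declaredEdgeCount = -1;
    /** Each entry is {source, target, weight}; weight is 1 for "e" lines. */
    final List<int[]> edges = new ArrayList<>();

    /** Consumes lines one at a time, so the whole file never has to be in memory as text. */
    static DimacsSketch parse(Iterator<String> lines) {
        DimacsSketch g = new DimacsSketch();
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.isEmpty()) continue;
            String[] f = line.split("\\s+");
            switch (f[0]) {
                case "c":                               // comment
                    break;
                case "p":                               // p <kind> <vertices> <edges>
                    requireFields(f, 4, line);
                    g.vertexCount = Integer.parseInt(f[2]);
                    g.declaredEdgeCount = Integer.parseInt(f[3]);
                    break;
                case "a":                               // a <source> <target> <weight>
                    requireFields(f, 4, line);
                    g.edges.add(new int[] {Integer.parseInt(f[1]),
                            Integer.parseInt(f[2]), Integer.parseInt(f[3])});
                    break;
                case "e":                               // e <source> <target>
                    requireFields(f, 3, line);
                    g.edges.add(new int[] {Integer.parseInt(f[1]),
                            Integer.parseInt(f[2]), 1});
                    break;
                default:
                    throw new IllegalArgumentException("unrecognized line: " + line);
            }
        }
        if (g.vertexCount < 0) throw new IllegalArgumentException("missing 'p' problem line");
        return g;
    }

    private static void requireFields(String[] f, int expected, String line) {
        if (f.length != expected) {
            throw new IllegalArgumentException("expected " + expected + " fields: " + line);
        }
    }

    public static void main(String[] args) {
        String text = "c tiny example\np sp 3 2\na 1 2 7\na 2 3 4\n";
        DimacsSketch g = parse(new BufferedReader(new StringReader(text)).lines().iterator());
        System.out.println(g.vertexCount + " vertices, " + g.edges.size() + " edges");
    }
}
```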

GML list support

Sample Applications - Gopher

Start with vertex centric timeseries. Later, subgraph centric timeseries. E.g. License plate detection.

RESTful name node server

HDFS as storage layer

Separate GoFS user api from GoFS implementation

Santosh to take a stab at this. Ability to compile gopher apps using just gofs api jar.

Assign real serialization ids to Java types.

NPE when Iterating over instances

Sample code
I'm getting this at subIt.hasNext():

    Iterable instances = sliceManager.readInstances(subgraph,
            endTime - 5 * 60 * 1000, endTime, nodeProperties, edgeProperties);

    Iterator<ISubgraphInstance> subIt = instances.iterator();
    if (subIt.hasNext()) {
        -----------------
    }

Trace

java.lang.NullPointerException
at edu.usc.pgroup.goffish.gofs.slice.FileStorageManager.translateUUIDToFile(FileStorageManager.java:39)
at edu.usc.pgroup.goffish.gofs.slice.FileStorageManager.getReadStream(FileStorageManager.java:23)
at edu.usc.pgroup.goffish.gofs.slice.SliceManager.readPropertyInstancesSlice(SliceManager.java:418)
at edu.usc.pgroup.goffish.gofs.slice.InstanceIterator.advanceToNext(InstanceIterator.java:69)
at edu.usc.pgroup.goffish.gofs.slice.InstanceIterator.advanceToNext(InstanceIterator.java:13)
at edu.usc.pgroup.goffish.gofs.util.AbstractWrapperIterator.hasNext(AbstractWrapperIterator.java:34)
at edu.usc.pgroup.goffish.gopher.sample.CarTracer.compute(CarTracer.java:81)
at edu.usc.pgroup.floe.applications.gopher.BSPProcessorPellet$GraphTaskRunner.run(BSPProcessorPellet.java:352)
at java.util.concurrent.ForkJoinTask$AdaptedRunnable.exec(ForkJoinTask.java:1265)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:334)
at java.util.concurrent.ForkJoinWorkerThread.execTask(ForkJoinWorkerThread.java:604)
at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:784)
at java.util.concurrent.ForkJoinPool.work(ForkJoinPool.java:646)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:398)

Perf Benchmarks

Different graph sizes, # of partitions, kryo/protobuf, read/write. preference to XDATA graph dataset. Run on Tsangpo.

Unit Tests - Gopher

GoFS/GML integration with Gephi Viz

GoFS v0.99 Build & Packaging

Gopher APIs

Vertex centric
subgraph centric
Time series

GML to JSON (2 instances)

Figure out GitHub Issue tracking!

Communication Layer and Distributed GoFS

Add capability to look up subgraph id from a remote vertex

Currently we can't look up the subgraph id from a vertex. As a result, Gopher currently routes messages between sub-graphs using the partition id and the remote vertex id.

Ideally, subgraph-centric programming abstractions should provide the capability for direct communication between subgraphs.

Doing this lookup at the Gopher level will be costly, so we need some mapping that allows the lookup to be done from the local node.
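
One possible shape for such a mapping is a per-partition in-memory index from remote vertex id to its owning partition and subgraph, filled in when the partition is loaded. The sketch below is purely illustrative: the class, its methods and the id types are assumptions, not existing GoFS or Gopher API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical local index; not part of GoFS or Gopher. */
final class RemoteVertexIndexSketch {

    /** Where a remote vertex lives. Both id types are assumptions. */
    static final class Location {
        final int partitionId;
        final long subgraphId;

        Location(int partitionId, long subgraphId) {
            this.partitionId = partitionId;
            this.subgraphId = subgraphId;
        }
    }

    private final Map<Long, Location> locations = new HashMap<>();

    /** Records the owner of a remote vertex, e.g. while loading a partition. */
    void register(long remoteVertexId, int partitionId, long subgraphId) {
        locations.put(remoteVertexId, new Location(partitionId, subgraphId));
    }

    /** Constant-time local lookup; no remote call is needed at message time. */
    Location lookup(long remoteVertexId) {
        Location location = locations.get(remoteVertexId);
        if (location == null) {
            throw new NoSuchElementException("unknown remote vertex " + remoteVertexId);
        }
        return location;
    }

    public static void main(String[] args) {
        RemoteVertexIndexSketch index = new RemoteVertexIndexSketch();
        index.register(42L, 3, 7L);
        Location target = index.lookup(42L);
        System.out.println("send to partition " + target.partitionId
                + ", subgraph " + target.subgraphId);
    }
}
```

The index only needs entries for vertices reachable over remote edges, so its size grows with the number of cut edges rather than with the whole graph.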