sparkling-graph / sparkling-graph

SparklingGraph provides an easy-to-use set of features for processing large-scale graphs using Spark and GraphX.

Home Page: http://sparkling.ml

License: BSD 2-Clause "Simplified" License

Scala 100.00%

graph measure spark machine-learning comunity-detection-methods network-analysis graph-algorithms big-data vertex link-predication

sparkling-graph's Introduction

sparkling-graph

SparklingGraph provides an easy-to-use set of features for processing large-scale graphs using Spark and GraphX.

Requirements

Scala 2.11 or 2.12
Spark 2.4.0 (or compatible)

Versioning

Since commit 3246714, the project uses git-based versioning (for example 0.0.7+140-32467140 or 0.0.7+140-32467140+20190402-2057-SNAPSHOT). From now on, all artifacts are published to the snapshot repository without version overriding. This approach also makes each version reproducible. Release versions use the normal tag-based approach.
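The two example strings above can be taken apart with a small parser. This is a minimal sketch, not part of the project's build: the object name `GitVersion` and the field names are hypothetical, and the reading of the components (base tag, number of commits since the tag, abbreviated commit hash, optional timestamp) is an assumption inferred from the examples only.

```scala
object GitVersion {
  // Field meanings are assumptions inferred from the two example strings.
  final case class Parsed(
      baseTag: String,
      commitsSinceTag: Int,
      commitHash: String,
      timestamp: Option[String],
      snapshot: Boolean)

  // e.g. 0.0.7+140-32467140 or 0.0.7+140-32467140+20190402-2057-SNAPSHOT
  private val Pattern =
    """(\d+\.\d+\.\d+)\+(\d+)-([0-9a-f]+)(?:\+(\d{8}-\d{4}))?(-SNAPSHOT)?""".r

  /** Returns None for strings that do not follow the git-based scheme (e.g. a plain release "0.0.7"). */
  def parse(version: String): Option[Parsed] = version match {
    case Pattern(base, distance, hash, ts, snap) =>
      // Unmatched optional groups are bound to null, hence Option(...) and the null check.
      Some(Parsed(base, distance.toInt, hash, Option(ts), snap != null))
    case _ => None
  }
}
```

For instance, `GitVersion.parse("0.0.7+140-32467140")` yields a value whose `commitsSinceTag` is 140, while a plain release version such as `"0.0.7"` yields `None`.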

Dependencies

Since commit 3246714, you can get artifacts for any master branch commit using the git describe command.

Snapshot

resolvers +=  "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

// one or all from:
libraryDependencies += "ml.sparkling" %% "sparkling-graph-examples" % "0.0.8-SNAPSHOT"
libraryDependencies += "ml.sparkling" %% "sparkling-graph-loaders" % "0.0.8-SNAPSHOT"
libraryDependencies += "ml.sparkling" %% "sparkling-graph-operators" % "0.0.8-SNAPSHOT"

Release

// one or all from:
libraryDependencies += "ml.sparkling" %% "sparkling-graph-examples" % "0.0.7"
libraryDependencies += "ml.sparkling" %% "sparkling-graph-loaders" % "0.0.7"
libraryDependencies += "ml.sparkling" %% "sparkling-graph-operators" % "0.0.7"

Current features

Loading
- Formats:
  - CSV
  - GraphML
- DSL
Measures - measures can be configured to treat graphs as directed or undirected
- Measures DSL - an easy-to-use domain-specific language that boosts productivity when using the library
- Graph
  - Modularity
  - Freeman's network centrality
- Vertex
  - Closeness
  - Local clustering
  - Eigenvector
  - Hits
  - Neighbor connectivity
  - Vertex embeddedness
  - Betweenness
    - Edmonds
    - Flow
    - Hua
- Edges
  - Adamic/Adar
  - Common neighbours
Community detection methods
- PSCAN (SCAN)
Graph coarsening
- Label Propagation based
Link prediction
- Similarity measure based
Generators
- Ring
- Watts and Strogatz
Experiments
- Describe graph using all measures to CSV files

Planned features

Loading
- GML
Measures
- Katz
Community detection methods
- Modularity maximization
- Infomap
More Generators
API
- Random walk
- BFS
ML
- Vertex classification

How to

Please check API, examples or docs

Citation

If you use SparklingGraph in your research and publish it, please consider citing us; it will help us get funding for making the library better. The manuscript is currently in preparation, so please use the following reference:

Bartusiak et al. (2017). SparklingGraph: large scale, distributed graph processing made easy. Manuscript in preparation.

@unpublished{sparkling-graph,
  title  = {SparklingGraph: large scale, distributed graph processing made easy},
  author = {Bartusiak, R. and Kajdanowicz, T.},
  note   = {Manuscript in preparation},
  year   = {2017}
}

sparkling-graph's Issues

localClustering and eigenvectorCentrality Issues

// Local Clustering Score
val clusteringGraph: Graph[Double, Int] = g.localClustering(VertexMeasureConfiguration(treatAsUndirected=true))
val localCentralityRDD: VertexRDD[Double] = clusteringGraph.vertices

// Eigen Vector Score
val eigenvectorRDD: VertexRDD[Double] = g.eigenvectorCentrality(VertexMeasureConfiguration(treatAsUndirected=true)).vertices

I am trying to calculate these two measures for a graph, but the compiler reports the following error:
value eigenvectorCentrality is not a member of org.apache.spark.graphx.Graph[Int,Int]
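An error of the form "value X is not a member of Graph[...]" is what the Scala compiler reports when a method is supplied through an implicit class and the object holding that class has not been imported. The sketch below reproduces the mechanism in plain Scala. `MiniGraph`, `MeasuresDSL` and `degrees` are hypothetical stand-ins, not sparkling-graph or GraphX types. For the library itself, the missing import is presumably its operators DSL (something like `ml.sparkling.graph.operators.OperatorsDSL._`); that path is an assumption and should be checked against the project docs.

```scala
object ImplicitScopeDemo {
  // Hypothetical stand-in for a library graph type; this is not GraphX's Graph.
  final case class MiniGraph(edges: Seq[(Int, Int)])

  // Hypothetical DSL object: the extra methods live here, not on MiniGraph itself.
  object MeasuresDSL {
    implicit class MeasureOps(g: MiniGraph) {
      /** Degree of every vertex, treating each edge as undirected. */
      def degrees: Map[Int, Int] =
        g.edges
          .flatMap { case (src, dst) => Seq(src, dst) }
          .groupBy(identity)
          .map { case (vertex, occurrences) => vertex -> occurrences.size }
    }
  }

  def degreesWithImport(g: MiniGraph): Map[Int, Int] = {
    // Without this import the next line fails with
    // "value degrees is not a member of ImplicitScopeDemo.MiniGraph".
    import MeasuresDSL._
    g.degrees
  }
}
```

The method only becomes callable once the implicit class is in scope at the call site, so the first thing to check is the import list of the file that calls `localClustering` or `eigenvectorCentrality`.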

please implement scan

https://gist.github.com/enjoylife/2289625

http://ualr.edu/nxyuruk/publications/kdd07.pdf

graphml writer

I just found your great package and your GraphML loader https://github.com/sparkling-graph/sparkling-graph/blob/master/loaders/src/main/scala/ml/sparkling/graph/loaders/graphml/GraphMLLoader.scala

Does a similar writer exist?

For GEXF I found the following code online, but its output does not play nicely with Gephi.

import org.apache.spark.graphx.Graph

// Serialises the whole graph to a GEXF 1.2 string.
// Note: collect brings every vertex and edge to the driver.
def toGexf[VD, ED](g: Graph[VD, ED]): String =
  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
    "<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n" +
    "  <graph mode=\"static\" defaultedgetype=\"directed\">\n" +
    "    <nodes>\n" +
    g.vertices.map(v => "      <node id=\"" + v._1 + "\" label=\"" +
      v._2 + "\" />\n").collect.mkString +
    "    </nodes>\n" +
    "    <edges>\n" +
    g.edges.map(e => "      <edge source=\"" + e.srcId +
      "\" target=\"" + e.dstId + "\" label=\"" + e.attr +
      "\" />\n").collect.mkString +
    "    </edges>\n" +
    "  </graph>\n" +
    "</gexf>"
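One thing the snippet above does not do is escape the vertex and edge labels, so any label containing `&`, `<` or `"` produces XML that is not well-formed. Whether this is what trips up Gephi in this case is not confirmed by the post; it is only one possible cause. A minimal sketch of escaping attribute values in plain Scala follows, where `GexfEscaping`, `escapeXml` and `nodeElement` are hypothetical helper names.

```scala
object GexfEscaping {
  /** Escapes the five characters that are special inside XML attribute values. */
  def escapeXml(raw: String): String =
    raw.flatMap {
      case '&'  => "&amp;"
      case '<'  => "&lt;"
      case '>'  => "&gt;"
      case '"'  => "&quot;"
      case '\'' => "&apos;"
      case c    => c.toString
    }

  /** Builds one GEXF node element with an escaped label. */
  def nodeElement(id: Long, label: String): String =
    s"""      <node id="$id" label="${escapeXml(label)}" />"""
}
```

In the writer above, the label expressions `v._2` and `e.attr` could be wrapped as `GexfEscaping.escapeXml(v._2.toString)` and `GexfEscaping.escapeXml(e.attr.toString)`.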

Calculate Eigenvector centrality with weighted edges

Hi everybody,

How can I calculate Eigenvector Centrality with weighted edges?

Thank you!

Louvain community detection method

Please implement the Louvain community detection method

Fast unfolding of communities in large networks,
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre,
Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008 (12pp)

Graph frames

Create an abstraction over graphs, with appropriate type classes, so that computations can run on GraphX graphs, GraphFrames and, in the future, other backends.

betweenness centrality index

I want to learn a bit more about big graph processing (https://github.com/geoHeil/graphFrameStarter) and would like to implement something similar to betweenness centrality. This could later be extended to the type of connection (chat, e-mail) and to incoming/outgoing edges. The following pseudocode shows what I want to achieve as a starting point:

for each vertex:
    calculate the quotient of
        the degree of all of its connections and
        the degree of connections to vertices with a certain label

for each vertex and its friends (ego network, 2-3 levels):
    for each node:
        calculate the quotient of
            the degree of all of its connections and
            the degree of connections to special vertex types
    aggregate as average

Do you think the idea of centrality is a good starting point?
What material would you suggest to get started? Maybe the result would be a nice fit for your library.
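The pseudocode in this post can be sketched on an in-memory adjacency map before moving it to GraphX or GraphFrames. This is only an illustration under stated assumptions: the names `LabelRatio`, `labelledNeighbourFraction` and `egoAverage` are hypothetical, the quotient is taken as labelled neighbours divided by all neighbours (the post does not fix the direction), the graph is treated as undirected, and the ego network is limited to one level.

```scala
object LabelRatio {
  /** For each vertex: fraction of its neighbours that carry `label` (0.0 for isolated vertices). */
  def labelledNeighbourFraction(
      adj: Map[Int, Set[Int]],
      labels: Map[Int, String],
      label: String): Map[Int, Double] =
    adj.map { case (v, neighbours) =>
      val matching = neighbours.count(n => labels.get(n).contains(label))
      v -> (if (neighbours.isEmpty) 0.0 else matching.toDouble / neighbours.size)
    }

  /** Average of the per-vertex fraction over a vertex and its direct neighbours (1-level ego network). */
  def egoAverage(
      adj: Map[Int, Set[Int]],
      labels: Map[Int, String],
      label: String): Map[Int, Double] = {
    val fraction = labelledNeighbourFraction(adj, labels, label)
    adj.map { case (v, neighbours) =>
      // Neighbours that are not keys of `adj` are skipped; v itself is always present.
      val ego = (neighbours + v).toSeq.flatMap(n => fraction.get(n))
      v -> ego.sum / ego.size
    }
  }
}
```

On a distributed graph the same two steps correspond to one neighbour aggregation for the fraction and a second one for the average, which is the message-passing pattern that GraphX's `aggregateMessages` provides.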

maven repo

Maybe there is something wrong with your Maven repository. I added the dependencies to my pom.xml, but the imports under ml.sparkling are reported as wrong.

Closeness centrality for a huge graph

Hi there,

I am using the 2.12:66565565-SNAPSHOT version of sparkling-graph, which is compatible with Scala 2.12.

I have a single CSV file of nodes and ~265M edges (4.5 GB) and I am trying to load it into sparkling-graph to calculate closeness centrality. The data is already in a numeric format. I have run into several things that I don't understand, and would like to know what I am doing wrong:

I had to provide the graph data type [Integer, Double] to run Closeness Centrality.

In the beginning I had the following code and experimented with loading a small graph (4 edges):

val filePath="s3_path"
val schema = StructType(
    StructField("vertex1", IntegerType, false) ::
    StructField("vertex2", IntegerType, false) :: Nil)
val graph = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

Once this has executed, I check that the data is loaded by counting the vertices with graph.vertices.count, and that seems to work.

Once I call graph.closenessCentrality(VertexMeasureConfiguration(treatAsUndirected=true)) I get:

<console>:58: error: value closenessCentrality is not a member of org.apache.spark.graphx.Graph[Nothing,Nothing]

I figured out that I need to specify graph type and changed the line to:

val graph : Graph[Integer, Integer] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

That got past the first error, but the error message changed to:

<console>:61: error: could not find implicit value for parameter num: Numeric[Integer]

The only configuration that allowed me to run Closeness Centrality was using Double in the graph type:

val schema = StructType(
    StructField("vertex1", IntegerType, false) ::
    StructField("vertex2", IntegerType, false) :: Nil)
    
val graph : Graph[Integer, Double] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

But this is rather odd, because the nodes are of the same type (they are both integers), so why should I convert the graph to Graph[Integer, Double]?
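The second error message points at Scala's `Numeric` type class rather than at the graph itself. The standard library provides `Numeric` instances for Scala's own `Int` and `Double`, but none for `java.lang.Integer`, which is what `Integer` refers to in Scala code. The sketch below shows this in isolation; `NumericDemo` and `total` are hypothetical names, and the conclusion that the measure needs a `Numeric` instance for the edge attribute type is inferred from the error message and from the fact that `Graph[Integer, Double]` compiled, not confirmed by the library source.

```scala
object NumericDemo {
  /** Sums values of any type for which a scala.math.Numeric instance is in implicit scope. */
  def total[T](values: Seq[T])(implicit num: Numeric[T]): T =
    values.foldLeft(num.zero)(num.plus)

  // Numeric[Int] and Numeric[Double] are provided by the standard library.
  val intTotal: Int = total(Seq(1, 2, 3))
  val doubleTotal: Double = total(Seq(1.5, 2.5))

  // The following line does not compile; the error is
  // "could not find implicit value for parameter num: Numeric[Integer]":
  // total(Seq[java.lang.Integer](1, 2, 3))
}
```

Under that reading, declaring the edge attribute as Scala's `Int` instead of `Integer` would be the thing to try; this has not been tested against the library here.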

Once I start to calculate closeness centrality, my Spark job fails because the maximum waiting time is reached.

My questions are:

Is sparkling-graph supposed to deal with graphs of that size?
If yes, how big should the Spark cluster be to compute closeness centrality (number of executorCores and executorMemory)?
Any hints on how I can make it work at all, or work faster?
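For context on the cost: exact closeness centrality needs the shortest-path distance from every vertex to every other vertex. On an unweighted graph that means one breadth-first search per vertex, so the work grows roughly with the number of vertices times the number of edges. The single-machine sketch below makes this explicit. It is not the library's implementation; the names `Closeness`, `distancesFrom` and `closeness` are hypothetical, and it uses one common definition (the reciprocal of the sum of distances to reachable vertices), which may differ from the one sparkling-graph applies.

```scala
import scala.collection.mutable

object Closeness {
  /** Breadth-first search: hop distance from `source` to every reachable vertex. */
  def distancesFrom(adj: Map[Int, Set[Int]], source: Int): Map[Int, Int] = {
    val dist = mutable.Map(source -> 0)
    val queue = mutable.Queue(source)
    while (queue.nonEmpty) {
      val v = queue.dequeue()
      for (w <- adj.getOrElse(v, Set.empty[Int]) if !dist.contains(w)) {
        dist(w) = dist(v) + 1
        queue.enqueue(w)
      }
    }
    dist.toMap
  }

  /** Reciprocal of the summed distances; 0.0 when no other vertex is reachable. */
  def closeness(adj: Map[Int, Set[Int]], v: Int): Double = {
    val total = distancesFrom(adj, v).values.sum
    if (total == 0) 0.0 else 1.0 / total
  }
}
```

Computing the score for all vertices repeats `distancesFrom` once per vertex, which is why exact closeness on a graph with hundreds of millions of edges is expensive on any cluster size.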

License change

Consider changing to a less restrictive license

There is no Betweenness API documentation

Hi
Is the API for Betweenness not implemented yet? I tried to find information on how to use it, but I cannot find anything.

Thanks in advance

Measures implicit methods

Add implicit methods for measures computation

val graph  = // graph creation
graph.hits()

Issue with Eigenvector Centrality?!

Hi,
I have a question about, or an issue with, Eigenvector Centrality. I have a graph for which this calculation produces results, but when I create a subgraph (or even import it manually as a new graph) the calculation takes longer and never seems to finish.

The only thing I have noticed is that before subgraphing, Freeman's centrality is <1, and for the subgraph Freeman's centrality is >1.

Not sure if anybody has any pointers.

val eic = graph.eigenvectorCentrality().vertices
val evusers = users.join(eic).map {
  case (id, (username, eic)) => (username, eic)
}

Thanks in advance!
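For readers following this thread, it may help to see how an iterative eigenvector-centrality computation decides when to stop. The sketch below is a plain power iteration on an undirected adjacency map with an iteration cap and a convergence flag. It is not sparkling-graph's implementation, the names `PowerIteration`, `maxIter` and `tol` are hypothetical, and it makes no claim about the cause of the behaviour reported above. It does show that the iteration can fail to settle on some graphs: on a star graph with a uniform start vector the scores alternate between two states indefinitely.

```scala
object PowerIteration {
  /**
   * Returns (scores, converged). Scores are L2-normalised.
   * `adj` must be symmetric and contain every vertex as a key.
   */
  def eigenvectorCentrality(
      adj: Map[Int, Set[Int]],
      maxIter: Int = 100,
      tol: Double = 1e-9): (Map[Int, Double], Boolean) = {
    val vertices = adj.keys.toSeq.sorted

    def normalise(x: Map[Int, Double]): Map[Int, Double] = {
      val norm = math.sqrt(x.values.map(v => v * v).sum)
      if (norm == 0.0) x else x.map { case (k, v) => k -> v / norm }
    }

    var scores = normalise(vertices.map(v => v -> 1.0).toMap)
    var iter = 0
    var converged = false
    while (iter < maxIter && !converged) {
      // New score of v = sum of the neighbours' current scores.
      // An iterator is used so that equal scores are not deduplicated by Set.map.
      val next = normalise(vertices.map(v => v -> adj(v).iterator.map(scores).sum).toMap)
      val delta = vertices.map(v => math.abs(next(v) - scores(v))).max
      converged = delta < tol
      scores = next
      iter += 1
    }
    (scores, converged)
  }
}
```

A triangle converges after the first step because every vertex keeps the same score, whereas a star with three leaves never meets the tolerance and stops only when `maxIter` is reached.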

PSCAN cannot find communities

Hello, I am trying to find communities using the PSCAN implementation in sparkling-graph. Following the PSCAN docs, I wrote the following code:

val conf = new SparkConf().setAppName("pscan-test").setMaster("local")
implicit val ctx:SparkContext = new SparkContext(conf)

val filePath = "path_to_edgelist_file"
val graph:Graph[String, Int] = LoadGraph.from(CSV(filePath))
        .using(NoHeader).using(Delimiter(","))
        .load[String, Int]()
val components:Graph[ComponentID, Int] = graph.PSCAN(epsilon = 0.1)
println("num communities: " + components.vertices.map{case (vId,cId)=>cId}.distinct.count)
components.vertices.take(10).foreach(println)

The docs say:

val components: Graph[ComponentID, Int] = graph.PSCAN(epsilon=0.5)
// Graph where each vertex is associated with its component identifier

But when I run the above code, no matter how I tune the value of epsilon, the number of communities always equals the number of vertices, and the component identifier of every vertex is always the same as its vertex id.

I'm wondering whether I misunderstand the docs or whether there is something wrong with PSCAN in sparkling-graph. Can anybody offer some help? Thanks in advance.

Here is my edges file (karate club graph):

0,1
0,2
0,3
0,4
0,5
0,6
0,7
0,8
0,10
0,11
0,12
0,13
0,17
0,19
0,21
0,31
1,2
1,3
1,7
1,13
1,17
1,19
1,21
1,30
2,3
2,7
2,8
2,9
2,13
2,27
2,28
2,32
3,7
3,12
3,13
4,6
4,10
5,6
5,10
5,16
6,16
8,30
8,32
8,33
9,33
13,33
14,32
14,33
15,32
15,33
18,32
18,33
19,33
20,32
20,33
22,32
22,33
23,25
23,27
23,29
23,32
23,33
24,25
24,27
24,31
25,31
26,29
26,33
27,33
28,31
28,33
29,32
29,33
30,32
30,33
31,32
31,33
32,33

add vertex itself to nodes' neighborhood in PSCAN

The neighborhood of each node in PSCAN should contain itself.

The neighborhood of a vertex is defined in the paper of SCAN:

DEFINITION 1 (VERTEX STRUCTURE)
Let v ∈ V, the structure of v is defined by its neighborhood,
denoted by Γ(v)
Γ(v) = {w ∈ V | (v,w) ∈ E} ∪ {v}
In Figure 1 vertex 6 is a hub sharing neighbors with two clusters.
If we only use the number of shared neighbors, vertex 6 will be
clustered into either of the clusters or cause the two clusters to
merge. Therefore, we normalize the number of common neighbors
by the geometric mean of the two neighborhoods’ size.

If the vertex itself is not included, the similarities of all node pairs will be smaller than the expected value.

For example, if two connected vertices have no common neighbours, their similarity will be 0, which is the same as for node pairs that are not connected by an edge. This is clearly not correct.
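The effect described here can be checked with a few lines of plain Scala. The sketch computes the structural similarity as the number of shared neighbourhood members divided by the geometric mean of the two neighbourhood sizes, once with the open neighbourhood and once with the closed neighbourhood Γ(v) from the definition quoted above. `StructuralSimilarity`, `similarity` and the `includeSelf` flag are hypothetical names for illustration and are not part of the library's PSCAN API.

```scala
object StructuralSimilarity {
  /**
   * |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| * |Γ(w)|).
   * With includeSelf = true, Γ(x) also contains x, as in the quoted definition.
   */
  def similarity(adj: Map[Int, Set[Int]], v: Int, w: Int, includeSelf: Boolean): Double = {
    def gamma(x: Int): Set[Int] = {
      val open = adj.getOrElse(x, Set.empty[Int])
      if (includeSelf) open + x else open
    }
    val gv = gamma(v)
    val gw = gamma(w)
    if (gv.isEmpty || gw.isEmpty) 0.0
    else (gv intersect gw).size / math.sqrt(gv.size.toDouble * gw.size)
  }
}
```

For a graph consisting of the single edge 0-1, the open-neighbourhood variant gives a similarity of 0.0 for the pair (0, 1), while the closed-neighbourhood variant gives 1.0, which matches the argument made in this issue.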

[question] Modify EigenvectorCentrality

I have a question about modifying eigenvector centrality.

As @riomus pointed out in #10, I should have a look at EigenvectorCentrality and GraphX to properly implement the iterative Pregel approach. I have looked at it and modified the class here https://github.com/geoHeil/graphFrameStarter/blob/master/src/main/scala/myOrg/sparklingGraph/FraudCentrality.scala to calculate the fraudulent percentage, but some questions are still open for me (see the TODO notes/questions at the link).

It would be great if someone could help me understand sparkling-graph better so that I can customize EigenvectorCentrality.