
SparklingGraph provides an easy-to-use set of features for processing large-scale graphs using Spark and GraphX.

Home Page: http://sparkling.ml

License: BSD 2-Clause "Simplified" License

Scala 100.00%
graph measure spark machine-learning comunity-detection-methods network-analysis graph-algorithms big-data vertex link-predication

sparkling-graph's Introduction

sparkling-graph



Requirements

  • Scala 2.11 or 2.12
  • Spark 2.4.0 (or compatible)

Versioning

Since commit 3246714, the project uses git-based versioning (for example 0.0.7+140-32467140 or 0.0.7+140-32467140+20190402-2057-SNAPSHOT). From now on, all artifacts are published to the snapshot repository without version overriding. The new approach also makes it possible to reproduce each version. Release versions continue to use the normal tag-based approach.
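The example version strings above can be taken apart with a small regular expression. This is a minimal sketch, assuming the fields mean "base tag + number of commits since the tag - abbreviated commit hash", optionally followed by a build timestamp and a -SNAPSHOT suffix; that reading follows the usual git describe layout and is not confirmed by the project. `GitVersion` and `Parsed` are hypothetical names used only for illustration.

```scala
object GitVersion {
  // Hypothetical container for the parts of a git-derived version string.
  final case class Parsed(
      tag: String,               // base tag, e.g. "0.0.7"
      distance: Int,             // assumed: commits since that tag
      sha: String,               // assumed: abbreviated commit hash
      timestamp: Option[String], // present only in the longer form
      snapshot: Boolean)

  private val Pattern =
    """(\d+\.\d+\.\d+)\+(\d+)-([0-9a-f]+)(?:\+(\d{8}-\d{4}))?(-SNAPSHOT)?""".r

  // Returns None for strings that do not follow the git-derived layout (e.g. plain "0.0.7").
  def parse(version: String): Option[Parsed] = version match {
    case Pattern(tag, distance, sha, timestamp, snapshot) =>
      // Unmatched optional groups come back as null, hence Option(...) and the null check.
      Some(Parsed(tag, distance.toInt, sha, Option(timestamp), snapshot != null))
    case _ => None
  }
}
```

For instance, `GitVersion.parse("0.0.7+140-32467140")` yields the tag 0.0.7, a distance of 140 and the hash 32467140, while a plain release version such as 0.0.7 is not matched.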

Dependencies

Since commit 3246714, you can get artifacts for any commit on the master branch by using the version reported by the git describe command.

Snapshot

resolvers +=  "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
// one or all from:
libraryDependencies += "ml.sparkling" %% "sparkling-graph-examples" % "0.0.8-SNAPSHOT"
libraryDependencies += "ml.sparkling" %% "sparkling-graph-loaders" % "0.0.8-SNAPSHOT"
libraryDependencies += "ml.sparkling" %% "sparkling-graph-operators" % "0.0.8-SNAPSHOT"

Release

// one or all from:
libraryDependencies += "ml.sparkling" %% "sparkling-graph-examples" % "0.0.7"
libraryDependencies += "ml.sparkling" %% "sparkling-graph-loaders" % "0.0.7"
libraryDependencies += "ml.sparkling" %% "sparkling-graph-operators" % "0.0.7"

Current features

  • Loading
    • Formats:
      • CSV
      • GraphML
    • DSL
  • Measures - measures can be configured to treat graphs as directed or undirected
    • Measures DSL - easy-to-use domain-specific language that boosts productivity with the library
    • Graph
      • Modularity
      • Freeman's network centrality
    • Vertex
      • Closeness
      • Local clustering
      • Eigenvector
      • Hits
      • Neighbor connectivity
      • Vertex embeddedness
      • Betweenness
        • Edmonds
        • Flow
        • Hua
    • Edges
      • Adamic/Adar
      • Common neighbours
  • Community detection methods
    • PSCAN (SCAN)
  • Graph coarsening
    • Label Propagation based
  • Link prediction
    • Similarity measure based
  • Generators
    • Ring
    • Watts And Strogatz
  • Experiments
    • Describe graph using all measures to CSV files

Planned features

  • Loading
    • GML
  • Measures
    • Katz
  • Community detection methods
    • Modularity maximization
    • Infomap
  • More Generators
  • API
    • Random walk
    • BFS
  • ML
    • Vertex classification

Used by

Supported by:

provides us awesome IDE

How to

Please check API, examples or docs

Citation

If you use SparklingGraph in your research and publish it, please consider citing us; it will help us get funding for making the library better. The manuscript is currently in preparation, so please use the following references:

Bartusiak et al. (2017). SparklingGraph: large scale, distributed graph processing made easy. Manuscript in preparation.

@unpublished{sparkling-graph,
  title = {SparklingGraph: large scale, distributed graph processing made easy},
  author = {Bartusiak, R. and Kajdanowicz, T.},
  note = {Manuscript in preparation},
  year = {2017}
}

License

FOSSA Status

sparkling-graph's People

Contributors

fossabot, kajdanowicz, mizvol, mssemik, riomus


sparkling-graph's Issues

localClustering and eigenvectorCentrality Issues

// Local Clustering Score
val clusteringGraph: Graph[Double, Int] = g.localClustering(VertexMeasureConfiguration(treatAsUndirected=true))
val localCentralityRDD: VertexRDD[Double] = clusteringGraph.vertices

// Eigen Vector Score
val eigenvectorRDD: VertexRDD[Double] = g.eigenvectorCentrality(VertexMeasureConfiguration(treatAsUndirected=true)).vertices

I am trying to calculate those two measures for a graph. However, the following error is shown:
value eigenvectorCentrality is not a member of org.apache.spark.graphx.Graph[Int,Int]
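
An error of the form "value X is not a member of Graph[...]" usually means that the implicit conversion which adds the method is not in scope at the call site. The sketch below reproduces that mechanism in plain Scala; `FakeGraph`, `MeasuresDSL` and `degreeCentrality` are hypothetical names and not part of SparklingGraph. For the library itself, it is worth checking that the DSL import shown in the project docs (believed to be `ml.sparkling.graph.operators.OperatorsDSL._`; please verify against the docs) is present in the file.

```scala
// Hypothetical stand-in for a graph type that has no centrality methods of its own.
final case class FakeGraph(edges: Seq[(Int, Int)])

object MeasuresDSL {
  // Extension methods become visible only where this implicit class is imported.
  implicit class GraphMeasures(val g: FakeGraph) extends AnyVal {
    // Degree of each vertex, treating edges as undirected.
    def degreeCentrality: Map[Int, Int] =
      g.edges
        .flatMap { case (a, b) => Seq(a, b) }
        .groupBy(identity)
        .map { case (v, occurrences) => v -> occurrences.size }
  }
}

object Demo {
  // Without this import the compiler reports:
  // "value degreeCentrality is not a member of FakeGraph"
  import MeasuresDSL._

  val degrees: Map[Int, Int] = FakeGraph(Seq(1 -> 2, 2 -> 3)).degreeCentrality
}
```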

graphml writer

I just found your great package and your graphml loader https://github.com/sparkling-graph/sparkling-graph/blob/master/loaders/src/main/scala/ml/sparkling/graph/loaders/graphml/GraphMLLoader.scala

and wonder if a similar writer exists?

For GEXF, I found the following code online, but its output does not play nicely with Gephi.

import org.apache.spark.graphx.Graph

// Builds a GEXF 1.2 document as one string; all vertices and edges are collected to the driver.
def toGexf[VD, ED](g: Graph[VD, ED]): String =
  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
    "<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n" +
    "  <graph mode=\"static\" defaultedgetype=\"directed\">\n" +
    "    <nodes>\n" +
    g.vertices.map(v => "      <node id=\"" + v._1 + "\" label=\"" +
      v._2 + "\" />\n").collect.mkString +
    "    </nodes>\n" +
    "    <edges>\n" +
    g.edges.map(e => "      <edge source=\"" + e.srcId +
      "\" target=\"" + e.dstId + "\" label=\"" + e.attr +
      "\" />\n").collect.mkString +
    "    </edges>\n" +
    "  </graph>\n" +
    "</gexf>"
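
Two details in the snippet above could plausibly upset a GEXF reader: labels are written without XML escaping, and edges carry no id attribute, which the GEXF 1.2 schema appears to require. Whether either is the actual cause of the Gephi problem is an assumption. The sketch below addresses both in plain Scala; `GexfWriter` is a hypothetical helper, and the two sequences stand in for the collected contents of `g.vertices` and `g.edges`.

```scala
object GexfWriter {
  // Escapes the five characters that are special inside XML attribute values.
  def escape(s: String): String = {
    val sb = new StringBuilder
    s.foreach {
      case '&'  => sb.append("&amp;")
      case '<'  => sb.append("&lt;")
      case '>'  => sb.append("&gt;")
      case '"'  => sb.append("&quot;")
      case '\'' => sb.append("&apos;")
      case c    => sb.append(c)
    }
    sb.toString
  }

  // vertices: (id, label); edges: (source id, target id, label).
  def toGexf(vertices: Seq[(Long, String)], edges: Seq[(Long, Long, String)]): String = {
    val nodeLines = vertices.map { case (id, label) =>
      s"""      <node id="$id" label="${escape(label)}" />"""
    }
    // Each edge gets a synthetic id taken from its position in the sequence.
    val edgeLines = edges.zipWithIndex.map { case ((src, dst, label), i) =>
      s"""      <edge id="$i" source="$src" target="$dst" label="${escape(label)}" />"""
    }
    val header = Seq(
      """<?xml version="1.0" encoding="UTF-8"?>""",
      """<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">""",
      """  <graph mode="static" defaultedgetype="directed">""",
      "    <nodes>")
    val middle = Seq("    </nodes>", "    <edges>")
    val footer = Seq("    </edges>", "  </graph>", "</gexf>")
    (header ++ nodeLines ++ middle ++ edgeLines ++ footer).mkString("\n")
  }
}
```

With a GraphX graph, the inputs could be produced by collecting the vertices and edges and converting their attributes to strings first.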

Louvain community detection method

Please implement the Louvain community detection method.

Fast unfolding of communities in large networks,
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre,
Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008 (12pp)

Graph frames

Create an abstraction over graphs, with appropriate type classes, so that computations can run on GraphX graphs, GraphFrames and, in the future, other graph representations.

betweenness centrality index

I want to learn a bit more about big graph processing (https://github.com/geoHeil/graphFrameStarter) and would like to implement something a bit similar to betweenness centrality. This could later be extended to the type of connection (chat, e-mail) and to incoming/outgoing edges. The following pseudocode shows what I would like to achieve as a start:

for each vertex:
    calculate the quotient of
        the degree of all of its connections and
        the degree of connections to vertices with a certain label

for each vertex and its friends (ego network, 2-3 levels):
    for each node:
        calculate the quotient of
            the degree of all of its connections and
            the degree of connections to special vertex types

    aggregate as average

Do you think the idea of centrality is good for a starter?
What material would you suggest to get started? Maybe the result would be a nice fit for your library.
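
The pseudocode above can be sketched on plain Scala collections before moving it to GraphX. This is a minimal sketch under several assumptions: edges are treated as undirected, the quotient is taken as "labelled neighbours / all neighbours" (the inverse of the order written above, chosen to avoid division by zero), and the ego network is limited to one hop. `LabelRatio` and its methods are hypothetical names.

```scala
object LabelRatio {
  type Adjacency = Map[Int, Set[Int]]

  // Neighbour sets for an undirected reading of the edge list.
  def adjacency(edges: Seq[(Int, Int)]): Adjacency =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // For every vertex: share of its neighbours that carry the given label.
  def fraction(adj: Adjacency, labels: Map[Int, String], label: String): Map[Int, Double] =
    adj.map { case (v, ns) =>
      v -> ns.count(n => labels.get(n).contains(label)).toDouble / ns.size
    }

  // For every vertex: average of the fractions over its 1-hop ego network (itself plus neighbours).
  def egoAverage(adj: Adjacency, fractions: Map[Int, Double]): Map[Int, Double] =
    adj.map { case (v, ns) =>
      val ego = (ns + v).toSeq // toSeq keeps equal fractions from being merged
      v -> ego.map(fractions).sum / ego.size
    }
}
```

For the edge list (1,2), (1,3) with only vertex 2 labelled "fraud", vertex 1 gets a fraction of 0.5 and vertex 2 gets an ego average of 0.25.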

maven repo

Maybe there is something wrong with your Maven repository. I added the dependencies to my pom.xml, but the imports fail at ml.sparkling.

Closeness centrality for a huge graph

Hi there,

I am using the 2.12:66565565-SNAPSHOT version of SparklingGraph, which is compatible with Scala 2.12.

I have a single CSV file of nodes and ~265M edges (4.5 GB), and I am trying to load it into SparklingGraph to calculate closeness centrality. The data is already in a numeric format. I have run into several things that I don't understand, and I would like to understand what I am doing wrong:

  1. Had to provide graph data type [Integer, Double] to run Closeness Centrality.

In the beginning I had the following code, and I experimented with loading a small graph (4 edges):

val filePath="s3_path"
val schema = StructType(
    StructField("vertex1", IntegerType, false) ::
    StructField("vertex2", IntegerType, false) :: Nil)
val graph = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

Once this executed, I want to make sure that the data is loaded by calculating number of vertices graph.vertices.count, and it seems to work.

Once I call graph.closenessCentrality(VertexMeasureConfiguration(treatAsUndirected=true)) I get:

<console>:58: error: value closenessCentrality is not a member of org.apache.spark.graphx.Graph[Nothing,Nothing]

I figured out that I need to specify graph type and changed the line to:

val graph : Graph[Integer, Integer] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

That got past the first error, but the error message changed to:

<console>:61: error: could not find implicit value for parameter num: Numeric[Integer]

The only configuration that allowed me to run Closeness Centrality was using Double in the graph type:

val schema = StructType(
    StructField("vertex1", IntegerType, false) ::
    StructField("vertex2", IntegerType, false) :: Nil)
    
val graph : Graph[Integer, Double] = LoadGraph.from(CSV(filePath)).using(Schema(schema)).using(NoHeader).load()

But this is kinda weird, because the nodes are of the same type (they are both integers), so why should I convert the graph to Graph[Integer, Double]?
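
The message "could not find implicit value for parameter num: Numeric[Integer]" points at the boxed Java type: the Scala standard library provides Numeric instances for Int and Double but not for java.lang.Integer. The sketch below shows this in plain Scala; `NumericDemo` and `total` are hypothetical names. That the library needs the Numeric instance for the edge attribute type, and that Graph[Int, Int] would therefore also work, is an assumption based on the error message.

```scala
object NumericDemo {
  // Requires a Numeric instance for T, like the parameter named in the error message.
  def total[T](xs: Seq[T])(implicit num: Numeric[T]): T =
    xs.foldLeft(num.zero)(num.plus)

  // Numeric[Int] and Numeric[Double] are provided by the standard library.
  val ints: Int = total(Seq(1, 2, 3))
  val doubles: Double = total(Seq(1.0, 2.5))

  // java.lang.Integer has no Numeric instance, so this line would not compile:
  // total(Seq[Integer](1, 2, 3))

  // Unboxing to Int first makes the same data usable.
  val boxed: Seq[Integer] = Seq[Integer](1, 2, 3)
  val unboxed: Int = total(boxed.map(_.intValue))
}
```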

Once I start to calculate closeness centrality, my Spark job fails because the maximum waiting time is reached.

My questions are:

  1. Is Sparkling-Graph supposed to deal with graphs of that size?
  2. If yes, how big should the Spark cluster be to compute closeness centrality (# of executorCores and executorMemory)?
  3. Any hints on how I can make it work at all, or work faster?

Issue with Eigenvector Centrality?!

Hi,
I have a question about, or an issue with, Eigenvector Centrality. I have a graph for which this calculation produces results, but when I create a subgraph (or even import it manually as a new graph) the calculation takes longer and never seems to finish.

The only thing that I've noticed is that before subgraphing, Freeman's centrality is <1, and for the subgraph Freeman's centrality is >1.

Not sure if anybody has any pointers.

val eic = graph.eigenvectorCentrality().vertices
val evusers = users.join(eic).map {
  case (id, (username, eic)) => (username, eic)
}

Thanks in advance!

PSCAN cannot find communities

Hello, I am trying to find communities using PSCAN from sparkling-graph. Following the PSCAN docs, I wrote the following code:

val conf = new SparkConf().setAppName("pscan-test").setMaster("local")
implicit val ctx:SparkContext = new SparkContext(conf)

val filePath = "path_to_edgelist_file"
val graph:Graph[String, Int] = LoadGraph.from(CSV(filePath))
        .using(NoHeader).using(Delimiter(","))
        .load[String, Int]()
val components:Graph[ComponentID, Int] = graph.PSCAN(epsilon = 0.1)
println("num communities: " + components.vertices.map{case (vId,cId)=>cId}.distinct.count)
components.vertices.take(10).foreach(println)

The docs say:

val components: Graph[ComponentID, Int] = graph.PSCAN(epsilon=0.5)
// Graph where each vertex is associated with its component identifier

But when I run the above code, I find that, no matter how I tune the value of epsilon, the number of communities is always equal to the number of vertices, and the component identifier of every vertex is always the same as its vertex id.

I'm wondering whether I misunderstand the docs or whether there is something wrong with PSCAN in sparkling-graph. Can anybody offer some help? Thanks in advance.


Here is my edge file (karate club graph):

0,1
0,2
0,3
0,4
0,5
0,6
0,7
0,8
0,10
0,11
0,12
0,13
0,17
0,19
0,21
0,31
1,2
1,3
1,7
1,13
1,17
1,19
1,21
1,30
2,3
2,7
2,8
2,9
2,13
2,27
2,28
2,32
3,7
3,12
3,13
4,6
4,10
5,6
5,10
5,16
6,16
8,30
8,32
8,33
9,33
13,33
14,32
14,33
15,32
15,33
18,32
18,33
19,33
20,32
20,33
22,32
22,33
23,25
23,27
23,29
23,32
23,33
24,25
24,27
24,31
25,31
26,29
26,33
27,33
28,31
28,33
29,32
29,33
30,32
30,33
31,32
31,33
32,33

add vertex itself to nodes' neighborhood in PSCAN

The neighborhood of each node in PSCAN should contain the node itself.

The neighborhood of a vertex is defined in the paper of SCAN:

DEFINITION 1 (VERTEX STRUCTURE)
Let v ∈ V, the structure of v is defined by its neighborhood,
denoted by Γ(v)
Γ(v) = {w ∈ V | (v,w) ∈ E} ∪ {v}
In Figure 1 vertex 6 is a hub sharing neighbors with two clusters.
If we only use the number of shared neighbors, vertex 6 will be
clustered into either of the clusters or cause the two clusters to
merge. Therefore, we normalize the number of common neighbors
by the geometric mean of the two neighborhoods’ size.

If the vertex itself is not included, the similarities of all node pairs will be smaller than the expected values.

For example, if two connected vertices have no common neighbors, the similarity will be 0, which is the same as for node pairs that are not connected by an edge. This is clearly not correct.
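
The effect can be checked with a small plain-Scala sketch of the structural similarity described in the quoted definition, that is, the number of common neighbors divided by the geometric mean of the two neighborhood sizes. `StructuralSimilarity` and the `includeSelf` flag are hypothetical names for illustration; this is not the library's PSCAN code.

```scala
object StructuralSimilarity {
  // Neighbourhoods for an undirected edge list; optionally adds each vertex to its own neighbourhood.
  def neighbourhoods(edges: Seq[(Int, Int)], includeSelf: Boolean): Map[Int, Set[Int]] =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) =>
        val ns = pairs.map(_._2).toSet
        v -> (if (includeSelf) ns + v else ns)
      }

  // sigma(v, w) = |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| * |Γ(w)|)
  def similarity(gamma: Map[Int, Set[Int]], v: Int, w: Int): Double = {
    val gv = gamma(v)
    val gw = gamma(w)
    (gv intersect gw).size / math.sqrt(gv.size.toDouble * gw.size)
  }
}
```

For a graph with the single edge (1,2), the similarity of vertices 1 and 2 is 0.0 when neighborhoods exclude the vertex itself and 1.0 when they include it.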

[question] Modify EigenvectorCentrality

I have a question about modifying eigenvector centrality.

As @riomus pointed out in #10, I should have a look at EigenvectorCentrality and GraphX to properly implement the iterative Pregel approach. I have looked at it and modified the class at https://github.com/geoHeil/graphFrameStarter/blob/master/src/main/scala/myOrg/sparklingGraph/FraudCentrality.scala to calculate the fraudulent percentage, but some questions are still open for me (see the TODO notes and questions at the link).

It would be great if someone could help me to understand sparkling graph better in order to customize EigenvectorCentrality.

graph frames

Are there any plans to move from graphX to graphFrames?
