
dga-graphx

  • GraphX Algorithms

The dga-graphX package contains several pre-built executable graph algorithms built on Spark using the GraphX framework.

pre-requisites

build

If necessary, edit the build.gradle file to set your versions of Spark and GraphX.

gradle clean dist

Check the build/dist folder for dga-graphx-0.1.jar.

Algorithms

Louvain

about louvain

Louvain distributed community detection is a parallelized version of this work:

Fast unfolding of communities in large networks, 
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, 
Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P10008 (12pp)

In the original algorithm each vertex examines the communities of its neighbors and chooses a new community based on a function that maximizes the calculated change in modularity. In the distributed version all vertices make this choice simultaneously rather than in serial order, updating the graph state after each change. Because choices are made in parallel, some choices will be incorrect and will not maximize modularity; however, after repeated iterations community choices become more stable, and we get results that closely mirror the serial algorithm.
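The simultaneous-choice idea can be sketched in plain Scala. This is a toy illustration, not the project's code: every vertex adopts the community that is most common among its neighbors in the same synchronous round, and the rounds repeat until the labeling stabilizes. The graph, the label-count scoring, and the tie-break rule are illustrative assumptions, not LouvainCore's actual modularity-gain logic.

```scala
object ParallelChoice {
  // One synchronous round: all vertices choose at once against the
  // *previous* round's community assignment.
  def step(adj: Map[Int, Seq[Int]], community: Map[Int, Int]): Map[Int, Int] =
    adj.map { case (v, nbrs) =>
      // Count neighbor communities; break ties toward the lowest id so
      // that a simultaneous round stays deterministic.
      val best = nbrs.map(community).groupBy(identity)
        .map { case (c, xs) => (c, xs.size) }
        .maxBy { case (c, n) => (n, -c) }._1
      v -> best
    }

  def main(args: Array[String]): Unit = {
    val adj = Map(1 -> Seq(2, 3), 2 -> Seq(1, 3), 3 -> Seq(1, 2, 4), 4 -> Seq(3))
    // Start with every vertex in its own community, then iterate until
    // no vertex changes.
    var community = adj.keys.map(v => v -> v).toMap
    var next = step(adj, community)
    while (next != community) { community = next; next = step(adj, community) }
    println(community) // all four vertices end up in one community
  }
}
```

As the paragraph above notes, early parallel rounds can make "wrong" moves (vertices swapping labels with each other), but repeated rounds settle into a stable assignment.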

running louvain

After building the package (see above) you can execute the Louvain algorithm against an edge list using the provided script:

bin/louvain

Usage: class com.soteradefense.dga.graphx.louvain.Main$ [options] [<property>=<value>....]

  -i <value> | --input <value>
        input file or path  Required.
  -o <value> | --output <value>
        output path Required
  -m <value> | --master <value>
        spark master, local[N] or spark://host:port default=local
  -h <value> | --sparkhome <value>
        SPARK_HOME Required to run on cluster
  -n <value> | --jobname <value>
        job name
  -p <value> | --parallelism <value>
        sets spark.default.parallelism and minSplits on the edge file. default=based on input partitions
  -x <value> | --minprogress <value>
        Number of vertices that must change communities for the algorithm to consider progress. default=2000
  -y <value> | --progresscounter <value>
        Number of times the algorithm can fail to make progress before exiting. default=1
  -d <value> | --edgedelimiter <value>
        specify input file edge delimiter. default=","
  -j <value> | --jars <value>
        comma-separated list of jars
  -z <value> | --ipaddress <value>
        Set to true to convert ipaddresses to Long ids. Defaults to false
  <property>=<value>....

To run a small local example execute:

bin/louvain -i examples/small_edges.tsv -o test_output --edgedelimiter "\t" 2> stderr.txt

Spark produces a lot of output, so sending stderr to a log file is recommended. Examine the test_output folder; you should see:

test_output/
├── level_0_edges
│   ├── _SUCCESS
│   └── part-00000
├── level_0_vertices
│   ├── _SUCCESS
│   └── part-00000
└── qvalues
    ├── _SUCCESS
    └── part-00000
cat test_output/level_0_vertices/part-00000 
(7,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})
(4,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:4})
(2,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:4})
(6,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:4})
(8,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})
(5,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:4})
(9,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})
(3,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:4})
(1,{community:4,communitySigmaTot:21,internalWeight:0,nodeWeight:5})

cat test_output/qvalues/part-00000 
(0,0.4134948096885813)

Note: the output is laid out as if you were in HDFS, even when running locally. For each level you see an edges directory and a vertices directory. The "level" refers to the number of times the graph has been "community compressed". At level 1, all of the level 0 vertices in community X are represented by a single vertex with the VertexID X. For the small example all modularity was maximized with no community compression, so only level 0 was computed. The vertices files show the state of each vertex, while the edges files specify the graph structure. The qvalues directory lists the modularity of the graph at each level of compression. For this example you should see all of the vertices splitting into two distinct communities (4 and 8) with a final qvalue of ~0.413.
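The vertex lines above are easy to consume downstream. As a hedged sketch (this parser is not part of the project; the field names simply match the sample output shown above), one line of a level_*_vertices file can be unpacked like this:

```scala
object ParseVertex {
  // Regex matching one vertex line such as
  // (7,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})
  val Line =
    """\((\d+),\{community:(\d+),communitySigmaTot:(\d+),internalWeight:(\d+),nodeWeight:(\d+)\}\)""".r

  // Return (vertexId, communityId); other fields are ignored here.
  def parse(s: String): (Long, Long) = s match {
    case Line(id, community, _, _, _) => (id.toLong, community.toLong)
  }

  def main(args: Array[String]): Unit = {
    val (id, community) =
      parse("(7,{community:8,communitySigmaTot:13,internalWeight:0,nodeWeight:3})")
    println(s"vertex $id -> community $community") // vertex 7 -> community 8
  }
}
```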

running louvain on a cluster

To run on a cluster, make sure your input and output paths are of the form "hdfs:///path", and provide the --master and --sparkhome options. The --jars option is set by the louvain script itself and does not need to be supplied.

parallelism

To change the level of parallelism use the -p or --parallelism option. If this option is not set, parallelism defaults to the number of partitions of the input file in HDFS.

advanced

If you would like to include the Louvain algorithm in your own compute pipeline, or create a custom output format, you can do so by extending the com.soteradefense.dga.graphx.louvain.LouvainHarness class. See HDFSLouvainRunner, which extends LouvainHarness and is called by Main in the example above.
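The extension pattern is a template method: the harness drives the compression levels and calls a hook that the subclass overrides to decide where output goes. The sketch below shows only that shape; the class names and the saveLevel signature here are assumptions for illustration, not LouvainHarness's real API (which works with Spark graphs rather than plain values).

```scala
// Base "harness": drives the levels, delegates persistence to a hook.
abstract class Harness {
  // Hook: called once per compression level with that level's modularity.
  def saveLevel(level: Int, q: Double): Unit

  def run(qValues: Seq[Double]): Unit =
    qValues.zipWithIndex.foreach { case (q, level) => saveLevel(level, q) }
}

// A custom subclass that collects results in memory instead of HDFS.
class CollectingHarness extends Harness {
  val saved = scala.collection.mutable.Buffer.empty[(Int, Double)]
  override def saveLevel(level: Int, q: Double): Unit = saved.append((level, q))
}

object HarnessDemo {
  def main(args: Array[String]): Unit = {
    val h = new CollectingHarness
    h.run(Seq(0.41, 0.52))
    println(h.saved) // ArrayBuffer((0,0.41), (1,0.52))
  }
}
```

In the real project, HDFSLouvainRunner plays the role of CollectingHarness, writing each level's vertices, edges, and qvalue to HDFS.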


spark-distributed-louvain-modularity's Issues

A way to get the nodes within a community

After a couple of iterations the final graph to be saved is a compressed version of the original one. How can I assign communities to all of the original nodes in my graph, given some level?
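One way to approach this (a sketch, not code from the project): since a vertex id at level N is the community id it belonged to at level N-1, the per-level (vertex -> community) maps read from the level_*_vertices output can be composed from level 0 upward to land every original vertex in its final community.

```scala
object FlattenLevels {
  // Compose per-level maps: follow each original vertex through every
  // compressed level. A community id missing from the next level is
  // assumed to be unchanged.
  def resolve(levels: Seq[Map[Long, Long]]): Map[Long, Long] =
    levels match {
      case level0 +: rest =>
        rest.foldLeft(level0) { (acc, next) =>
          acc.map { case (v, c) => v -> next.getOrElse(c, c) }
        }
      case _ => Map.empty
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical two-level run: at level 1, communities 4 and 8 merge.
    val level0 = Map(1L -> 4L, 2L -> 4L, 6L -> 8L)
    val level1 = Map(4L -> 4L, 8L -> 4L)
    println(resolve(Seq(level0, level1))) // every original vertex -> 4
  }
}
```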

Compile Java error

Command:

gradle clean dist -Pcdhversion=cdh4

Error:
Task :dga-giraph:compileJava FAILED
/home/hduser/distributed-graph-analytics/dga-giraph/src/main/java/com/soteradefense/dga/DGAYarnRunner.java:21: error: package org.apache.hadoop.yarn.conf does not exist
import org.apache.hadoop.yarn.conf.YarnConfiguration;
^
/home/hduser/distributed-graph-analytics/dga-giraph/src/main/java/com/soteradefense/dga/DGAYarnRunner.java:28: error: cannot find symbol
UserGroupInformation.createRemoteUser(YarnConfiguration.DEFAULT_NM_NONSECURE_MODE_LOCAL_USER).doAs(new PrivilegedAction() {
^
symbol: variable YarnConfiguration
location: class DGAYarnRunner
Note: /home/hduser/distributed-graph-analytics/dga-giraph/src/main/java/com/soteradefense/dga/io/formats/DGAVertexOutputFormat.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
2 errors

FAILURE: Build failed with an exception.

  • What went wrong:
    Execution failed for task ':dga-giraph:compileJava'.

Please help

could not find main class

hi,
I am trying to run your code on an Ubuntu machine.
After compiling and packaging, I tried the louvain command but I always get this error:

Error: Could not find or load main class com.soteradefense.dga.graphx.louvain.Main

What could be the cause, and how can I fix it?

Regards
aljawarneh

Error: Could not find or load main class com.soteradefense.dga.graphx.DGARunner

hi,
I am trying to run the same code on Spark 2.1.0 (Hadoop 2.7) and Scala 2.11.4. When I try to set my version in build.gradle, I am getting the following warnings during "gradle clean dist":

/home/hduser/distributed-graph-analytics/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/hbse/HighBetweennessCore.scala:357: method mapReduceTriplets in class Graph is deprecated: use aggregateMessages
    val pingRDD = hbseGraph.mapReduceTriplets(sendPingMessage, merge[Long]).cache()
                            ^
/home/hduser/distributed-graph-analytics/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/hbse/HighBetweennessCore.scala:390: method mapReduceTriplets in class Graph is deprecated: use aggregateMessages
    val msgRDD = mergedGraph.mapReduceTriplets(sendDependencyMessage, merge[(Long, Double, Long)]).cache()
                             ^
/home/hduser/distributed-graph-analytics/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/kryo/DGAKryoRegistrator.scala:32: class GraphKryoRegistrator in package graphx is deprecated: Register GraphX classes with Kryo using GraphXUtils.registerKryoClasses
class DGAKryoRegistrator extends GraphKryoRegistrator {
                                 ^
/home/hduser/distributed-graph-analytics/dga-graphx/src/main/scala/com/soteradefense/dga/graphx/wcc/WeaklyConnectionComponentsCore.scala:33: method mapReduceTriplets in class Graph is deprecated: use aggregateMessages
    val initialComponentCalculation: VertexRDD[VertexId] = graph.mapReduceTriplets(triplet => {
                                                                 ^
four warnings found
:dga-graphx:processResources
:dga-graphx:classes
:dga-graphx:jar
:dga-graphx:assemble
:dga-graphx:distConf
:dga-graphx:distJars
:dga-graphx:dist

BUILD SUCCESSFUL

When I try to run louvain, I get the following error:
Error: Could not find or load main class com.soteradefense.dga.graphx.DGARunner

I also tried rewriting the dga-mr1-graphx like:

#! /bin/bash

export DGA_CLASSPATH=/distributed-graph-analytics/dga-graphx/build/dist/lib
export SPARK_JARS_ASSEMBLY=/usr/local/spark/spark-2.1.0-bin-hadoop2.7/jars/:/usr/local/src/scala/scala-2.11.8/lib/

T="$(date +%s)"

java -cp $SPARK_JARS_ASSEMBLY:$DGA_CLASSPATH/dga-graphx-0.1.jar:$DGA_CLASSPATH/dga-core-0.0.1.jar:$DGA_CLASSPATH/config-1.2.1.jar:./conf com.soteradefense.dga.graphx.DGARunner "$@"

T="$(($(date +%s)-T))"
echo "Time in seconds: ${T}"

but I keep getting the same error

Kindly help me with the issue. I am new to Spark and community detection.

Might be a bug?

In line 233 of this file, when calculating k_i_in, the variable "internalWeight" is added if the community that the node is in and the community the node is testing are the same. If I understand it correctly, "internalWeight" is the node's self-loop edge. If that is the case, the weight of the self loop should not be included in k_i_in, because both ends of that edge are the same node, and are therefore always in the same community. @eric-kimbrel

LouvainCore deltaQ function

In line 240 of LouvainCore.scala: deltaQ = k_i_in - ( k_i * sigma_tot / M )
M is the total weight of the graph. My understanding is that M = 2m (where m is the number of edges),
so deltaQ should perhaps be calculated as k_i_in - ( k_i * sigma_tot / (M/2) ) instead? @eric-kimbrel
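For reference, the gain expression being debated can be written as a plain function; the numbers below are illustrative values, not taken from LouvainCore.scala, and the example only shows how halving the denominator changes the sign of the gain, not which convention is correct.

```scala
object DeltaQ {
  // deltaQ = k_i_in - k_i * sigma_tot / m, the form quoted from line 240.
  // Whether m should be the total weight M or M/2 is the issue's question.
  def deltaQ(kIIn: Double, kI: Double, sigmaTot: Double, m: Double): Double =
    kIIn - kI * sigmaTot / m

  def main(args: Array[String]): Unit = {
    // Illustrative inputs: k_i_in = 3, k_i = 5, sigma_tot = 21, M = 50.
    println(deltaQ(3, 5, 21, 50)) // 3 - 5*21/50  ≈  0.9 (move looks good)
    // Halving the denominator, as the issue proposes:
    println(deltaQ(3, 5, 21, 25)) // 3 - 5*21/25  ≈ -1.2 (move rejected)
  }
}
```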

EMR and YARN

Has anyone been successful in getting this to work on an EMR cluster, which is managed by YARN? Everything is running fine locally, but I have not been able to get this to work with the cluster mode.
