CS5300 Project 2 Spring 2015

Fast Convergence PageRank in Hadoop http://edu-cornell-cs-cs5300s15-proj2.s3.amazonaws.com/project2.html

Yu-Che Hsueh & Shaoke Xu

Folders

input: stores pre-processed input data.
result: contains files recorded the residual error, average number of iteration per block and page rank value of the two lowest-numbered nodes in each block.
src: source codes
jar: backup jar file

Input data format

Simple PageRank: <initial_PageRank> <outgoing_list>
Block PageRank: <initial_PageRank> <outgoing_list>

Data processing

WebPageScanner.java: capture data from web page
RemoveReject.java: filter data by function in instruction
CombineOutgoing.java: combine outgoing nodes for each source
addBlockID.java: add block ID in the first column
RandomPartition.java: use hash function to add blockID (for 7.2 Random Block Partition)

Simple PageRank

SimplePR.java: main function. Setting parameters of job and input/output path and getting average residual from Counter.
SimpleMap.java: mapper function. Calculating partial new PageRank for each node.
SimpleReduce.java: reducer function. Calculating total new PageRank for each node and updating residual by Counter.
Counter.java: enum class that store residual error

Block PageRank

NodeData.java: self-defined data structure used to represent a node, including nodeID, edgeList, pageRank and degrees.
PageRankBlock.java: main function. Setting parameters of job and input/output path and getting average residual from Counter. Iterating loop until satisfy the requirement of residual error.
PageRankBlockMapper.java: mapper function. For more information, see slides.
PageRankBlockReducer.java: reducer function. For more information, see slides.

How To Run Block Map Reduce on EMR

Compile JAR by Eclipse Remember to use Java 1.7 and include hadoop external JAR
Upload data to AWS S3
Configure Command Line Interface(CLI) and create key pair http://docs.aws.amazon.com/cli/latest/userguide/tutorial-ec2-ubuntu.html
Create EMR cluster by console
SSH to master ssh [email protected] -i ~/devenv-key.pem
Upload JAR scp -i ~/devenv-key.pem PageRankBlock.jar [email protected]:PageRankBlock.jar
Run JAR bin/hadoop jar PageRankBlock.jar PageRankBlock s3://<bucket_name>/<file_name> s3://<output_bucket>/

Filter Parameters

Compute filter parameters for netID sx77 double fromNetID = 0.77; double rejectMin = 0.9 * fromNetID; (0.693) double rejectLimit = rejectMin + 0.01; (0.703)

Number of Nodes: 619,588 from 685,230

Comparison:

Jacobi vs Gauss-Seidel From the results (appendix), we can notice that the residual errors in two methods are similar while the average numbers of iterations per block are different, especially at the beginning. Gauss method performs better since it reduces the times of iteration. For example, average numbers of iterations per block in Gauss are 9.794118, 5.352941, 4.529412, etc. There are 16.029411, 7.897059, 5.9558825 and so on in the Jacobi method.
Convergence of node-by-node, intelligently blocked and randomly blocked
node-by-node: 21 times
intelligently blocked: 6 times
randomly blocked(Jacobi): 22 times
randomly blocked(Gauss): 22 times

Reference

AWS Documentation https://github.com/MatthewGreen/CS5300-Project2

xushaoke / 5300proj2 Goto Github PK

5300proj2's Introduction

CS5300 Project 2 Spring 2015

Folders

Input data format

Data processing

Simple PageRank

Block PageRank

How To Run Block Map Reduce on EMR

Filter Parameters

Comparison:

Reference

5300proj2's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent