Fast Convergence PageRank in Hadoop http://edu-cornell-cs-cs5300s15-proj2.s3.amazonaws.com/project2.html
Yu-Che Hsueh & Shaoke Xu
- input: stores pre-processed input data.
- result: contains files recorded the residual error, average number of iteration per block and page rank value of the two lowest-numbered nodes in each block.
- src: source codes
- jar: backup jar file
- Simple PageRank: <initial_PageRank> <outgoing_list>
- Block PageRank: <initial_PageRank> <outgoing_list>
- WebPageScanner.java: capture data from web page
- RemoveReject.java: filter data by function in instruction
- CombineOutgoing.java: combine outgoing nodes for each source
- addBlockID.java: add block ID in the first column
- RandomPartition.java: use hash function to add blockID (for 7.2 Random Block Partition)
- SimplePR.java: main function. Setting parameters of job and input/output path and getting average residual from Counter.
- SimpleMap.java: mapper function. Calculating partial new PageRank for each node.
- SimpleReduce.java: reducer function. Calculating total new PageRank for each node and updating residual by Counter.
- Counter.java: enum class that store residual error
- NodeData.java: self-defined data structure used to represent a node, including nodeID, edgeList, pageRank and degrees.
- PageRankBlock.java: main function. Setting parameters of job and input/output path and getting average residual from Counter. Iterating loop until satisfy the requirement of residual error.
- PageRankBlockMapper.java: mapper function. For more information, see slides.
- PageRankBlockReducer.java: reducer function. For more information, see slides.
-
Compile JAR by Eclipse Remember to use Java 1.7 and include hadoop external JAR
-
Upload data to AWS S3
-
Configure Command Line Interface(CLI) and create key pair
http://docs.aws.amazon.com/cli/latest/userguide/tutorial-ec2-ubuntu.html
-
Create EMR cluster by console
-
SSH to master
ssh [email protected] -i ~/devenv-key.pem
-
Upload JAR
scp -i ~/devenv-key.pem PageRankBlock.jar [email protected]:PageRankBlock.jar
-
Run JAR
bin/hadoop jar PageRankBlock.jar PageRankBlock s3://<bucket_name>/<file_name> s3://<output_bucket>/
Compute filter parameters for netID sx77 double fromNetID = 0.77; double rejectMin = 0.9 * fromNetID; (0.693) double rejectLimit = rejectMin + 0.01; (0.703)
Number of Nodes: 619,588 from 685,230
-
Jacobi vs Gauss-Seidel From the results (appendix), we can notice that the residual errors in two methods are similar while the average numbers of iterations per block are different, especially at the beginning. Gauss method performs better since it reduces the times of iteration. For example, average numbers of iterations per block in Gauss are 9.794118, 5.352941, 4.529412, etc. There are 16.029411, 7.897059, 5.9558825 and so on in the Jacobi method.
-
Convergence of node-by-node, intelligently blocked and randomly blocked
-
node-by-node: 21 times
-
intelligently blocked: 6 times
-
randomly blocked(Jacobi): 22 times
-
randomly blocked(Gauss): 22 times
AWS Documentation https://github.com/MatthewGreen/CS5300-Project2