Implement a watered-down version of the PageRank algorithm, using parsed data from a sample of Wikipedia's edit history.
Elizabeth Konstantin Kwek Jin Li : 2287563K
Ng Kai Feng Bernard : 2287600N
University of Glasgow Computing Science Level 4
Choose where to access data files and execute jobs using the following:
sudo select-cluster [local | idup | bd]
Then from within this directory, run the following commands on the terminal:
mvn clean package
export HADOOP_CLASSPATH="$PWD/target/uog-bigdata-0.0.1-SNAPSHOT.jar"
hadoop ax.PageRank [input path] [output folder name] [rounds]
The process is split into two main jobs, Init
and Iter
. Init
extracts the relevant data and pairs each outbound link with the calling article. Iter
then takes these Key-Value pairs and calculates the page rank of every mentioned article. This is done in as many iterations as specified by the integer [rounds]
in the command:
hadoop ax.PageRank [input path] [output folder name] [rounds]
If [rounds]
is not specified, the default 2
is used.
REVISION 12 30364918 Anarchism 2005-12-06T17:44:47Z RJII 141644
CATEGORY Social_philosophy Political_theories Forms_of_government Anarchism Political_ideology_entry_points
IMAGE Zerzan.jpeg Goldman-4.jpg Kropotkin.jpg BenjaminTucker.jpg Blkred_flag.png Bakuninfull.jpg JohannMost.jpg Murray_Rothbard_Smile.JPG LeoTolstoy.jpg Proudhon-young.jpg
MAIN 467_BC Seattle,_Washington Edward_Abbey Félix_Guattari Head_of_state John_Brown_(abolitionist) Environmental_movement Auroville Peter_Kropotkin Peter_Kropotkin Peter_Kropotkin Peter_Kropotkin Public_key_cryptography Autonomism
TALK
USER
USER_TALK
OTHER Eu:Anarkismo Ar:لاسلطوية No:Anarkisme Sl:Anarhizem Is:Stjórnleysisstefna Fa:دولتزدائی Lt:Anarchizmas Gl:Anarquismo
EXTERNAL http://dwardmac.pitzer.edu/Anarchist_Archives/bright/berkman/comanarchism/whatis_toc.html http://www.retailworkers.org/ http://www.infoshop.org/inews/article.php?story=05/01/06/7640998 http://www.barefootsworld.net/nockoets0.html
TEMPLATE Mergefrom Anarchism
COMMENT /* The first self-labeled anarchist */ small clarification on expropriation
MINOR 0
TEXTDATA 11530
Of interest to this project are the lines REVISION
and MAIN
.
In REVISION
, the third item article_title
is used to uniquely identify relevant articles, and is used as Keys in the InitMapper during initialization. All revisions are considered in this implementation.
MAIN
provides the article_title
s of other pages linked to (i.e., outlinks) by the current article. All references are mapped and passed on to InitReducer, which then filters out repeated data, and assigns each article an initial PageRank score (more on this below).
Input:
K: [Line Number]
V: [Revision Record - 14lines, including 1 empty line at the end]Output:
K:article_title
V:outlink_title
Output: (if no data in MAIN)
K:article_title
V: empty
Input:
K:article_title
V:outlink_title
(or empty)Output:
K:article_title
V:PR outlink1,outlink2,....
PR(u)= (1 - DF) + DF * Sum(PR(v)/L(v))
PR(v)/L(v)
represents the contribution of an article v
that links to the page u
. The contributions of ALL pages that reference to u
are totalled and multiplied by the Damping Factor (DF
), which in this project is set to 0.85
. The result is added to 1 - DF
to get the PageRank of u
.
This formula is applied in rounds, with all pages initially assigned a PageRank score of 1.0
(INIT_RANK
).
The values of INIT_RANK
and DF
are set in the main PageRank class and sent to the Init
and Iter
reducers respectively to be used in the calculation of each page's the PageRank score.
To run this step in repeated rounds, the IterMapper produces two kind of output:
A - preserves the list of outlinks from each page
B - sends the relevant numbers needed for calculating PageRank
The IterReducer then uses Output B
to calculate the PageRank score of an article, and outputs that score in the same format as the input of the Mapper.
After each step, the main PageRank class reassigns the input and output paths and starts a new Job running the same task.
Input:
K: [Line Number]
V:article_title PR outlink1,outlink2,....
Output A:
K:article_title
V:LINK_PREFIX outlink1,outlink2,....
Output B:
K:linkX
V:PR count(outlinks from title)
Notes: LINK_PREFIX
is an arbitrary string to help IterReducer differentiate the 2 output types. linkX
refers to an outlink from article_title
. Every outlink will have its own Key-Value pair.
Input A:
K:article_title
V:LINK_PREFIX outlink1,outlink2,....
Input B:
K:linkX
V:PR
count(outlinks from title)Output:
K:article_title
V:PR outlink1,outlink2,....
Output: (LAST ITERATION)
K:article_title
V:PR
None. Only the package name in the hadoop
command is changed, from mapreduce
to ax
. The files in the mapreduce
directory are removed.