This Hadoop project finds all the links in Wikipedia.
It parses the full 100 GB Wikipedia XML dump.
A sample Wikipedia dump is available in the /wikidump folder. Unzip it and import the XML into Hadoop:
$ cd wikidump
$ unzip wikidump_sample.xml.zip
$ hadoop fs -mkdir hadoop_sample
$ hadoop fs -copyFromLocal wikidump_sample.xml hadoop_sample/
$ cd ../
$ mvn clean install
$ hadoop jar today/target/today-1-jar-with-dependencies.jar 'hadoop_sample' 'hadoop_sample_result'
$ hadoop fs -copyToLocal hadoop_sample_result hadoop_sample_result
$ vim hadoop_sample_result/part-r-00000
| PageName      | TotalLink | LinkType\|LinkPage         | LinkType\|LinkPage          | LinkType\|LinkPage        |
|---------------|-----------|----------------------------|-----------------------------|---------------------------|
| Data_register | 1         | 0\|Atomic_semantics        |                             |                           |
| Demographics  | 3         | 0\|Demographics_of_Armenia | 0\|Demographics_of_American | 0\|Demographics_of_Angola |
| Kevin_Gilbert | 1         | 1\|Talk%3AAutoerotic       |                             |                           |
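The link extraction behind this output can be sketched as follows. This is not the project's actual mapper code, just a minimal illustration of the parsing idea: internal wiki links appear in page wikitext as `[[Target]]` or `[[Target|label]]`, and page names use underscores instead of spaces. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the map-side link extraction (not the project's
// actual mapper): pull every [[Target]] or [[Target|label]] out of wikitext.
public class LinkExtractorSketch {
    // Group 1 captures the link target, stopping at '|' or ']'.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    static List<String> extractLinks(String wikitext) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            // Wikipedia page names use underscores in place of spaces.
            links.add(m.group(1).trim().replace(' ', '_'));
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "See [[Atomic semantics]] and [[Talk:Autoerotic|talk]].";
        for (String link : extractLinks(text)) {
            System.out.println(link);
        }
    }
}
```

In the real job, a mapper would emit each extracted link keyed by the page name, and a reducer would aggregate them into the per-page rows shown above.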
All LinkType values are listed on Wikipedia.
Developed by Martin Magakian [email protected]
For doduck prototype
For doduck prototype (site in French)
MIT License (MIT)