wiki-download-parse-page-views's Introduction

Download, Parse, Aggregate Wikipedia Page View Dumps

Pipeline for downloading, parsing and aggregating static page view dumps from Wikipedia.

How does it work?

If you need the annual number of page views for specific Wikipedia pages before 2015, you unfortunately cannot rely on the API (at least not at the time of writing), as it only gives access to newer records (post 2015). However, a collection of static dumps is available.

This pipeline was made in order to:

  1. Fetch the names of all files to be downloaded
  2. Download the needed files (parallelized)
  3. Parse them after downloading (parallelized)
  4. Aggregate the files for each year in order to get the annual number of views for selected pages

The following scripts need to be run in this order:

  1. fetch_file_names.py
  2. downloader.py
  3. parser.py
  4. group_by.py
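
The run order above can be sketched as a small driver. This is only an illustration, not part of the repository: the function names, the argument names and all paths are hypothetical, and the aggregation script name is a parameter because this document refers to it as both group_by.py and groupby.py.

```python
import subprocess
import sys


def build_pipeline_commands(year_start, year_end, output_dir, file_csv, path_save,
                            page_names_file, save_dir, project_name, result_csv,
                            num_threads=3, aggregate_script="group_by.py"):
    """Return the four command lines of the pipeline, in run order.

    All paths are placeholders supplied by the caller (hypothetical names).
    """
    py = sys.executable
    threads = str(num_threads)
    return [
        [py, "fetch_file_names.py", str(year_start), str(year_end), output_dir],
        [py, "downloader.py", file_csv, path_save, threads],
        [py, "parser.py", page_names_file, path_save, save_dir, project_name, threads],
        [py, aggregate_script, save_dir, result_csv],
    ]


def run_pipeline(commands):
    """Run each step in turn and stop at the first failing script."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_pipeline(build_pipeline_commands(...)) from the repository directory would then execute the steps one after another.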

Fetching file names and URLs

First, we need to get the names of the files we want to download. A set of files is available for every year, so it is also good to specify which years we are interested in.

fetch_file_names.py

The script generates a CSV file containing the file names, the file sizes and the URLs from which the files should be downloaded. Script parameters:

  • year_start - first year to be downloaded
  • year_end - last year to be downloaded (all years in between are downloaded)
  • output_dir - directory where files for each year will be stored

python fetch_file_names.py [year_start] [year_end] [output_dir]

Output file

file size url
pagecounts-20140101-000000.gz 82 https://..
pagecounts-20140201-000000.gz 81 https://..
... ... ...
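
Since the later aggregation is done per year, it can help to recover the timestamp from each file name. The sketch below assumes the pattern pagecounts-YYYYMMDD-HHMMSS.gz seen in the sample rows above; the helper names dump_timestamp and group_by_year are hypothetical and are not functions of the repository.

```python
from collections import defaultdict
from datetime import datetime


def dump_timestamp(file_name):
    """Parse the date and hour encoded in a dump file name.

    Assumes the 'pagecounts-YYYYMMDD-HHMMSS.gz' pattern from the sample output.
    """
    stem = file_name.rsplit(".", 1)[0]          # drop the '.gz' extension
    _prefix, date_part, time_part = stem.split("-")
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")


def group_by_year(file_names):
    """Map each year to the list of dump files that belong to it."""
    by_year = defaultdict(list)
    for name in file_names:
        by_year[dump_timestamp(name).year].append(name)
    return dict(by_year)
```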

Downloading files

Now that we have the file names and URLs, we can download the files!

downloader.py

This script concurrently downloads the Wikipedia pagecount dumps (gzip). The previously generated file.csv contains the list of URLs for these files. path_save is the directory where the files should be downloaded, and thread_number is the number of download threads.

python downloader.py [file.csv] [path_save] [thread_number]

NOTE: THE SERVER CURRENTLY BLOCKS DOWNLOADS WHEN MORE THAN 3 THREADS ARE USED
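
A minimal sketch of a thread-capped download, assuming the (file name, URL) pairs have already been read from file.csv. It is not the code of downloader.py: download_all, fetch_to_file and MAX_THREADS are hypothetical names, and the cap of 3 simply mirrors the note above.

```python
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 3  # assumption taken from the note above about server blocking


def fetch_to_file(url, dest_path):
    """Stream one URL to disk without loading the whole dump into memory."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


def download_all(rows, path_save, thread_number=MAX_THREADS, fetch=fetch_to_file):
    """Download (file_name, url) pairs with at most MAX_THREADS workers.

    `fetch` can be swapped out, for example to test without network access.
    """
    workers = max(1, min(thread_number, MAX_THREADS))
    os.makedirs(path_save, exist_ok=True)

    def job(row):
        file_name, url = row
        return fetch(url, os.path.join(path_save, file_name))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input rows
        return list(pool.map(job, rows))
```

Clamping the worker count inside the function means that a larger thread_number cannot trigger the blocking by accident.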

Parsing files

As the files contain information on every Wikipedia page that was accessed within the hour specified in the file name, we should filter out the pages that we do not need.

Input file

For parsing, a CSV file containing Wikipedia page names has to be provided in the following format:

names_u names_q
Barack_Obama Barack_Obama
René_Konen Ren%C3%A9_Konen
Zoran_Đinđić Zoran_%C4%90in%C4%91i%C4%87
... ...

The column names_u holds the unquoted (plain UTF-8) representation of the page name. The dump files, however, use a different encoding (URL percent-encoding), so we also need names_q, which is the 'quoted' representation. Both quoting and unquoting can be done with urllib.
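
As an illustration, the two columns can be derived from each other with urllib.parse.quote and urllib.parse.unquote from the Python 3 standard library. The helper make_name_rows is a hypothetical name, and whether the dumps quote every character exactly as quote does by default is an assumption.

```python
from urllib.parse import quote, unquote


def make_name_rows(page_names):
    """Build (names_u, names_q) pairs from plain UTF-8 page names."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters;
    # letters, digits and '_' are left unchanged.
    return [(name, quote(name)) for name in page_names]


rows = make_name_rows(["Barack_Obama", "René_Konen", "Zoran_Đinđić"])
for names_u, names_q in rows:
    # unquote() reverses the encoding, so the round trip is lossless
    assert unquote(names_q) == names_u
    print(names_u, names_q)
```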

parser.py

Opens the files in files_dir, filters them by the names in page_names_file and by project_name ("en" for English Wikipedia, "de" for German, etc.), and saves the filtered files in save_dir using num_threads threads.

python parser.py [page_names_file] [files_dir] [save_dir] [project_name] [num_threads]
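
The filtering step could look roughly like the sketch below. It is not the code of parser.py: filter_lines and filter_dump are hypothetical names, and the sketch assumes that each dump line has four space-separated fields (project code, page title, view count, response size), which is how the pagecounts dumps are commonly laid out.

```python
import gzip


def filter_lines(lines, project_name, wanted_names_q):
    """Yield (page_title, views) for wanted pages of one project.

    Assumes lines of the form: '<project> <page_title> <views> <bytes>'.
    """
    wanted = set(wanted_names_q)  # set lookup keeps the per-line check cheap
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip malformed lines instead of failing the whole file
        project, page, views, _size = parts
        if project == project_name and page in wanted:
            yield page, int(views)


def filter_dump(path, project_name, wanted_names_q):
    """Apply filter_lines to one gzip-compressed dump file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return list(filter_lines(handle, project_name, wanted_names_q))
```

The page titles passed in should be the quoted names (names_q), since that is the representation used inside the dumps.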

Getting the aggregated pageviews

After parsing the files, it is time to aggregate the page views!

Loads the files from file_dir as pandas dataframes, concatenates them, performs the aggregation and saves the result as a CSV at save_path.

python groupby.py [file_dir] [save_path] 

Output file

names_u names_q views
Barack_Obama Barack_Obama 3562998
René_Konen Ren%C3%A9_Konen 156456
Zoran_Đinđić Zoran_%C4%90in%C4%91i%C4%87 96846
... ... ...
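
The concatenate-and-aggregate step can be sketched with pandas as follows. This is only an illustration under the assumption that every parsed file yields a dataframe with the columns names_u, names_q and views; the function name aggregate_views is hypothetical and the actual script may differ.

```python
import pandas as pd


def aggregate_views(frames):
    """Concatenate per-file dataframes and sum the views per page."""
    combined = pd.concat(frames, ignore_index=True)
    totals = combined.groupby(["names_u", "names_q"], as_index=False)["views"].sum()
    # Most viewed pages first, as in the sample output above
    return totals.sort_values("views", ascending=False).reset_index(drop=True)


# Two tiny in-memory frames standing in for two parsed hourly files
hour_1 = pd.DataFrame({"names_u": ["Barack_Obama", "René_Konen"],
                       "names_q": ["Barack_Obama", "Ren%C3%A9_Konen"],
                       "views": [10, 2]})
hour_2 = pd.DataFrame({"names_u": ["Barack_Obama"],
                       "names_q": ["Barack_Obama"],
                       "views": [5]})
result = aggregate_views([hour_1, hour_2])
```

Writing the result with result.to_csv(save_path, index=False) would produce a file shaped like the output shown above.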

Dependencies

#todo requirements.txt
