Wikipedia Pageview Forecasting

A machine learning project for CS229 Fall '19.

Quickstart

Check out the repository

git clone --recursive https://github.com/acmiyaguchi/cs229-f19-wiki-forecast.git

# if you have already checked out the repo and need to initialize submodules
git submodule update --init --recursive

Download the data

The data has been preprocessed into compact Parquet datasets using the epfl-ls2/sparkwiki project. The Wikipedia SQL dumps from 20190820 and pageviews from 2018-01-01 to 2019-09-01 have been processed.

Contact @acmiyaguchi for access to the cloud storage bucket. If you have been added to the project, the files may be downloaded via gsutil:

gsutil ls gs://wiki-forecast-data

Installing dependencies

This repository uses Pipenv for managing the relevant Python dependencies and installing Spark.

# install dependencies
pipenv sync --dev

# start the virtual environment
pipenv shell

Running the experiments

Topic-specific sample data is available directly in the repository for testing. To run the baseline on the sample data, run the following command at the project root.

python -m wikicast baseline

This section of the codebase is pure Python.

To run the full-scale experiment, download the data first. The following folders should exist under the data/ directory.

data/enwiki/pages/
data/enwiki/pagelinks/
data/enwiki/pagecount_daily_v2/
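
Before launching the full-scale run, it can help to confirm that the folders listed above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_data_dirs` and the default `data` root are assumptions made for illustration, and only the folder names come from the list above.

```python
from pathlib import Path

# Folders this README lists under data/ for the full-scale experiment.
EXPECTED_DIRS = [
    "enwiki/pages",
    "enwiki/pagelinks",
    "enwiki/pagecount_daily_v2",
]


def missing_data_dirs(data_root="data"):
    """Return the expected dataset folders that are absent under data_root.

    Hypothetical helper for illustration; it is not shipped with the project.
    """
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing dataset folders:")
        for d in missing:
            print(f"  data/{d}/")
    else:
        print("All expected dataset folders are present.")
```

Run it from the project root so that the relative `data` path resolves correctly.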

Run the following commands to run the experiments on a graph sampled randomly from articles with a connectivity constraint. The sampled graph typically contains around 35,000 articles.

# in bash on Linux or macOS
scripts/run-command baseline_random

# in PowerShell on Windows
scripts/run-command.ps1 baseline_random

The full-scale experiments require Java 1.8. Spark is installed from pip, and GraphFrames is installed from spark-packages.
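
Since the Java version matters here, a quick check before starting Spark may save a confusing failure later. The sketch below is an illustration rather than project code: the function names `parse_java_major` and `installed_java_major` are hypothetical, and it relies only on the fact that `java -version` prints a banner such as `java version "1.8.0_221"` to stderr.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_221" -> 8) and the newer
    scheme ("11.0.4" -> 11). Returns None if no version is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    if first == 1 and second is not None:
        # Legacy "1.x" numbering: the second component is the major version.
        return int(second)
    return first


def installed_java_major():
    """Run `java -version` and return the major version, or None if unavailable."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None
    # The JVM prints its version banner to stderr, not stdout.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major == 8:
        print("Java 1.8 found.")
    else:
        print(f"Expected Java 1.8, found major version: {major}")
```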

Running Jupyter with Spark and GraphFrames

Once in the shell, several Spark variables should be set to keep the working environment consistent across machines. For convenience, run one of the following commands.

# in bash on Linux or macOS
scripts/start-jupyter

# in PowerShell on Windows
scripts/start-jupyter.ps1
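
The contents of these scripts are not reproduced here, so the following is only a sketch of the kind of setup they might perform, not a description of what they actually do. It uses the standard PySpark variables `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` together with the `pyspark --packages` flag. The function name `jupyter_launch_spec` is hypothetical, and the GraphFrames coordinate is an assumption that should be matched to the installed Spark and Scala versions.

```python
import os

# Assumed GraphFrames coordinate (Spark 2.4, Scala 2.11); adjust it to
# match the Spark version that Pipenv installed.
GRAPHFRAMES_PACKAGE = "graphframes:graphframes:0.7.0-spark2.4-s_2.11"


def jupyter_launch_spec(base_env=None, package=GRAPHFRAMES_PACKAGE):
    """Build the command and environment for starting Jupyter through pyspark.

    Hypothetical helper: the project's own start-jupyter scripts may set
    different or additional variables.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Tell the pyspark launcher to use Jupyter as the driver front end.
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    # Ask Spark to resolve GraphFrames from spark-packages at startup.
    command = ["pyspark", "--packages", package]
    return command, env


if __name__ == "__main__":
    command, env = jupyter_launch_spec()
    print("Would run:", " ".join(command))
    print("PYSPARK_DRIVER_PYTHON =", env["PYSPARK_DRIVER_PYTHON"])
    print("PYSPARK_DRIVER_PYTHON_OPTS =", env["PYSPARK_DRIVER_PYTHON_OPTS"])
```

To actually launch the notebook from inside `pipenv shell`, the returned pair could be passed to `subprocess.run(command, env=env)`.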

Contributors

acmiyaguchi, shaonc
