
kdd2018_air_pollution_prediction

Introduction

KDD 2018 held a competition to predict, over a one-month period, the concentration of air pollutants in Beijing and London for the next 48 hours. This is the code of our team (NIPL Rises), which ranked 16th among 4000 teams in the last-10-days category, with everything executed on a single laptop. We used TensorFlow to build a hybrid model composed of CNNs, LSTMs, and MLPs for end-to-end prediction over 2D grid data, time series, and categorical data. This project includes (1) fetching and crawling weather and pollutant data from multiple sources, (2) data cleaning and integration, (3) visualization for insights, and (4) prediction of pollutants (PM2.5, PM10, O3) for the next 48 hours in Beijing and London.

Setup

  1. Download the CSV files from the KDD 2018 data repository (requires sign-up),
  2. Install the required packages, including tensorflow and keras (for deep learning) and selenium (for web crawling),
  3. Copy default.config.ini to config.ini,
  4. Download chromedriver.exe for web crawling, and set its path,
    CHROME_DRIVER_PATH = path to chromedriver.exe
    
  5. Set the paths of the downloaded data sets,
    BJ_AQ = Beijing air quality history
    BJ_AQ_REST = Beijing air quality history for February and March 2018
    BJ_AQ_STATIONS = Beijing air quality stations
    BJ_MEO = Beijing meteorological history
    BJ_GRID_DATA = Beijing grid weather history
    LD_* = same for London
    
  6. Set the paths where the fetched and cleaned data will be stored,
    BJ_AQ_LIVE = fetched Beijing air quality live data
    BJ_MEO_LIVE = fetched Beijing meteorology live data
    BJ_OBSERVED = cleaned Beijing observed air quality and meteorology time series
    BJ_OBSERVED_MISS = marked missing data in BJ_OBSERVED
    BJ_STATIONS = cleaned data of stations in Beijing
    BJ_GRID_LIVE = fetched grid of current weather in Beijing
    BJ_GRID_FORECAST = fetched grid of forecast weather in Beijing
    BJ_GRIDS = history of grid data in Beijing
    BJ_GRID_COARSE = coarsened grid of data to lower resolutions
    LD_* = same for London
    
  7. Set the [lower, upper] date bounds in the URLs,
    BJ_AQ_URL = */2018-06-05-0/2k0d1d8
    BJ_MEO_URL = */2018-06-05-0/2k0d1d8
    BJ_GRID_URL = */2018-06-05-0/2k0d1d8
    LD_*_URL = same for London
    
  8. Set the directories for generated features and models
    FEATURE_DIR = directory for extracted features
    MODEL_DIR = directory for generated models
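
Before running any script, it can help to check that config.ini is filled in. The following is a minimal sketch, not the project's own loader: it assumes config.ini is a standard INI file readable by Python's configparser, and the function names load_settings and missing_keys are hypothetical. Since the section layout of default.config.ini is not described here, the sketch simply merges the keys of all sections.

```python
import configparser
from typing import Dict, Iterable, List


def load_settings(path: str) -> Dict[str, str]:
    """Read an INI file and merge the keys of all sections into one dict."""
    # interpolation=None keeps '%' characters in URLs or paths literal.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep key names case-sensitive (configparser lower-cases them by default).
    parser.optionxform = str
    if not parser.read(path, encoding="utf-8"):
        raise FileNotFoundError(path)
    settings = dict(parser.defaults())
    for section in parser.sections():
        settings.update(parser.items(section))
    return settings


def missing_keys(settings: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or left empty."""
    return [key for key in required if not settings.get(key, "").strip()]


if __name__ == "__main__":
    # Assumption: only a few of the keys listed above are checked here.
    required = ["CHROME_DRIVER_PATH", "BJ_AQ", "BJ_MEO", "FEATURE_DIR", "MODEL_DIR"]
    try:
        print("Missing or empty keys:", missing_keys(load_settings("config.ini"), required))
    except FileNotFoundError:
        print("config.ini not found; copy default.config.ini first")
```

Extending the required list with the LD_* keys gives the same check for London.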
    

Execution

  1. Data pre-process
    1. Run src/preprocess/preprocess_all.py to create the cleaned data sets at the paths set in config.ini
  2. Data visualization
    1. Run the scripts in src/statistics to gain basic insights into value distributions, time series, and geographical positions,
    2. Change BJ_* to LD_* for London data
  3. Feature generation
    1. Go to the main method of src/feature_generators/hybrid_fg.py,
    2. Uncomment the desired (city-pollutant, sample rate) pairs in the cases variable (all pairs are eventually required). A higher sample rate produces more data,
    3. Run the script
  4. Model training
    1. Go to src/methods/lstm_pre_train.py,
    2. Run the script to pre-train simple LSTM models for all pollutants. These models are fed, unchanged, into the final model for better performance,
    3. Go to the main method of src/methods/hybrid.py,
    4. Uncomment the desired city-pollutant pairs,
    5. Run the script; the best model found so far is saved automatically
  5. Model testing
    1. Go to src/methods/model_tests.py,
    2. Uncomment the desired city-pollutant pairs and set the test interval in TEST_FROM and TEST_TO,
    3. Run the script; the SMAPE score is printed,
    4. Go to src/methods/model_investigate.py,
    5. Run the script to see the SMAPE score per station, sorted and visualized geographically
  6. Prediction
    1. Go to src/predict_next_48.py,
    2. Change timedelta if you wish to predict the previous 48 hours instead,
    3. Run the script
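
For readers unfamiliar with the SMAPE score reported in the testing step, here is a minimal sketch of the metric and of a per-station ranking. It assumes the common definition, the mean of 2|F - A| / (|A| + |F|), with a pair counted as zero error when both values are zero. The function names and the toy station names are hypothetical, and the handling of missing values in src/methods/model_tests.py may differ.

```python
from typing import Dict, List, Sequence, Tuple


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, in the range [0, 2]."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    total = 0.0
    for a, f in zip(actual, forecast):
        denominator = abs(a) + abs(f)
        # Assumption: a pair where both values are zero contributes no error.
        if denominator > 0:
            total += 2.0 * abs(f - a) / denominator
    return total / len(actual)


def smape_per_station(
    observed: Dict[str, Sequence[float]],
    predicted: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each station separately and sort from best (lowest) to worst."""
    scores = [
        (station, smape(observed[station], predicted[station]))
        for station in observed
    ]
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy values for illustration only, not competition data.
    observed = {"station_a": [10.0, 10.0], "station_b": [10.0, 10.0]}
    predicted = {"station_a": [30.0, 30.0], "station_b": [10.0, 10.0]}
    for station, score in smape_per_station(observed, predicted):
        print(station, round(score, 3))
```

Lower is better: a score of 0 means a perfect forecast, and 2 is the worst possible value under this definition.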

Toy Examples

  1. examples/gcforest includes basic examples of gcForest, which uses forests instead of neurons for deep learning, as proposed in the Deep Forest paper,
  2. examples/tensorflow includes basic examples of using TensorFlow

kdd2018_air_pollution_prediction's People

Contributors

pouyaesm
