Zillow-Kaggle

This repo tackles the first round of Zillow’s Home Value Prediction Competition, which challenges competitors to predict the log error between the Zestimate and the actual sale price of each house. Submissions are evaluated on the Mean Absolute Error between the predicted log error and the actual log error. The competition ran on Kaggle from May 2017 to October 2017, and the final private leaderboard was revealed after the evaluation period ended in January 2018.
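As a minimal sketch of the target and the metric described above: the competition defines the log error as log(Zestimate) minus log(SalePrice), and scores predictions by Mean Absolute Error. The function names and the dollar figures below are made up for illustration and are not taken from this repo.

```python
import numpy as np


def log_error(zestimate, sale_price):
    """Return log(Zestimate) - log(SalePrice), element-wise."""
    zestimate = np.asarray(zestimate, dtype=float)
    sale_price = np.asarray(sale_price, dtype=float)
    return np.log(zestimate) - np.log(sale_price)


def mean_absolute_error(predicted, actual):
    """Return the mean of |predicted - actual|."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))


# Illustrative values only (not competition data).
zestimate = [310_000.0, 495_000.0, 720_000.0]
sale_price = [300_000.0, 500_000.0, 700_000.0]

actual = log_error(zestimate, sale_price)

# A trivial baseline that always predicts a log error of zero.
predicted = np.zeros_like(actual)
score = mean_absolute_error(predicted, actual)
print(f"MAE of the all-zero baseline: {score:.6f}")
```

A positive log error means the Zestimate was above the sale price, and a negative one means it was below.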

The score of the stacked model in this repo would have ranked in the top 100 and qualified for the private second round, which involves building a home valuation algorithm from the ground up.

Author: Junjie Dong

Files

  • explanatory.ipynb performs basic exploratory data analysis. The analysis is by no means comprehensive, since most of the ad-hoc data analysis was performed while extracting features and building models.
  • feature_extraction.ipynb cleans up the raw data, unifies data types and the representations of missing values, and extracts various kinds of features (e.g. interaction features, region-based aggregate features). The features are saved to HDF5 files that can be easily loaded by other notebooks for modeling.
  • model_lgb.ipynb builds LightGBM models on top of the extracted features.
  • model_catboost.ipynb builds CatBoost models (both single model and ensemble model) on top of the extracted features.
  • stack.py performs simple stacking by taking a linear combination of the predictions from LightGBM and CatBoost.
  • src/data_proc.py includes several helper methods for data cleaning and feature extraction.
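To make the two feature families mentioned for feature_extraction.ipynb concrete, here is a small pandas sketch of an interaction feature and a region-based aggregate feature. The column names (`region_id`, `living_area`, `lot_area`) and the values are hypothetical and are not the notebook's actual features.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; missing values are represented uniformly as NaN.
props = pd.DataFrame({
    "region_id": [1, 1, 2, 2, 2],
    "living_area": [1200.0, 1800.0, 900.0, np.nan, 1500.0],
    "lot_area": [4000.0, 6000.0, 3000.0, 5000.0, np.nan],
})

# Interaction feature: ratio of two raw columns (NaN propagates).
props["living_to_lot_ratio"] = props["living_area"] / props["lot_area"]

# Region-based aggregate: mean living area within each region,
# broadcast back to every row of that region (NaN values are skipped).
props["region_mean_living_area"] = (
    props.groupby("region_id")["living_area"].transform("mean")
)

# How a property compares with its region's average.
props["living_area_vs_region"] = (
    props["living_area"] / props["region_mean_living_area"]
)

print(props)
```

A table like this could then be persisted with `DataFrame.to_hdf`, which needs the optional PyTables package; that step is left out here to keep the sketch dependency-light.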

Models

Both the LightGBM and CatBoost models were tuned mostly on offline cross-validation, with no public leaderboard probing or overfitting. Only the weight for the final linear stacking was chosen based on public leaderboard scores. Outliers in the offline training and validation sets were handled carefully so that the offline validation method is as reliable as possible.
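The linear stacking step can be sketched as a weighted average of the two models' predictions. Note the assumption: the repo chose its weight from public leaderboard scores, whereas this sketch scans candidate weights against known targets on synthetic data as a stand-in. The helper names and numbers are hypothetical.

```python
import numpy as np


def blend(pred_a, pred_b, weight_a):
    """Linear combination: weight_a * pred_a + (1 - weight_a) * pred_b."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return weight_a * pred_a + (1.0 - weight_a) * pred_b


def mae(pred, actual):
    """Mean Absolute Error between two arrays."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(actual))))


def best_weight(pred_a, pred_b, actual, candidates):
    """Return the candidate weight with the lowest MAE, and that MAE."""
    scores = [mae(blend(pred_a, pred_b, w), actual) for w in candidates]
    best = int(np.argmin(scores))
    return float(candidates[best]), float(scores[best])


# Synthetic example: one model over-predicts, the other under-predicts.
actual = np.array([0.02, -0.01, 0.03, 0.00])
lgb_pred = actual + 0.01
cat_pred = actual - 0.01

weight, score = best_weight(lgb_pred, cat_pred, actual, np.linspace(0.0, 1.0, 11))
print(f"best LightGBM weight: {weight:.1f}, MAE: {score:.6f}")
```

Because only a single scalar weight is fitted, the risk of overfitting to whatever scores the weight is chosen on is small compared with tuning full models on them.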

The following table outlines the chosen models' performance on the hidden private test set.

| Model | Private Leaderboard Score | Private Leaderboard Ranking | Percentile (Top) |
|---|---|---|---|
| LightGBM (single) | 0.0752026 | 332 / 3779 | 8.8% |
| CatBoost (single) | 0.0751456 | 250 / 3779 | 6.6% |
| CatBoost (ensemble x8) | 0.0750750 | 147 / 3779 | 3.9% |
| Stack (LightGBM + CatBoost) | 0.0750213 | 95 / 3779 | 2.5% |

For this dataset, CatBoost with its default hyperparameters gave very strong performance and required almost no hyperparameter tuning. In comparison, LightGBM required much more hyperparameter tuning to achieve good performance.

Ensembling and stacking also turned out to be essential for climbing the leaderboard in this competition. I believe there is still some room for improvement just by tuning the LightGBM hyperparameters further and switching to a better stacking method (e.g. training a meta-model on top of the base models' predictions on a hold-out set).
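One possible shape for the meta-model idea, as a sketch only: fit a linear model with an intercept on the base models' hold-out predictions, then apply it to new predictions. This is not implemented in the repo. The data is synthetic, the function names are hypothetical, and ordinary least squares is used for simplicity even though it minimizes squared error rather than the competition's MAE.

```python
import numpy as np


def fit_meta_model(holdout_preds, holdout_target):
    """Least-squares fit of target ~ intercept + base model predictions."""
    X = np.column_stack([np.ones(len(holdout_target)), holdout_preds])
    coef, *_ = np.linalg.lstsq(X, holdout_target, rcond=None)
    return coef  # [intercept, weight_model_1, weight_model_2, ...]


def predict_meta_model(coef, base_preds):
    """Apply the fitted linear meta-model to base model predictions."""
    X = np.column_stack([np.ones(base_preds.shape[0]), base_preds])
    return X @ coef


# Synthetic hold-out set: two noisy "base models" around a common target.
rng = np.random.default_rng(0)
target = rng.normal(0.0, 0.05, size=200)
lgb_pred = target + rng.normal(0.0, 0.02, size=200)
cat_pred = target + rng.normal(0.0, 0.02, size=200)

holdout_preds = np.column_stack([lgb_pred, cat_pred])
coef = fit_meta_model(holdout_preds, target)

# In-sample check only; a real evaluation needs data the meta-model never saw.
stacked = predict_meta_model(coef, holdout_preds)
print("meta-model coefficients:", coef)
```

Unlike a single hand-picked weight, the weights here are not constrained to sum to one, and the intercept can absorb a shared bias in the base models.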
