artfeateng's Introduction

Case studies for The Art of Feature Engineering

The code behind these case studies is intended as a communication tool for the ideas expressed in the book. This code is as far from production code as it can get. Moreover, many of the techniques presented are already available in the underlying libraries used. The intent is to show a step-by-step implementation of the different feature engineering techniques.

See the companion website for details about the book: http://artoffeatureengineering.com/.

The Python style is also intentionally kept simple for people coming from other languages who plan to use the ideas described in the book outside of Python.

As described in the book, the case studies explore as many feature engineering ideas as possible within the following limits:

  • At most one day of execution time per notebook.
  • No GPU required.
  • Minimal dependencies.
  • At most 8 GB of RAM.

As a result of these constraints, these notebooks do not undergo as much hyperparameter tuning as necessary. This is a shortcoming of these case studies. Keep it in mind if you want to follow a similar path with your own experiments.

Minor issues:

As these case studies are foremost an educational tool, I expect readers might want to try variants of some cells in isolation. To help with that, I have tried to make the cells executable without having to re-run the whole notebook. That means most cells read everything they need from disk and write all their results back to disk. This is unnecessary in normal notebooks, where the values remain in memory, so the code for each cell might look long and somewhat unusual. In a sense, each cell tries to be a separate Python program. To support this kind of independent tweaking, I am also distributing these intermediate files alongside the input data.
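As a rough illustration of the "each cell is a separate program" idea, here is a minimal, Pandas-free sketch. It is not taken from the notebooks: the file names, the `population` column and the helper functions are all made up for the example. The cell reads its input from disk, derives one feature and writes the result back to disk.

```python
import csv
import math
import os
import tempfile

def write_rows(path, header, rows):
    """Write a header and rows to a tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)

def read_rows(path):
    """Read a tab-separated file and return (header, rows)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        return header, [row for row in reader]

def cell_add_log_feature(in_path, out_path, column):
    """One self-contained 'cell': read from disk, add log(1 + x), write to disk."""
    header, rows = read_rows(in_path)
    idx = header.index(column)
    new_header = header + ["log_" + column]
    new_rows = [row + [repr(math.log1p(float(row[idx])))] for row in rows]
    write_rows(out_path, new_header, new_rows)
    return len(new_rows)

# Stand-in for a file that an earlier cell would have left on disk (made-up data).
workdir = tempfile.mkdtemp()
stage1 = os.path.join(workdir, "stage1.tsv")
stage2 = os.path.join(workdir, "stage2.tsv")
write_rows(stage1, ["name", "population"], [["a", "0"], ["b", "99"]])

# The cell itself only needs the two paths, so it can be re-run in isolation.
written = cell_add_log_feature(stage1, stage2, "population")
```

Because the function depends only on files, changing the transformation and re-running just this piece does not require any earlier state to be in memory.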

I dislike Pandas with a passion and discourage its use at any level. These notebooks are Pandas-free, which might seem unusual to some.

The last topic in the last chapter (recommendation as imputation) uses more than 8 GB of RAM.

Fetching the data

The data is available as both Zip and Tar BZip2 files. Chapter 9 (images) uses a tile set provided by NASA. It contains 88 thousand tiles occupying 6 GB of space. These tiles are used at the beginning of Chapter 9's notebook to generate 80 thousand boxes, one around each city or town. These boxes occupy less than 1 GB of disk space. As such, I am distributing the boxes and leaving the tiles as a separate download, in case you want to experiment with other techniques that extract more data from the original tiles. Otherwise, the feature engineering techniques in Chapter 9's notebook should run fine from the extracted boxes.
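The notebook's actual box extraction is not reproduced here. The following is only a sketch of the general idea. It assumes that a tile is a 2D grid of pixel values and that the pixel position of a city within the tile is already known. The function name `extract_box` and the clamping at tile borders are assumptions made for this example.

```python
def extract_box(tile, center_row, center_col, half_size):
    """Return a square box of side 2*half_size+1 around a center pixel.

    The box is shifted inward when the center is too close to a border,
    so the result always has the same size (an assumption of this sketch).
    """
    n_rows = len(tile)
    n_cols = len(tile[0])
    side = 2 * half_size + 1
    if side > n_rows or side > n_cols:
        raise ValueError("box larger than tile")
    top = min(max(center_row - half_size, 0), n_rows - side)
    left = min(max(center_col - half_size, 0), n_cols - side)
    return [row[left:left + side] for row in tile[top:top + side]]

# Toy 5x6 tile whose pixel value encodes its position (row * 10 + column).
tile = [[r * 10 + c for c in range(6)] for r in range(5)]
box = extract_box(tile, 2, 3, 1)     # box fully inside the tile
corner = extract_box(tile, 0, 0, 1)  # box clamped at the top-left corner
```

Storing only such boxes, rather than the full tiles, is what keeps the distributed data small.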

Set up a virtual environment

python3 -m venv feateng
source ./feateng/bin/activate

Install local dependencies for graphviz

sudo apt install python-pydot python-pydot-ng graphviz

General dependencies

pip3 install jupyter
pip3 install ipykernel
pip3 install scikit-learn
pip3 install lxml
pip3 install numpy
pip3 install scipy
pip3 install matplotlib
pip3 install graphviz

Dependencies for Chapter 7

pip3 install statsmodels

Dependencies for Chapter 8

pip3 install stemming
pip3 install gensim

Dependencies for Chapter 9

pip3 install opencv-python

Dependencies for Chapter 10

pip3 install geopy

Launch Jupyter with the feateng environment

python -m ipykernel install --user --name feateng
jupyter notebook --no-browser .
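Once the kernel is registered, a quick check along these lines can be run in a notebook cell to see whether the packages listed in the dependency sections are visible from the `feateng` environment. This is a sketch, not part of the repository. The import names are assumptions based on the usual mapping (for instance, scikit-learn imports as `sklearn` and opencv-python as `cv2`), and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Import names assumed for the packages installed above
# (scikit-learn imports as sklearn, opencv-python as cv2).
REQUIRED_MODULES = [
    "ipykernel", "sklearn", "lxml", "numpy", "scipy", "matplotlib",
    "graphviz", "statsmodels", "stemming", "gensim", "cv2", "geopy",
]

def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

The check only locates the modules without importing them, so it stays fast even for the heavier libraries.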

Extras

The folder tourism contains a case study for the feature engineering chapter in the book Applied Data Science in Tourism: Interdisciplinary Approaches, Methodologies and Applications. It uses pyspark to solve an Airbnb price prediction task.

An extension and improvement of the case studies in Chapter 10 is available in the repository for the RIIAA'20 Workshop "Feature Engineering for Spatial and Temporal Data".
