Giter VIP home page Giter VIP logo

data512_project's Introduction

Created: 12/4/2019
DATA 512 Human Centered Data Science
Peter Meleney

An Introduction to Statistical Learning: Applications in Python

Introduction

Objective

The objective of these notebooks is to give aspiring data scientists, who are familiar with python 3.x, but not R, access to the laboratory sections of the book "An introduction to Statistical Learning with Applications in R" (ISLR). This repository is meant to be an complement to and extention of JWarmhoven's excellent repository availabale here: https://github.com/JWarmenhoven/ISLR-python/.

Motivation

It is my opinion that ISLR is an excellent introduction to data science, and I have read it many times. However I have always been frustrated that it was written exclusively using the R language for examples and labs. In my research for this project I found that R was the predominate language of data science professionals when the book was written in 2013 (see figure 1), however that changed rapidly between 2013 and today (2019) when python is the most prevelant language among data science professionals (see figure 2).

image of data science prefs 2013

In 2013, R was the dominate language among data science professionals. In this study conducted by KD Nuggets, which asked the question: "What programming/statistices languages you used for an analytics / data mining / data science work in 2013?," among the 713 respondents to that question, over 60% of them used R, and only 38.8% of them used python.

image of data science prefs 2018

However by 2018, a much larger study conducted by Kaggle which asked "What specific programming language do you use most often?" among the 23,859 respondents 54% responded that Python was the language they used most, followed by R at only 13%. This may explain why the authors originally chose R as the consensus language among data professionals in 2013, but the evolution of the field towards use of python strongly motivates my work in its translation to Python.

Justifications

Jupyter Notebook

Jupyter Notebooks provide an excellent platform for creating the literate, interactive programming documents that I wish to create for this project. Many data science professionals as familiar with jupyter notebooks, and I am fluent in their use, making them an excellent choice for this project.

Python 3.7

Though Python 3.8.0 was officially released on October 14, 2019, it is not yet supported by Anaconda, which is the distribution of Python I use to manage environments. I do not expect that any lack of the expansions in the lexecon of Python will inhibit our exploration of these introductory ideas.

Scikit-Learn

Whenever possible we will use the scikit-learn version of machine learning models. Scikit-learn is an excellent data science module available as a free, opensource, and commercially usable set of tools for python. It is what I originally learned on and I think that any aspiring data scientist should be introduced to its wonderful set of tools. We will be using the latest version of scikit-learn, version 0.22 - more information is available here: https://scikit-learn.org/stable/.

Scripting Coding Style

There are several stylistic options to choose from when writing python code. In heigherarctical order there is scripting style, functional style, and object-oriented style. For this project I will use the scripting style of programming because it is most condusive to working in jupyter notebooks - e.g. functions don't need to be speficied and then run, the script can be run directly in the interactive layer. I anticipate using functions sparsely throughout the project, mostly to facilitate data munging.

Data

Detailed data exploration will be left to the individual chapters, as extensive exploration here would be too long, and lead to the reader flipping back and forth between this README document and the jupyter notebook. Here I will note that all the data, with the exception of the "Boston" data set, has been released by the authors of ISLR under the GPL-2 license for reuse (https://cran.r-project.org/web/packages/ISLR/index.html). . The “Boston” dataset is used widely as a benchmark throughout data science and computer science, and is made available through the University of Toronto Computer Science Department at https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html. The data was originally collected by the U.S. Census Service, and so is in the public domain.

Name Description
Boston Housing values and other information about Boston suburbs.
Hitters Records and salaries for baseball players.
Smarket Daily percentage returns for S&P 500 over a 5-year period.

data512_project's People

Contributors

pmeleney avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.