

Introduction to Data Science

Course Overview:

This course is about learning from data, in order to gain useful predictions and insights. Separating signal from noise presents many computational and inferential challenges, which we approach from a perspective at the interface of computer science and statistics. Through real-world examples of wide interest, we introduce methods for five key facets of an investigation:

● data munging/scraping/sampling/cleaning in order to get an informative, manageable data set

● data storage and management in order to be able to access data - especially big data - quickly and reliably during subsequent analysis

● exploratory data analysis to generate hypotheses and intuition about the data

● prediction based on statistical tools such as regression, classification, and clustering, and

● communication of results through visualization, stories, and interpretable summaries.

Learning Outcomes:

After successful completion of this course, you will be able to:

● Use Python and other tools to scrape, clean, and process data

● Use data management techniques to store data locally and in cloud infrastructures

● Use statistical methods and visualization to quickly explore data

● Apply statistics and computational analysis to make predictions based on data

● Apply basic computer science concepts such as modularity, abstraction, and encapsulation to data analysis problems

Week 1

  1. Introduction to Data Science (What is Data Science and Data Science Process, What are Data, Data Sources, Types of Data, Data Formats, Messy Data)
  2. Data Exploration (Basics of Sampling, Biases in Sampling, Measures of Centrality, Measures of Spread, Principles of Visualizations, Histograms, Scatter plots, Pie Charts, Stacked Area Graphs, Box Plots, 3D data)
  3. Data Engineering (Tabular Data, Pandas and Scraping)
  4. Exploratory Data Analysis and Effective Data Visualizations
  5. Code Camp-1 (Web Scraping and EDA: Scraping Wikipedia Billboard for Top 100: Homework 1) https://canvas.harvard.edu/courses/29726/assignments/167915
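
As an illustration of the measures of centrality and spread listed under Data Exploration, here is a minimal sketch using only Python's standard library. The `summarize` helper and the sample values are hypothetical and are not taken from the course materials.

```python
import statistics


def summarize(values):
    """Return basic measures of centrality and spread for a list of numbers."""
    ordered = sorted(values)
    # Quartile cut points (default "exclusive" method of statistics.quantiles)
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "iqr": q3 - q1,                      # interquartile range
        "range": ordered[-1] - ordered[0],
    }


# Hypothetical sample containing one large outlier
sample = [2, 4, 4, 5, 7, 9, 40]
summary = summarize(sample)
print(summary)
```

In this sample the outlier pulls the mean well above the median, which is one reason to look at both a measure of centrality and a measure of spread before modeling.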

Week 2

  1. Introduction to Regression (Statistical modeling, predicting a variable, Regression Vs Classification, Error, Loss functions, Line of Best Fit, K nearest neighbours)
  2. Linear Regression (Linear Regression, Comparing Models, Evaluating Model, Model Uncertainty, Bootstrapping for estimating sampling error, Model Fitness R^2, Training Vs Testing sets)
  3. Multiple Linear Regression - 1 (Multiple Linear Regression, Evaluating Significance of Predictors, Hypothesis testing, R^2, Information Criteria, AIC/BIC)
  4. Multiple Linear Regression - 2 (Comparing parametric and nonparametric models, Multiple Linear Regression with Interaction Terms, Polynomial Regression)
  5. Code Camp-2 (Multiple Linear Regression: Forecasting Bike Sharing Usage: Homework 2) https://canvas.harvard.edu/courses/29726/assignments/172398
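
The line of best fit and the R^2 measure of model fitness named in this week's topics can be sketched as follows. This is a plain least-squares fit for a single predictor, written without external libraries; the function names and the data points are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data, roughly following y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.4f}")
```

Computing R^2 on the same points used for fitting, as done here, tends to be optimistic; evaluating on a held-out testing set is the more honest check.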

Week 3

  1. Model Selection (overfitting, model selection, variable selection, Cross Validation)
  2. Regularization (Bias Vs Variance, Regularization: LASSO and Ridge)
  3. PCA (High Dimensionality, Dimensionality Reduction, PCA, Using PCA for Regression)
  4. Visualization for Communication (Visualization Goals, Effective Visualizations, Tools for interactive graphics, Structure of Communicative Graphics, Application to modeling)
  5. Code Camp-3 (Model Selection, Regularization, PCA: Forecasting Bike Sharing Usage: Homework 3) https://canvas.harvard.edu/courses/29726/assignments/172400
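
To make the cross-validation topic concrete, the sketch below shows one simple way to split sample indices into k folds. The `k_fold_indices` helper is hypothetical; it uses contiguous, unshuffled folds, whereas real workflows usually shuffle the data first or rely on a library routine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


# Hypothetical example: 10 samples split into 3 folds
folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample lands in exactly one validation fold, so averaging the validation error over the folds uses every observation once for evaluation.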

Week 4

  1. Classification, Logistic Regression - 1 (Classification, Binary Response & Logistic Regression, Bayes classifier, Model Diagnostics in Logistic Regression, Multiple Logistic Regression, Classification Boundaries)
  2. Classification, Logistic Regression - 2 (Multiple Logistic Regression, Classification Boundaries, ROC curve, k-NN for Classification, Choice of k, k-NN with Multiple Predictors)
  3. Missing Data (Dealing with Missing Data, Naively handling missingness, Types of Missingness, Sources of Missingness, Imputation Methods, Handling missing data)
  4. Decision Trees (Geometry of Data, Interpretable Models, Decision Trees, Numerical vs Categorical Attributes, Splitting Criteria, Gini Index, Information Theory, Entropy, Stopping Conditions & Pruning)
  5. Code Camp-4 (Logistic Regression, ROC, Data Imputation: Automated Breast Cancer Detection: Homework 4) https://canvas.harvard.edu/courses/29726/assignments/175288
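
The Gini index and entropy splitting criteria listed under Decision Trees can be sketched as below. Both functions measure the impurity of the class labels at a single node; the function names and label lists are illustrative assumptions, not course code.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


# Hypothetical nodes: one evenly mixed, one pure
mixed = ["benign", "benign", "malignant", "malignant"]
pure = ["benign", "benign", "benign"]

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a tree prefers splits that reduce them.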

Contributors

kvmuralikrishna1993
