This course is about learning from data, in order to gain useful predictions and insights. Separating signal from noise presents many computational and inferential challenges, which we approach from a perspective at the interface of computer science and statistics. Through real-world examples of wide interest, we introduce methods for five key facets of an investigation:
● data munging/scraping/sampling/cleaning in order to get an informative, manageable data set
● data storage and management in order to be able to access data - especially big data - quickly and reliably during subsequent analysis
● exploratory data analysis to generate hypotheses and intuition about the data
● prediction based on statistical tools such as regression, classification, and clustering; and
● communication of results through visualization, stories, and interpretable summaries.
● Use Python and other tools to scrape, clean, and process data
● Use data management techniques to store data locally and in cloud infrastructures
● Use statistical methods and visualization to quickly explore data
● Apply statistics and computational analysis to make predictions based on data
● Apply basic computer science concepts such as modularity, abstraction, and encapsulation to data analysis problems
- Introduction to Data Science (What is Data Science and Data Science Process, What are Data, Data Sources, Types of Data, Data Formats, Messy Data)
- Data Exploration (Basics of Sampling, Biases in Sampling, Measures of Centrality, Measures of Spread, Principles of Visualizations, Histograms, Scatter plots, Pie Charts, Stacked Area Graphs, Box Plots, 3D data)
- Data Engineering (Tabular Data, Pandas and Scraping)
- Exploratory Data Analysis and Effective Data Visualizations
- Code Camp-1 (Web Scraping and EDA: Scraping Wikipedia Billboard for Top 100: Homework 1) https://canvas.harvard.edu/courses/29726/assignments/167915
- Introduction to Regression (Statistical modeling, predicting a variable, Regression Vs Classification, Error, Loss functions, Line of Best Fit, K nearest neighbours)
- Linear Regression (Linear Regression, Comparing Models, Evaluating Model, Model Uncertainty, Bootstrapping for estimating sampling error, Model Fitness R^2, Training Vs Testing sets)
- Multiple Linear Regression - 1 (Multiple Linear Regression, Evaluating Significance of Predictors, Hypothesis testing, R^2, Information Criteria, AIC/BIC)
- Multiple Linear Regression - 2 (Comparing parametric and nonparametric models, Multiple Linear Regression with Interaction Terms, Polynomial Regression)
- Code Camp-2 (Multiple Linear Regression: Forecasting Bike Sharing Usage: Homework 2) https://canvas.harvard.edu/courses/29726/assignments/172398
- Model Selection (overfitting, model selection, variable selection, Cross Validation)
- Regularization (Bias Vs Variance, Regularization: LASSO and Ridge)
- PCA (High Dimensionality, Dimensionality Reduction, PCA, Using PCA for Regression)
- Visualization for Communication (Visualization Goals, Effective Visualizations, Tools for interactive graphics, Structure of Communicative Graphics, Application to modeling)
- Code Camp-3 (Model Selection, Regularization, PCA: Forecasting Bike Sharing Usage: Homework 3) https://canvas.harvard.edu/courses/29726/assignments/172400
- Classification, Logistic Regression - 1 (Classification, Binary Response & Logistic Regression, Bayes classifier, Model Diagnostics in Logistic Regression, Multiple Logistic Regression, Classification Boundaries)
- Classification, Logistic Regression - 2 (Multiple Logistic Regression, Classification Boundaries, ROC curve, k-NN for Classification, Choice of k, k-NN with Multiple Predictors)
- Missing Data (Dealing with Missing Data, Naively handling missingness, Types of Missingness, Sources of Missingness, Imputation Methods, Handling missing data)
- Decision Trees (Geometry of Data, Interpretable Models, Decision Trees, Numerical vs Categorical Attributes, Splitting Criteria, Gini Index, Information Theory, Entropy, Stopping Conditions & Pruning)
- Code Camp-4 (Logistic Regression, ROC, Data Imputation: Automated Breast Cancer Detection: Homework 4) https://canvas.harvard.edu/courses/29726/assignments/175288
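
As a small taste of the Decision Trees unit listed above, the two splitting criteria it names (Gini Index and Entropy) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not course material: the function names and the toy labels are assumptions made up for this example, and the course itself may use library implementations instead.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of samples that falls in each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_index(labels):
    """Gini impurity: 1 - sum(p_k^2). It is 0 for a pure node."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)). It is 0 for a pure node."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def weighted_impurity(left, right, impurity=gini_index):
    """Size-weighted impurity of a two-way split; lower means a cleaner split."""
    n = len(left) + len(right)
    return (len(left) * impurity(left) + len(right) * impurity(right)) / n


# Hypothetical toy labels, used only to exercise the functions.
mixed = ["yes", "yes", "no", "no"]
print(gini_index(mixed))
print(entropy(mixed))
print(weighted_impurity(["yes", "yes"], ["no", "no"]))
```

A tree-building procedure would typically evaluate `weighted_impurity` for each candidate split and keep the one with the lowest value; passing `impurity=entropy` switches the criterion without changing the rest of the code.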