12 weeks, 2 hours per week
20 min per episode, so six episodes per week.
This course will cover:
***** Spark MLlib
**** ML Pipeline and GraphX
*** Spark Core and Spark SQL
** Spark Streaming
* Scikit-learn for reference.
- Advanced Analytics with Spark
- Machine Learning with Spark
- The LION Way: Machine Learning plus Intelligent Optimization
- Others...
- Spark ABC
- Machine learning ABC
- Graph Computing ABC
- Demos for Spark, MLlib, and GraphX
- Logistic regression
- Linear regression
- SVM
- LASSO
- Ridge regression
- Applied demos such as handwritten digit recognition
- Recommendation ALS
- Singular Value Decomposition
- The implementation in both MLlib and Mahout
- Applied demo of recommendation with PredictionIO.
- k-means
- LDA
- Applied demo of geo-location clustering and topic modeling
- Lambda Architecture
- Parameter Server
- Several algorithms from the Freeman Lab
- Applied demo such as the zebrafish experiment
- Pipeline of Scikit-learn
- Pipeline of Spark (DataFrame, ML Pipeline, etc.)
- Applied demo (TBD)
- Scientific computing and notes on matrix computation
- Matrix libs (in C/Fortran and Java)
- Matrix in MLlib
- Applied demo (TBD)
- Graph computing and libs
- Revisit LDA and ALS
- Applied demo, such as community detection for a food network or recommendation
- Tree model
- Random forest
- Ensemble in Kaggle and practice
- Applied demo for ensemble
- Evaluation methods
- Implementations in MLlib
- Online / Offline evaluations
- Commonly used optimization algorithms
- The lineage of optimization algorithms
- BSP model to BSP+ model to SSP
- Future ways?
- One, two, three of practical ML
- Rethinking practical machine learning
- How to build a great machine learning system?
- Compare with Mahout / Oryx2 / VM / ...
| Chapter | Topic | Algorithms | Dataset | Source |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| 2 | Record Linkage | Entity resolution, record dedup, merge-and-purge, list washing | Some business data such as TPC-DS | UCI ML repo |
| 3 | Recommending | ALS | Who plays what or who rates what | Audioscrobbler |
| 4 | Predicting Forest Cover | Decision Tree | The type of forest covering parcels of land in Colorado | UCI ML repo |
| 5 | Anomaly Detection in Network Traffic | K-means | Network intrusion data | KDD Cup 1999 Dataset |
| 6 | Understanding Wikipedia | Latent Semantic Analysis, SVD, TF-IDF, etc. | Wikipedia texts | Wikipedia |
| 7 | Analyzing Co-occurrence Networks | Massive graph algorithms in GraphX | MEDLINE citation index | US National Library of Medicine |
| 8 | Geo and Temporal Data Analysis | Building sessions | New York Taxicab Data | New York City Taxi and Limousine Commission |
| 9 | Estimating Financial Risk | Monte Carlo Simulation | Stock Data | Yahoo! |
| 10 | Analyzing Genomic Data | Massive genome analysis algorithms | Genome data | NCBI |
| 11 | Analyzing Neuroimaging Data | Thunder | Images of zebrafish brains | Thunder repository |
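
As a small taste of the Monte Carlo simulation topic listed for Chapter 9, here is a minimal sketch in plain Python. It is not taken from the course material: the function name `monte_carlo_var`, the normal-return assumption, and the parameter values are all hypothetical choices for illustration, and a real demo would run on Spark over actual stock data.

```python
import random


def monte_carlo_var(mean, stddev, trials, confidence=0.95, seed=42):
    """Estimate one-period value at risk (VaR) by simulation.

    Assumption: returns are independent and normally distributed,
    which is a deliberate simplification for this sketch.
    """
    rng = random.Random(seed)
    # Simulate `trials` one-period returns and sort them from worst to best.
    returns = sorted(rng.gauss(mean, stddev) for _ in range(trials))
    # The VaR threshold sits at the (1 - confidence) quantile of the returns.
    index = int((1.0 - confidence) * trials)
    # Report VaR as a positive loss figure.
    return -returns[index]


# Hypothetical inputs: zero mean return, 2% volatility per period.
var_95 = monte_carlo_var(mean=0.0, stddev=0.02, trials=100_000)
print(f"95% one-period VaR: {var_95:.4f}")
```

Each trial is independent of the others, which is what makes this kind of simulation straightforward to distribute across a cluster.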