Big Data Using Apache Spark
### COURSE CONTENT
Week 1: Big Data and Data Science
Introduction to Big Data and Data Science - learn about big data and see examples of how data science can leverage big data
Performing Data Science and Preparing Data - explore data science definitions and topics, and the process of preparing data
Setting up the Course Software Environment - download and install the course software, run your first Apache Spark notebook, and submit your first assignment
Week 2: Introduction to Apache Spark
Big Data, Hardware Trends, and the History of Apache Spark - discuss big data and hardware trends, and learn about the history of Apache Spark
Spark Essentials - learn about Spark's Resilient Distributed Datasets, transformations, and actions
Lab 1: Learning Apache Spark - perform your first course lab where you will learn about the Spark data model, transformations, and actions, and write a word counting program to count the words in all of Shakespeare's plays
Week 3: Data Management
Semi-Structured Data - explore the concept of semi-structured data and how tabular data is handled in Spark
Structured Data - learn about structured data, the relational data model, SQL, and joins in SQL and Spark
Lab 2: Web Server Log Analysis with Apache Spark - use Spark to explore a NASA Apache web server log in the second course lab
Week 4: Data Quality, Exploratory Data Analysis, and Machine Learning
Data Quality - learn about the challenges of data quality and cleaning
Exploratory Data Analysis - understand the statistics of Exploratory Data Analysis and data distributions
Machine Learning - learn about Spark's machine learning library, MLlib
Lab 3: Text Analysis and Entity Resolution - perform text analysis and entity resolution on Google and Amazon product listings using Spark in the third course lab
Week 5: Data Management
Lab 4: Introduction to Machine Learning with Apache Spark - use Spark's MLlib machine learning library to perform collaborative filtering on a movie dataset in the fourth course lab
The US National Institute of Standards and Technology has an excellent primer on Exploratory Data Analysis.
The five-number summary is a [descriptive statistic](https://en.wikipedia.org/wiki/Descriptive_statistics) that provides information about a set of observations. It consists of the five most important sample percentiles:
- The sample minimum (smallest observation)
- The lower quartile or first quartile
- The median (middle value)
- The upper quartile or third quartile
- The sample maximum (largest observation)
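
The five values listed above can be computed with a few lines of plain Python. This is a minimal sketch, not part of the course material: the function name `five_number_summary` and the sample data are made up for illustration, and it assumes the "inclusive" quartile convention of the standard library's `statistics.quantiles`. Other tools may use a different quartile convention and return slightly different Q1 and Q3 values.

```python
from statistics import quantiles


def five_number_summary(observations):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    data = sorted(observations)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # n=4 splits the data into four equal-probability groups and returns the
    # three cut points between them: Q1, the median, and Q3.
    # method="inclusive" is an assumption of this sketch: it interpolates
    # linearly between the sample minimum and the sample maximum.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


# Hypothetical sample data, used only to show the call.
summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```

The quartiles come back as floats because they may be interpolated between two neighboring observations, while the minimum and maximum are always actual observations.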
You can compare the five-number summaries of several sets of observations using a box plot.