BigDataSpark

Big Data Using Apache Spark

### COURSE CONTENT

Week 1: Big Data and Data Science

    Introduction to Big Data and Data Science - learn about big data and see examples of how data science can leverage big data
    Performing Data Science and Preparing Data - explore data science definitions and topics, and the process of preparing data
    Setting up the Course Software Environment - download and install the course software, run your first Apache Spark notebook, and submit your first assignment

Week 2: Introduction to Apache Spark

    Big Data, Hardware Trends, and the History of Apache Spark - discuss big data and hardware trends, and learn about the history of Apache Spark
    Spark Essentials - learn about Spark's Resilient Distributed Datasets, transformations, and actions
    Lab 1: Learning Apache Spark - complete the first course lab, where you learn about the Spark data model, transformations, and actions, and write a program that counts the words in all of Shakespeare's plays

Week 3: Data Management

    Semi-Structured Data - explore the concept of semi-structured data and how tabular data is handled in Spark
    Structured Data - learn about structured data, the relational data model, SQL, and joins in SQL and Spark
    Lab 2: Web Server Log Analysis with Apache Spark - use Spark to explore a NASA Apache web server log in the second course lab

Week 4: Data Quality, Exploratory Data Analysis, and Machine Learning

    Data Quality - learn about the challenges of data quality and cleaning
    Exploratory Data Analysis - understand the statistics of Exploratory Data Analysis and data distributions
    Machine Learning - learn about Spark's machine learning library, MLlib
    Lab 3: Text Analysis and Entity Resolution - perform text analysis and entity resolution on Google and Amazon product listings using Spark in the third course lab

Week 5: Data Management

    Lab 4: Introduction to Machine Learning with Apache Spark - use Spark's MLlib machine learning library to perform collaborative filtering on a movie dataset in the fourth course lab

Useful Links:

The US National Institute of Standards and Technology has an excellent primer on Exploratory Data Analysis.

The five-number summary is a [descriptive statistic](https://en.wikipedia.org/wiki/Descriptive_statistics) that provides information about a set of observations. It consists of the five most important sample percentiles:

The sample minimum (smallest observation)
The lower quartile or first quartile
The median (middle value)
The upper quartile or third quartile
The sample maximum (largest observation)
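
As a rough illustration of the five values listed above, here is a minimal Python sketch using only the standard library. The function name `five_number_summary` is hypothetical, and the choice of `method="inclusive"` is an assumption: several quartile conventions exist, and other tools may return slightly different first and third quartiles for the same data.

```python
from statistics import quantiles


def five_number_summary(observations):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers.

    Hypothetical helper for illustration; not part of the course material.
    """
    data = sorted(observations)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # quantiles(..., n=4) returns the three quartile cut points.
    # Assumption: method="inclusive" treats the data as including the
    # population's minimum and maximum and interpolates linearly.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


if __name__ == "__main__":
    print(five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6]))
    # prints (1, 3.0, 5.0, 7.0, 9)
```

The same five values are what a box plot draws: the box spans the first to third quartile, the line inside marks the median, and the whiskers commonly extend toward the minimum and maximum.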

You can compare the five-number summaries of multiple sets of observations using a box plot.
